Murphy, Chapters 10–11

Linear Models: Classification & Regression

The sigmoid draws a decision boundary. Least squares fits a line. Together, logistic and linear regression underpin much of modern machine learning.

Prerequisites: Probability (Ch 2–3) + Statistics (Ch 4). You need MLE, gradient descent, and regularization.

Chapter 0: Why Linear Models?

You receive an email. Is it spam? A patient has a tumor. Is it malignant? A house has 3 bedrooms and 2 bathrooms. What is it worth? These are classification and regression — the two pillars of supervised learning.

Linear models answer both: logistic regression for classification, linear regression for prediction. They are simple, interpretable, and often surprisingly effective. Even deep neural networks are built from layers of linear transformations.

The big picture: A linear model computes f(x) = w^Tx + b. For regression, this is the prediction directly. For classification, we pass it through the sigmoid to get a probability: p(y=1 | x) = σ(w^Tx + b). The weight vector w is normal to the decision boundary and so sets its orientation, and its magnitude controls confidence.

Logistic Regression: binary & multiclass classification via sigmoid/softmax
↓
Linear Regression: OLS, ridge, lasso, closed-form and gradient solutions
↓
Regularization: ℓ₂ (ridge) vs ℓ₁ (lasso) and their Bayesian interpretation
↓
Bayesian Linear Regression: full posterior over weights, predictive uncertainty

What makes a model "linear"?

It can only model straight lines
The prediction is a linear function of the parameters (w^Tx + b), though the input features can be nonlinear transforms
It uses linear algebra

Chapter 1: Logistic Regression

Binary classification: given features x, predict y ∈ {0, 1}. We need a probability p(y=1 | x). Raw linear output w^Tx is unbounded, so we squeeze it through the sigmoid function:

p(y = 1 | x, w) = σ(w^Tx + b) = 1 / (1 + e^{−(w^Tx + b)})

The quantity a = w^Tx + b is the logit (log-odds). The sigmoid maps it to the open interval (0, 1). When a = 0, the probability is exactly 0.5; this is the decision boundary.

Why sigmoid? If you model the log-odds as a linear function, log(p/(1−p)) = w^Tx, then inverting gives p = σ(w^Tx). The sigmoid is not an arbitrary choice — it is the canonical link function for the Bernoulli distribution in the exponential family (Chapter 3).
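The inversion between log-odds and probability can be checked numerically. This is a minimal sketch in plain Python; the function names `sigmoid` and `logit` are illustrative choices, not something the source defines.

```python
import math

def sigmoid(a):
    """Logistic function, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)          # a < 0 here, so exp(a) cannot overflow
    return e / (1.0 + e)

def logit(p):
    """Log-odds log(p / (1 - p)), the inverse of the sigmoid for 0 < p < 1."""
    return math.log(p / (1.0 - p))

for a in (-2.0, 0.0, 3.0):
    p = sigmoid(a)
    # logit(sigmoid(a)) recovers a up to floating-point rounding
    print(a, p, logit(p))
```

The two-branch form is one common way to keep the computation stable; a single-expression version works for moderate values of a.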

The likelihood for a single data point is Bernoulli:

p(y | x, w) = μ^y (1 − μ)^{1−y}, where μ = σ(w^Tx)

The NLL over the dataset is the binary cross-entropy:

NLL(w) = −(1/N) ∑_n [y_n log μ_n + (1−y_n) log(1−μ_n)]

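This is a smooth, convex function of w, so any local minimum is a global minimum. The minimizer is unique and finite when the design matrix has full column rank and the classes are not linearly separable (Chapter 2 covers the separable case).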
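To make the binary cross-entropy concrete, here is a small sketch that evaluates it for given labels and predicted probabilities. The function name `binary_nll` and the clipping constant `eps` are assumptions added for illustration; the clipping only guards against log(0).

```python
import math

def binary_nll(y, mu, eps=1e-12):
    """Average binary cross-entropy for labels y in {0, 1} and predictions mu."""
    total = 0.0
    for y_n, mu_n in zip(y, mu):
        mu_n = min(max(mu_n, eps), 1.0 - eps)   # clip so log() stays finite
        total += y_n * math.log(mu_n) + (1 - y_n) * math.log(1.0 - mu_n)
    return -total / len(y)

y = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]
mostly_wrong = [0.1, 0.9, 0.2, 0.8]
print(binary_nll(y, mostly_right))   # small loss
print(binary_nll(y, mostly_wrong))   # much larger loss
```

A prediction of 0.5 for every example gives a loss of log 2, which is a useful baseline when reading training curves.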

What does the sigmoid function do in logistic regression?

It maps the unbounded linear output to a probability in [0, 1]
It makes the model nonlinear in the parameters
It normalizes the input features

Chapter 2: Decision Boundaries

The weight vector w defines a hyperplane in feature space. Points on one side are classified as 1, the other as 0. The boundary itself is where σ(w^Tx + b) = 0.5, which means w^Tx + b = 0.

In 2D, this is a line. In 3D, a plane. In D dimensions, a hyperplane. The vector w is perpendicular to this boundary, pointing toward the positive class. Its magnitude ||w|| controls how steeply the probability changes as you cross the boundary.
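The effect of ||w|| can be seen by scaling w and b by the same factor: the boundary stays where it is, but probabilities near it become more extreme. The weights below are made-up values chosen for illustration.

```python
import math

def prob(w, b, x):
    """p(y=1 | x) = sigmoid(w^T x + b) for a single point x."""
    a = sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-a))

w, b = (1.0, 1.0), -1.0                          # boundary: x1 + x2 = 1
w_big, b_big = tuple(5 * v for v in w), 5 * b    # same boundary, 5x the norm

on_boundary = (0.5, 0.5)
nearby = (1.0, 1.0)
print(prob(w, b, on_boundary), prob(w_big, b_big, on_boundary))  # both 0.5
print(prob(w, b, nearby), prob(w_big, b_big, nearby))            # second is closer to 1
```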

Linear separability: If a line (or hyperplane) can perfectly separate the two classes, the data is linearly separable. In that case, the MLE for w diverges to infinity (pushing the sigmoid toward a step function). This is why regularization matters — we need to bound the weights.

We can handle data that is not linearly separable by applying a feature transformation φ(x). For example, φ(x₁, x₂) = [1, x₁², x₂²] turns a circular boundary centered at the origin into a linear one in the new space. The model is still linear in the parameters: w^Tφ(x).
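As a sketch of the feature map above, the weights w = [1, −1, −1] (a hypothetical choice, not fitted to any data) give a = 1 − x₁² − x₂², so the boundary a = 0 is the unit circle in the original space while remaining a plane in φ-space.

```python
def phi(x1, x2):
    """Feature map from the text: [1, x1^2, x2^2]."""
    return [1.0, x1 ** 2, x2 ** 2]

# Hypothetical weights: a = 1 - x1^2 - x2^2, positive inside the unit circle
w = [1.0, -1.0, -1.0]

def predict(x1, x2):
    a = sum(w_d * f_d for w_d, f_d in zip(w, phi(x1, x2)))
    return 1 if a > 0 else 0

print(predict(0.0, 0.0))    # 1: inside the circle
print(predict(0.5, -0.5))   # 1: inside the circle
print(predict(2.0, 0.0))    # 0: outside the circle
```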

Polynomial expansion: Using φ(x) = [1, x, x², ..., x^K] gives a degree-K polynomial decision boundary. Murphy's Figure 10.4 shows that too high a degree leads to overfitting — just like polynomial regression in Chapter 4.

What geometric object is the decision boundary of logistic regression in 2D?

A straight line where w^Tx + b = 0
A circle
A parabola

Chapter 3: Training Logistic Regression

There is no closed-form solution for logistic regression. We must use iterative optimization. The gradient of the NLL has an elegant form:

∇_w NLL(w) = (1/N) ∑_n (μ_n − y_n) x_n

where μ_n = σ(w^Tx_n) is the predicted probability. The term (μ_n − y_n) is the error signal: how far the prediction is from the truth. Each input x_n is weighted by its error.

Intuition: If μ_n = 0.9 but y_n = 0, the error is +0.9. The gradient contribution is +0.9 x_n, so the descent step moves w in the direction −0.9 x_n, reducing the score for this point. If μ_n ≈ y_n, the error is near zero, and the update barely changes w for that point. This is how the model learns to pay attention to its mistakes.

Stochastic gradient descent (SGD) updates weights using one (or a few) examples at a time:

w_{t+1} = w_t − η_t (μ_n − y_n) x_n

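Since the NLL is convex (its Hessian, (1/N) X^TSX with S = diag(μ_n(1 − μ_n)), is positive semi-definite), SGD with a suitably decaying step size η_t converges to the global optimum whenever a finite one exists. Newton's method, which uses the Hessian and is known in this setting as iteratively reweighted least squares (IRLS), typically converges in fewer iterations but costs more per step.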
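The SGD update can be sketched in a few lines of plain Python. This is an illustrative implementation under simplifying assumptions: a constant learning rate instead of a schedule η_t, a separate bias term b, and a tiny made-up dataset. The names `sgd_logistic` and `predict` are hypothetical.

```python
import math
import random

def sigmoid(a):
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def sgd_logistic(X, y, lr=0.1, epochs=100, seed=0):
    """One-example-at-a-time SGD on the logistic NLL (constant step size)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)                      # visit examples in random order
        for n in order:
            mu = sigmoid(sum(wd * xd for wd, xd in zip(w, X[n])) + b)
            err = mu - y[n]                     # the error signal (mu_n - y_n)
            w = [wd - lr * err * xd for wd, xd in zip(w, X[n])]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wd * xd for wd, xd in zip(w, x)) + b > 0 else 0

# Toy data, invented for this example: class 1 lies in the upper-right quadrant
X = [(-2, -1), (-1, -2), (-1, -1), (1, 1), (1, 2), (2, 1)]
y = [0, 0, 0, 1, 1, 1]
w, b = sgd_logistic(X, y)
print(w, b, [predict(w, b, x) for x in X])
```

Because this toy data is linearly separable, the weights would keep growing with more epochs, which is the divergence described in Chapter 2; adding an ℓ₂ penalty would bound them.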

The perceptron: If we replace the sigmoid with a hard threshold, σ(a) → H(a), we get the perceptron (Rosenblatt, 1958). Its update rule has the same form as SGD for logistic regression, but with hard 0/1 predictions instead of soft probabilities. The perceptron is only guaranteed to converge when the data is linearly separable. Logistic regression has a convex objective for any dataset, although on separable data the unregularized weights keep growing (Chapter 2).
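For comparison, here is a minimal perceptron sketch using the same update form with a hard threshold. The function name, the epoch cap `max_epochs`, the returned epoch count, and the toy data are all assumptions made for illustration.

```python
def perceptron(X, y, lr=1.0, max_epochs=100):
    """Perceptron updates: w <- w - lr * (y_hat - y) * x, applied only on mistakes."""
    w = [0.0] * len(X[0])
    b = 0.0
    for epoch in range(max_epochs):
        mistakes = 0
        for x_n, y_n in zip(X, y):
            a = sum(wd * xd for wd, xd in zip(w, x_n)) + b
            y_hat = 1 if a > 0 else 0           # hard threshold H(a)
            err = y_hat - y_n                   # -1, 0, or +1
            if err != 0:
                w = [wd - lr * err * xd for wd, xd in zip(w, x_n)]
                b -= lr * err
                mistakes += 1
        if mistakes == 0:                       # a full clean pass: converged
            return w, b, epoch + 1
    return w, b, max_epochs                     # cap reached (e.g. non-separable data)

# Toy separable data, invented for this example
X = [(-2, -1), (-1, -2), (-1, -1), (1, 1), (1, 2), (2, 1)]
y = [0, 0, 0, 1, 1, 1]
w, b, epochs_used = perceptron(X, y)
print(w, b, epochs_used)   # [1.0, 1.0] 1.0 2
```

On data that is not separable the loop would simply run until `max_epochs`, which is why the cap is there.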

Why does SGD for logistic regression converge to the global optimum rather than a local one?

Because the NLL is a convex function of w (the Hessian is positive semi-definite)
Because the sigmoid is bounded
Because we use a large learning rate

Chapter 4: Multiclass & Softmax

For C > 2 classes, we replace the sigmoid with the softmax function:

p(y = c | x, W) = exp(a_c) / ∑_{k=1}^{C} exp(a_k)

where a_c = w_c^Tx + b_c is the logit for class c. We now have C weight vectors (one per class), arranged as rows of a matrix W.

Softmax normalizes logits to probabilities. It exponentiates each logit (making them positive), then divides by their sum (making them sum to 1). The class with the largest logit gets the highest probability. Temperature scaling (dividing logits by T before softmax) controls how "peaked" the distribution is.
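A short sketch of softmax with temperature follows. Subtracting the maximum logit before exponentiating is a standard stability trick that does not change the result; the parameter name `T` follows the text, and the example logits are made up.

```python
import math

def softmax(logits, T=1.0):
    """Softmax of logits / T, computed stably by subtracting the max."""
    scaled = [a / T for a in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 5.0):
    # lower T gives a more peaked distribution, higher T a flatter one
    print(T, softmax(logits, T))
```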

The loss is the categorical cross-entropy:

NLL(W) = −(1/N) ∑_n ∑_c y_nc log p(y_n = c | x_n)

where y_nc is 1 if example n belongs to class c (one-hot encoding). This is still convex in W, so gradient descent finds the global optimum.
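The categorical cross-entropy can be evaluated directly from logits. In this sketch the one-hot vector y_n is represented by an integer class index, which selects the same single term from the inner sum; the function names are illustrative.

```python
import math

def log_softmax(logits):
    """log softmax(a)_c = a_c - logsumexp(a), computed stably."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(a - m) for a in logits))
    return [a - lse for a in logits]

def categorical_nll(all_logits, labels):
    """Average cross-entropy; labels[n] is the class index of example n."""
    total = 0.0
    for logits, c in zip(all_logits, labels):
        total -= log_softmax(logits)[c]     # only the true-class term survives
    return total / len(labels)

all_logits = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
print(categorical_nll(all_logits, [0, 1]))   # true class has the largest logit: low loss
print(categorical_nll(all_logits, [2, 0]))   # true class has a small logit: higher loss
```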

Aspect	Binary (Sigmoid)	Multiclass (Softmax)
Output	Single probability p ∈ [0,1]	Vector of C probabilities summing to 1
Parameters	One weight vector w	C weight vectors (matrix W)
Loss	Binary cross-entropy	Categorical cross-entropy
Activation	σ(a) = 1/(1 + e^{−a})	softmax(a)_c = e^{a_c} / ∑_k e^{a_k}

What does the softmax function guarantee about its outputs?

All outputs are non-negative and sum to 1 (a valid probability distribution)
All outputs are between -1 and 1
The largest output is always 1

Chapter 5: Linear Regression

Now for prediction of continuous values. We model the output as a Gaussian centered on a linear function of the input:

p(y | x, w, σ²) = N(y | w^Tx, σ²)

Maximizing the log-likelihood is equivalent to minimizing the residual sum of squares:

RSS(w) = ∑_n (y_n − w^Tx_n)² = ||Xw − y||²

This has a beautiful closed-form solution, the ordinary least squares (OLS) estimate:

ŵ_ols = (X^TX)⁻¹X^Ty

The normal equations: Setting ∇_wRSS = 0 gives X^TXw = X^Ty. This is a system of D linear equations in D unknowns. The matrix X^TX is the Gram matrix (sum of outer products of features). When it is invertible, the solution is unique.
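For a line y ≈ intercept + slope·x, the normal equations are a 2×2 system that can be solved by hand. The sketch below does exactly that in plain Python; in practice one would typically call a linear algebra library instead. The function name `ols_line` and the data are assumptions made for illustration.

```python
def ols_line(xs, ys):
    """Solve the 2x2 normal equations X^T X w = X^T y for y ≈ intercept + slope * x."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X has rows [1, x_n], so X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy]
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular: need at least two distinct x values")
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]        # generated exactly from y = 1 + 2x
print(ols_line(xs, ys))          # (1.0, 2.0)
```

The singular case corresponds to X^TX not being invertible, where the solution is no longer unique.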

Key properties of OLS:

Property	Details
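Existence	A minimizer always exists (the RSS is a convex quadratic that is bounded below)
Uniqueness	Unique when X^TX is invertible, i.e. X has full column rank (this requires N ≥ D)
Gauss-Markov	Best Linear Unbiased Estimator (BLUE): lowest variance among all linear unbiased estimators, assuming zero-mean, uncorrelated, equal-variance noise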
Residual noise	σ̂² = RSS(ŵ) / (N − D) (unbiased)

What loss function does ordinary least squares minimize?

The sum of squared residuals (differences between predicted and actual values)
The sum of absolute residuals
The cross-entropy loss

Chapter 6: Ridge & Lasso

When features are correlated or D > N, OLS overfits or is undefined. We need regularization.

Ridge regression (ℓ₂ penalty) adds a quadratic penalty on the weights:

ŵ_ridge = argmin_w ||Xw − y||² + λ||w||²₂ = (X^TX + λI)⁻¹X^Ty

For any λ > 0, the λI term makes X^TX + λI positive definite and therefore invertible. The ridge estimate is equivalent to MAP estimation with a Gaussian prior p(w) = N(0, (σ²/λ) I).
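With a single feature and no intercept, the ridge formula reduces to a scalar expression, which makes the shrinkage easy to see. This is a simplified sketch of that special case, with made-up data and a hypothetical function name `ridge_1d`.

```python
def ridge_1d(xs, ys, lam):
    """Ridge estimate for y ≈ w * x: the scalar case of (X^T X + lam*I)^{-1} X^T y."""
    sxx = sum(x * x for x in xs)                 # X^T X
    sxy = sum(x * y for x, y in zip(xs, ys))     # X^T y
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                             # exactly y = 2x
for lam in (0.0, 14.0, 1000.0):
    # lam = 0 recovers OLS (w = 2); larger lam shrinks w toward zero
    print(lam, ridge_1d(xs, ys, lam))
```

Note that the estimate gets small as λ grows but never reaches exactly zero.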

Ridge (ℓ₂): Shrinks all weights toward zero but never exactly to zero. Equivalent to a Gaussian prior. Smooth, differentiable. Good when all features are potentially relevant.

Lasso (ℓ₁): Drives some weights exactly to zero, performing feature selection. Equivalent to a Laplace prior. Non-differentiable at zero. Good when you suspect only a few features matter.

ŵ_lasso = argmin_w ||Xw − y||² + λ ∑_d |w_d|

Geometric intuition: Ridge constrains w to lie inside a sphere (||w||² ≤ r). Lasso constrains w to lie inside a diamond (||w||₁ ≤ r). The diamond has corners on the axes, so the constraint often intersects the loss contours at a corner — where some w_d = 0. This is why lasso produces sparse solutions.
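The sparsity effect also shows up algebraically in one dimension. For a single feature with no intercept, minimizing ∑(w·x_n − y_n)² + λ|w| gives a soft-thresholded estimate: the correlation ρ = ∑x_n y_n is shrunk by λ/2 and clipped at zero. The sketch below implements that special case only; the function name and data are illustrative.

```python
import math

def lasso_1d(xs, ys, lam):
    """Minimizer of sum((w*x - y)^2) + lam*|w| for a single feature."""
    sxx = sum(x * x for x in xs)
    rho = sum(x * y for x, y in zip(xs, ys))
    shrunk = max(abs(rho) - lam / 2.0, 0.0)      # soft threshold at lam / 2
    return math.copysign(shrunk, rho) / sxx

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                             # exactly y = 2x
for lam in (0.0, 28.0, 56.0, 100.0):
    # the weight hits exactly zero once lam / 2 >= |rho| = 28
    print(lam, lasso_1d(xs, ys, lam))
```

Compare this with ridge on the same data, where the weight only approaches zero asymptotically.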

Elastic net combines both: λ₁||w||₁ + λ₂||w||²₂. This gives sparsity plus grouping of correlated features.

Why does lasso (ℓ₁) produce sparse weights but ridge (ℓ₂) does not?

The ℓ₁ ball has corners on the axes, so loss contours tend to intersect at points where some weights are exactly zero
Lasso uses a larger regularization parameter
Ridge is non-differentiable

Chapter 7: Bayesian Linear Regression

Instead of finding a single ŵ, the Bayesian approach computes the full posterior p(w | D). For linear regression with a Gaussian prior and Gaussian likelihood, the posterior is also Gaussian (conjugacy):

p(w | D) = N(w | w_N, V_N)

V_N = (σ⁻² X^TX + V₀⁻¹)⁻¹

w_N = V_N(σ⁻² X^Ty + V₀⁻¹w₀)

The posterior mean w_N is a weighted combination of the prior mean and the OLS estimate. The posterior covariance V_N tells us how uncertain we are about each weight.

The posterior predictive: For a new input x_*, the predictive distribution is also Gaussian: p(y_* | x_*, D) = N(y_* | w_N^Tx_*, σ_*²), where σ_*² = σ² + x_*^TV_Nx_*. The first term is noise variance. The second is epistemic uncertainty — uncertainty about w. Far from training data, the epistemic term grows and predictions widen.
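The posterior and predictive formulas can be worked through for a line y = intercept + slope·x, where V_N is a 2×2 matrix that can be inverted by hand. This sketch assumes a zero prior mean and an isotropic prior V₀ = prior_var·I; those choices, the function names, and the data are illustrative assumptions.

```python
def blr_line(xs, ys, noise_var, prior_var):
    """Posterior N(w_N, V_N) over [intercept, slope], prior N(0, prior_var * I)."""
    n = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    # Posterior precision V_N^{-1} = X^T X / sigma^2 + V_0^{-1}
    a11 = n / noise_var + 1.0 / prior_var
    a12 = sx / noise_var
    a22 = sxx / noise_var + 1.0 / prior_var
    det = a11 * a22 - a12 * a12
    V = [[a22 / det, -a12 / det], [-a12 / det, a11 / det]]
    # w_N = V_N (X^T y / sigma^2); the prior-mean term vanishes because it is zero
    r = [sy / noise_var, sxy / noise_var]
    w = [V[0][0] * r[0] + V[0][1] * r[1], V[1][0] * r[0] + V[1][1] * r[1]]
    return w, V

def predictive(x_star, w, V, noise_var):
    """Mean and variance of p(y_* | x_*, D)."""
    f = [1.0, x_star]
    mean = w[0] + w[1] * x_star
    epistemic = sum(f[i] * V[i][j] * f[j] for i in range(2) for j in range(2))
    return mean, noise_var + epistemic           # noise variance + x_*^T V_N x_*

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w_N, V_N = blr_line(xs, ys, noise_var=0.25, prior_var=100.0)
print(w_N)
print(predictive(1.5, w_N, V_N, 0.25))    # inside the data range: narrower
print(predictive(10.0, w_N, V_N, 0.25))   # far from the data: wider
```

With a broad prior like this one, the posterior mean sits very close to the OLS estimate, and the predictive variance grows as x_* moves away from the training inputs.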

This is exactly what Gaussian processes (Chapter 17) generalize to infinite-dimensional feature spaces. Bayesian linear regression is a GP with a linear kernel.

What does the epistemic uncertainty term x_*^TV_Nx_* capture?

Uncertainty about the model parameters, which is large far from training data
Noise in the data
The mean prediction

Chapter 8: Decision Boundary Explorer

Watch logistic regression learn a decision boundary in real time. Click to place positive (teal) and negative (orange) points, then hit Train to see the boundary emerge.

Logistic Regression: 2D Classifier

Click to place points (toggle class below). Then click Train. The heatmap shows p(y=1 | x). The line is the decision boundary (p=0.5).

What to observe: With well-separated clusters, the boundary is confident (sharp color transition). With overlapping clusters, it is uncertain (gradual transition). Try placing points in a non-linearly-separable pattern (like a circle) to see the limitation of linear models.

What happens to the logistic regression decision boundary as the weight magnitude ||w|| increases?

The transition from class 0 to class 1 becomes sharper (more confident predictions)
The boundary moves further from the origin
The boundary becomes curved

Chapter 9: Regression Playground

Explore how regularization affects the fitted line. We generate noisy data from a true function and fit linear models with varying regularization.

Ridge Regression: Regularization Effect

Data is generated from a degree-3 polynomial with noise. We fit a degree-6 polynomial using ridge regression. Increase λ to see the fit become smoother. Gray dots = data, orange dashed = true function, teal = fitted curve.

Controls: log₁₀(λ) slider (shown at 1.0) and noise σ slider (shown at 0.5).

What happens when λ is too large in ridge regression?

The model overfits the training data
The model underfits, producing an overly simple (flat) prediction
The model becomes more flexible

Chapter 10: Connections

Linear models are the building blocks of everything that follows in Murphy's book.

Concept from this chapter	Where it leads
Logistic regression	Output layer of neural networks (Ch 13), GLMs (Ch 12)
Softmax	Final layer of classification DNNs, attention mechanisms
Cross-entropy loss	Standard training loss for neural network classifiers
Linear regression	Gaussian processes (Ch 17), Bayesian methods
Ridge (ℓ₂)	Weight decay in neural networks (Ch 13)
Lasso (ℓ₁)	Sparse models, compressed sensing
SGD for logistic regression	The same algorithm, and variants of it, trains most DNNs
Bayesian linear regression	Gaussian processes (Ch 17) extend it to infinite-dimensional feature spaces

What we covered: Logistic regression (binary and multiclass), decision boundaries, sigmoid and softmax, MLE training via SGD and Newton's method, linear regression with OLS, ridge and lasso regularization, and Bayesian linear regression with posterior predictive uncertainty.

What comes next: Chapters 13–15 break the linearity assumption. Neural networks stack linear layers with nonlinear activations to learn arbitrarily complex functions. But the training algorithm (SGD on cross-entropy/MSE) is exactly what we developed here.

"Essentially, all models are wrong, but some are useful." — George Box

How is the softmax layer at the end of a neural network related to logistic regression?

The softmax output layer is logistic regression applied to the learned features of the hidden layers
They are completely unrelated
Softmax replaces the need for logistic regression