Murphy, Chapters 10–11

Linear Models: Classification & Regression

The sigmoid draws a decision boundary. Least squares fits a line. Together, logistic and linear regression are the foundation of every ML pipeline.

Prerequisites: Probability (Ch 2–3) + Statistics (Ch 4). You need MLE, gradient descent, and regularization.
11
Chapters
5
Simulations
11
Quizzes

Chapter 0: Why Linear Models?

You receive an email. Is it spam? A patient has a tumor. Is it malignant? A house has 3 bedrooms and 2 bathrooms. What is it worth? These are classification and regression — the two pillars of supervised learning.

Linear models answer both: logistic regression for classification, linear regression for prediction. They are simple, interpretable, and often surprisingly effective. Even deep neural networks are built from layers of linear transformations.

The big picture: A linear model computes f(x) = wTx + b. For regression, this is the prediction directly. For classification, we pass it through the sigmoid to get a probability: p(y=1 | x) = σ(wTx + b). The weight vector w tells us the direction of the decision boundary, and its magnitude controls confidence.
Logistic Regression
Binary & multiclass classification via sigmoid/softmax
Linear Regression
OLS, ridge, lasso, closed-form and gradient solutions
Regularization
2 (ridge) vs ℓ1 (lasso) and their Bayesian interpretation
Bayesian Linear Regression
Full posterior over weights, predictive uncertainty
What makes a model "linear"?

Chapter 1: Logistic Regression

Binary classification: given features x, predict y ∈ {0, 1}. We need a probability p(y=1 | x). Raw linear output wTx is unbounded, so we squeeze it through the sigmoid function:

p(y = 1 | x, w) = σ(wTx + b) = 1 / (1 + e−(wTx + b))

The quantity a = wTx + b is the logit (log-odds). The sigmoid maps it to [0, 1]. When a = 0, the probability is exactly 0.5 — this is the decision boundary.

Why sigmoid? If you model the log-odds as a linear function, log(p/(1−p)) = wTx, then inverting gives p = σ(wTx). The sigmoid is not an arbitrary choice — it is the canonical link function for the Bernoulli distribution in the exponential family (Chapter 3).

The likelihood for a single data point is Bernoulli:

p(y | x, w) = μy(1 − μ)1−y  where μ = σ(wTx)

The NLL over the dataset is the binary cross-entropy:

NLL(w) = −(1/N) ∑n [yn log μn + (1−yn) log(1−μn)]

This is a smooth, convex function of w — there is a unique global minimum.

What does the sigmoid function do in logistic regression?

Chapter 2: Decision Boundaries

The weight vector w defines a hyperplane in feature space. Points on one side are classified as 1, the other as 0. The boundary itself is where σ(wTx + b) = 0.5, which means wTx + b = 0.

In 2D, this is a line. In 3D, a plane. In D dimensions, a hyperplane. The vector w is perpendicular to this boundary, pointing toward the positive class. Its magnitude ||w|| controls how steeply the probability changes as you cross the boundary.

Linear separability: If a line (or hyperplane) can perfectly separate the two classes, the data is linearly separable. In that case, the MLE for w diverges to infinity (pushing the sigmoid toward a step function). This is why regularization matters — we need to bound the weights.

We can handle nonlinearly separable data by applying a feature transformation φ(x). For example, φ(x1, x2) = [1, x1², x2²] transforms a circular boundary into a linear one in the new space. The model is still linear in parameters: wTφ(x).

Polynomial expansion: Using φ(x) = [1, x, x², ..., xK] gives a degree-K polynomial decision boundary. Murphy's Figure 10.4 shows that too high a degree leads to overfitting — just like polynomial regression in Chapter 4.
What geometric object is the decision boundary of logistic regression in 2D?

Chapter 3: Training Logistic Regression

There is no closed-form solution for logistic regression. We must use iterative optimization. The gradient of the NLL has an elegant form:

w NLL(w) = (1/N) ∑nn − yn) xn

where μn = σ(wTxn) is the predicted probability. The term (μn − yn) is the error signal: how far the prediction is from the truth. Each input xn is weighted by its error.

Intuition: If μn = 0.9 but yn = 0, the error is +0.9. The gradient pushes w in the direction of −0.9 xn, reducing the score for this point. If μn ≈ yn, the error is near zero, and the gradient ignores that point. This is how the model learns to pay attention to its mistakes.

Stochastic gradient descent (SGD) updates weights using one (or a few) examples at a time:

wt+1 = wt − ηtn − yn) xn

Since the NLL is convex (the Hessian XTSX is positive definite), SGD converges to the global optimum. Newton's method (using the Hessian) converges faster, called iteratively reweighted least squares (IRLS), but costs more per step.

The perceptron: If we replace the sigmoid with a hard threshold, σ(a) → H(a), we get the perceptron (Rosenblatt, 1958). Its update rule is identical to SGD for logistic regression, but with hard 0/1 predictions instead of soft probabilities. The perceptron only converges when the data is linearly separable; logistic regression always converges.
Why does SGD for logistic regression always converge to the global optimum?

Chapter 4: Multiclass & Softmax

For C > 2 classes, we replace the sigmoid with the softmax function:

p(y = c | x, W) = exp(ac) / ∑k=1C exp(ak)

where ac = wcTx + bc is the logit for class c. We now have C weight vectors (one per class), arranged as rows of a matrix W.

Softmax normalizes logits to probabilities. It exponentiates each logit (making them positive), then divides by their sum (making them sum to 1). The class with the largest logit gets the highest probability. Temperature scaling (dividing logits by T before softmax) controls how "peaked" the distribution is.

The loss is the categorical cross-entropy:

NLL(W) = −(1/N) ∑nc ync log p(yn = c | xn)

where ync is 1 if example n belongs to class c (one-hot encoding). This is still convex in W, so gradient descent finds the global optimum.

AspectBinary (Sigmoid)Multiclass (Softmax)
OutputSingle probability p ∈ [0,1]Vector of C probabilities summing to 1
ParametersOne weight vector wC weight vectors (matrix W)
LossBinary cross-entropyCategorical cross-entropy
Activationσ(a) = 1/(1+e−a)softmax(a)c = eac/∑eak
What does the softmax function guarantee about its outputs?

Chapter 5: Linear Regression

Now for prediction of continuous values. We model the output as a Gaussian centered on a linear function of the input:

p(y | x, w, σ²) = N(y | wTx, σ²)

Maximizing the log-likelihood is equivalent to minimizing the residual sum of squares:

RSS(w) = ∑n (yn − wTxn)² = ||Xw − y||²

This has a beautiful closed-form solution, the ordinary least squares (OLS) estimate:

ols = (XTX)−1XTy
The normal equations: Setting ∇wRSS = 0 gives XTXw = XTy. This is a system of D linear equations in D unknowns. The matrix XTX is the Gram matrix (sum of outer products of features). When it is invertible, the solution is unique.

Key properties of OLS:

PropertyDetails
ExistenceAlways exists (minimizer of a convex function)
UniquenessUnique when XTX is invertible (N ≥ D)
Gauss-MarkovBest Linear Unbiased Estimator (BLUE) among all linear estimators
Residual noiseσ̂² = RSS(ŵ) / (N − D) (unbiased)
What loss function does ordinary least squares minimize?

Chapter 6: Ridge & Lasso

When features are correlated or D > N, OLS overfits or is undefined. We need regularization.

Ridge regression (ℓ2 penalty) adds a quadratic penalty on the weights:

ridge = argminw ||Xw − y||² + λ||w||²2 = (XTX + λI)−1XTy

The λI term makes XTX + λI always invertible. It is equivalent to MAP estimation with a Gaussian prior p(w) = N(0, σ²/λ I).

Ridge (ℓ2): Shrinks all weights toward zero but never exactly to zero. Equivalent to a Gaussian prior. Smooth, differentiable. Good when all features are potentially relevant.
Lasso (ℓ1): Drives some weights exactly to zero, performing feature selection. Equivalent to a Laplace prior. Non-differentiable at zero. Good when you suspect only a few features matter.
lasso = argminw ||Xw − y||² + λ ∑d |wd|
Geometric intuition: Ridge constrains w to lie inside a sphere (||w||² ≤ r). Lasso constrains w to lie inside a diamond (||w||1 ≤ r). The diamond has corners on the axes, so the constraint often intersects the loss contours at a corner — where some wd = 0. This is why lasso produces sparse solutions.

Elastic net combines both: λ1||w||1 + λ2||w||²2. This gives sparsity plus grouping of correlated features.

Why does lasso (ℓ1) produce sparse weights but ridge (ℓ2) does not?

Chapter 7: Bayesian Linear Regression

Instead of finding a single ŵ, the Bayesian approach computes the full posterior p(w | D). For linear regression with a Gaussian prior and Gaussian likelihood, the posterior is also Gaussian (conjugacy):

p(w | D) = N(w | wN, VN)
VN = (σ−2 XTX + V0−1)−1
wN = VN−2 XTy + V0−1w0)

The posterior mean wN is a weighted combination of the prior mean and the OLS estimate. The posterior covariance VN tells us how uncertain we are about each weight.

The posterior predictive: For a new input x*, the predictive distribution is also Gaussian: p(y* | x*, D) = N(y* | wNTx*, σ*²), where σ*² = σ² + x*TVNx*. The first term is noise variance. The second is epistemic uncertainty — uncertainty about w. Far from training data, the epistemic term grows and predictions widen.

This is exactly what Gaussian processes (Chapter 17) generalize to infinite-dimensional feature spaces. Bayesian linear regression is a GP with a linear kernel.

What does the epistemic uncertainty term x*TVNx* capture?

Chapter 8: Decision Boundary Explorer

Watch logistic regression learn a decision boundary in real time. Click to place positive (teal) and negative (orange) points, then hit Train to see the boundary emerge.

Logistic Regression: 2D Classifier

Click to place points (toggle class below). Then click Train. The heatmap shows p(y=1 | x). The line is the decision boundary (p=0.5).

0 points
What to observe: With well-separated clusters, the boundary is confident (sharp color transition). With overlapping clusters, it is uncertain (gradual transition). Try placing points in a non-linearly-separable pattern (like a circle) to see the limitation of linear models.
What happens to the logistic regression decision boundary as the weight magnitude ||w|| increases?

Chapter 9: Regression Playground

Explore how regularization affects the fitted line. We generate noisy data from a true function and fit linear models with varying regularization.

Ridge Regression: Regularization Effect

Data is generated from a degree-3 polynomial with noise. We fit a degree-6 polynomial using ridge regression. Increase λ to see the fit become smoother. Gray dots = data, orange dashed = true function, teal = fitted curve.

log10(λ)1.0
Noise σ0.5
What happens when λ is too large in ridge regression?

Chapter 10: Connections

Linear models are the building blocks of everything that follows in Murphy's book.

Concept from this chapterWhere it leads
Logistic regressionOutput layer of neural networks (Ch 13), GLMs (Ch 12)
SoftmaxFinal layer of classification DNNs, attention mechanisms
Cross-entropy lossStandard training loss for all neural networks
Linear regressionGaussian processes (Ch 17), Bayesian methods
Ridge (ℓ2)Weight decay in neural networks (Ch 13)
Lasso (ℓ1)Sparse models, compressed sensing
SGD for logistic regressionSame algorithm trains every DNN
Bayesian linear regressionGaussian processes are the infinite-width limit
What we covered: Logistic regression (binary and multiclass), decision boundaries, sigmoid and softmax, MLE training via SGD and Newton's method, linear regression with OLS, ridge and lasso regularization, and Bayesian linear regression with posterior predictive uncertainty.
What comes next: Chapters 13–15 break the linearity assumption. Neural networks stack linear layers with nonlinear activations to learn arbitrarily complex functions. But the training algorithm (SGD on cross-entropy/MSE) is exactly what we developed here.

"Essentially, all models are wrong, but some are useful." — George Box

How is the softmax layer at the end of a neural network related to logistic regression?