The sigmoid draws a decision boundary. Least squares fits a line. Together, logistic and linear regression are the foundation of every ML pipeline.
You receive an email. Is it spam? A patient has a tumor. Is it malignant? A house has 3 bedrooms and 2 bathrooms. What is it worth? These are classification and regression — the two pillars of supervised learning.
Linear models answer both: logistic regression for classification, linear regression for prediction. They are simple, interpretable, and often surprisingly effective. Even deep neural networks are built from layers of linear transformations.
Binary classification: given features x, predict y ∈ {0, 1}. We need a probability p(y=1 | x). Raw linear output wTx is unbounded, so we squeeze it through the sigmoid function:
The quantity a = wTx + b is the logit (log-odds). The sigmoid maps it to [0, 1]. When a = 0, the probability is exactly 0.5 — this is the decision boundary.
The likelihood for a single data point is Bernoulli:
The NLL over the dataset is the binary cross-entropy:
This is a smooth, convex function of w — there is a unique global minimum.
The weight vector w defines a hyperplane in feature space. Points on one side are classified as 1, the other as 0. The boundary itself is where σ(wTx + b) = 0.5, which means wTx + b = 0.
In 2D, this is a line. In 3D, a plane. In D dimensions, a hyperplane. The vector w is perpendicular to this boundary, pointing toward the positive class. Its magnitude ||w|| controls how steeply the probability changes as you cross the boundary.
We can handle nonlinearly separable data by applying a feature transformation φ(x). For example, φ(x1, x2) = [1, x1², x2²] transforms a circular boundary into a linear one in the new space. The model is still linear in parameters: wTφ(x).
There is no closed-form solution for logistic regression. We must use iterative optimization. The gradient of the NLL has an elegant form:
where μn = σ(wTxn) is the predicted probability. The term (μn − yn) is the error signal: how far the prediction is from the truth. Each input xn is weighted by its error.
Stochastic gradient descent (SGD) updates weights using one (or a few) examples at a time:
Since the NLL is convex (the Hessian XTSX is positive definite), SGD converges to the global optimum. Newton's method (using the Hessian) converges faster, called iteratively reweighted least squares (IRLS), but costs more per step.
For C > 2 classes, we replace the sigmoid with the softmax function:
where ac = wcTx + bc is the logit for class c. We now have C weight vectors (one per class), arranged as rows of a matrix W.
The loss is the categorical cross-entropy:
where ync is 1 if example n belongs to class c (one-hot encoding). This is still convex in W, so gradient descent finds the global optimum.
| Aspect | Binary (Sigmoid) | Multiclass (Softmax) |
|---|---|---|
| Output | Single probability p ∈ [0,1] | Vector of C probabilities summing to 1 |
| Parameters | One weight vector w | C weight vectors (matrix W) |
| Loss | Binary cross-entropy | Categorical cross-entropy |
| Activation | σ(a) = 1/(1+e−a) | softmax(a)c = eac/∑eak |
Now for prediction of continuous values. We model the output as a Gaussian centered on a linear function of the input:
Maximizing the log-likelihood is equivalent to minimizing the residual sum of squares:
This has a beautiful closed-form solution, the ordinary least squares (OLS) estimate:
Key properties of OLS:
| Property | Details |
|---|---|
| Existence | Always exists (minimizer of a convex function) |
| Uniqueness | Unique when XTX is invertible (N ≥ D) |
| Gauss-Markov | Best Linear Unbiased Estimator (BLUE) among all linear estimators |
| Residual noise | σ̂² = RSS(ŵ) / (N − D) (unbiased) |
When features are correlated or D > N, OLS overfits or is undefined. We need regularization.
Ridge regression (ℓ2 penalty) adds a quadratic penalty on the weights:
The λI term makes XTX + λI always invertible. It is equivalent to MAP estimation with a Gaussian prior p(w) = N(0, σ²/λ I).
Elastic net combines both: λ1||w||1 + λ2||w||²2. This gives sparsity plus grouping of correlated features.
Instead of finding a single ŵ, the Bayesian approach computes the full posterior p(w | D). For linear regression with a Gaussian prior and Gaussian likelihood, the posterior is also Gaussian (conjugacy):
The posterior mean wN is a weighted combination of the prior mean and the OLS estimate. The posterior covariance VN tells us how uncertain we are about each weight.
This is exactly what Gaussian processes (Chapter 17) generalize to infinite-dimensional feature spaces. Bayesian linear regression is a GP with a linear kernel.
Watch logistic regression learn a decision boundary in real time. Click to place positive (teal) and negative (orange) points, then hit Train to see the boundary emerge.
Click to place points (toggle class below). Then click Train. The heatmap shows p(y=1 | x). The line is the decision boundary (p=0.5).
Explore how regularization affects the fitted line. We generate noisy data from a true function and fit linear models with varying regularization.
Data is generated from a degree-3 polynomial with noise. We fit a degree-6 polynomial using ridge regression. Increase λ to see the fit become smoother. Gray dots = data, orange dashed = true function, teal = fitted curve.
Linear models are the building blocks of everything that follows in Murphy's book.
| Concept from this chapter | Where it leads |
|---|---|
| Logistic regression | Output layer of neural networks (Ch 13), GLMs (Ch 12) |
| Softmax | Final layer of classification DNNs, attention mechanisms |
| Cross-entropy loss | Standard training loss for all neural networks |
| Linear regression | Gaussian processes (Ch 17), Bayesian methods |
| Ridge (ℓ2) | Weight decay in neural networks (Ch 13) |
| Lasso (ℓ1) | Sparse models, compressed sensing |
| SGD for logistic regression | Same algorithm trains every DNN |
| Bayesian linear regression | Gaussian processes are the infinite-width limit |
"Essentially, all models are wrong, but some are useful." — George Box