Piech, Chapter 15

Machine Learning

From probability to prediction — how Bayes' theorem and MLE become classifiers that learn from data.

Prerequisites: Chapter 14 (Estimation). Bayes' theorem and MLE from earlier chapters.
10
Chapters
4
Simulations
10
Quizzes

Chapter 0: Why ML from Probability?

You receive an email. Is it spam or not? You have a dataset of 1,000 labeled emails — some spam, some legitimate. Each email has features: does it contain the word "free"? Does it contain "meeting"? Does it have an attachment? Your job is to build a function that takes these features and predicts the label.

This is classification — using training data with feature/label pairs (x, y) to learn a function ŷ = g(x) that predicts labels for new, unseen examples. It is the bread and butter of machine learning.

But here is the key insight of this chapter: classification is just applied probability. The optimal classifier picks the label with the highest posterior probability:

ŷ = argmaxy P(Y = y | X = x)

That formula should look familiar. It is Bayes' theorem in action — the same Bayes' theorem from Chapter 3. The entire machinery of probability that we have built — conditional probability, independence, MLE, MAP estimation — now becomes the foundation for making machines learn.

The core idea: Machine learning classification is not some alien technology. It is probability estimation with training data. Every concept you have learned so far — Bayes' theorem, independence assumptions, maximum likelihood — directly powers the two most important classifiers: Naive Bayes and Logistic Regression.

In this chapter, we assume binary classification: labels y ∈ {0, 1} and binary features xj ∈ {0, 1}. This keeps the arithmetic clean while preserving all the important ideas. The extension to more classes and continuous features is straightforward.

SetupDetails
Training dataN pairs (x(i), y(i)) where x is a vector of m binary features
Labelsy ∈ {0, 1} (binary classification)
Featuresxj ∈ {0, 1} for j = 1, ..., m
GoalLearn g(x) = argmaxy P(Y = y | X = x)

The two algorithms we cover take fundamentally different approaches. Naive Bayes models P(X | Y) and uses Bayes' theorem to flip it around (generative). Logistic Regression models P(Y | X) directly (discriminative). Both are powerful, both are rooted in probability, and both will prepare you for neural networks.

Check: What does the optimal classifier maximize for each prediction?

Chapter 1: Generative vs Discriminative

There are two fundamentally different philosophies for building a classifier. To see the difference, consider predicting whether an animal is a dog or a cat based on features like weight and ear shape.

A generative classifier learns what dogs look like and what cats look like separately. It models the joint distribution — P(features | dog) and P(features | cat) — then uses Bayes' theorem to classify. It could even generate synthetic examples of dogs and cats, hence the name.

A discriminative classifier does not bother learning what each class looks like. It directly learns the decision boundary between dogs and cats: P(dog | features). It only cares about the dividing line, not the full picture.

Key insight: Generative models the full story (P(X|Y) and P(Y)), then derives the answer via Bayes' theorem. Discriminative models the answer directly (P(Y|X)). Generative gives you more (you can sample from it), but discriminative often gives you better classification accuracy because it focuses all its capacity on the decision boundary.
Generative (Naive Bayes)
Model P(X|Y) and P(Y), then use Bayes: P(Y|X) ∝ P(X|Y)P(Y)
↓ vs
Discriminative (Logistic Regression)
Model P(Y|X) directly via σ(θTx)

Let's make this concrete with numbers. Suppose we have 100 training emails: 40 spam, 60 not-spam. Each email has two features: contains "free" (X1) and contains "money" (X2).

Generative approach: Estimate P(X1=1 | spam) = 30/40 = 0.75, P(X1=1 | not-spam) = 6/60 = 0.10, etc. Then for a new email with "free" and "money", compute P(spam | features) via Bayes' theorem.

Discriminative approach: Directly find weights θ such that σ(θ0 + θ1x1 + θ2x2) ≈ P(spam | features). No need to model each class separately.

PropertyGenerative (Naive Bayes)Discriminative (Logistic Reg)
ModelsP(X|Y) and P(Y)P(Y|X) directly
TrainingCount frequencies (closed form)Gradient descent (iterative)
With little dataOften better (strong prior)May overfit
With lots of dataLimited by independence assumptionOften better
Can generate samples?YesNo
Check: Naive Bayes is called "generative" because it models:

Chapter 2: The Naive Bayes Classifier

Naive Bayes has a beautiful derivation that follows directly from Bayes' theorem. We want the label y that maximizes the posterior:

ŷ = argmaxy P(Y = y | X = x)

By Bayes' theorem, this equals:

ŷ = argmaxy P(Y = y) · P(X = x | Y = y) / P(X = x)

Since P(X = x) is the same for all values of y, we can drop it:

ŷ = argmaxy P(Y = y) · P(X = x | Y = y)

Now here is the problem: P(X = x | Y = y) involves the joint probability of all m features together. If we have m binary features, there are 2m possible feature vectors. With m = 50 features, that is over 1015 combinations — far more than any training set could cover.

The Naive Bayes assumption: All features are conditionally independent given the label. This is almost certainly wrong (the word "free" and the word "money" are not independent in spam), but it lets us factor the joint probability into a product of individual probabilities. Wrong but useful.

With the independence assumption, P(X | Y) factors beautifully:

P(X = x | Y = y) = ∏i=1m P(Xi = xi | Y = y)

So the full prediction rule becomes:

ŷ = argmaxy P(Y = y) ∏i=1m P(Xi = xi | Y = y)

For numerical stability (products of many small probabilities underflow to zero), we take the log:

ŷ = argmaxy [ log P(Y = y) + ∑i=1m log P(Xi = xi | Y = y) ]

Training is just counting. Using MLE:

P̂(Xi = xi | Y = y) = (# examples where Xi = xi and Y = y) / (# examples where Y = y)

With Laplace smoothing (MAP with uniform prior) to avoid zero probabilities:

P̂(Xi = xi | Y = y) = (count + 1) / (class count + 2)

That is the entire algorithm. Training is O(N · m) — one pass through the data. Prediction is O(m) — one sum over features. Fast, simple, and surprisingly effective.

Check: Why do we take the log in the Naive Bayes prediction formula?

Chapter 3: Naive Bayes Example — Spam Filter

Let's build a spam filter from scratch. We have 10 training emails with 3 binary features:

EmailX1: "free"X2: "money"X3: "meeting"Y: spam?
11101 (spam)
21001
31101
40101
50000 (ham)
60010
71010
80010
90010
100100

Step 1: Compute priors. Of 10 emails, 4 are spam: P(Y=1) = 4/10 = 0.4. P(Y=0) = 6/10 = 0.6.

Step 2: Compute likelihoods with Laplace smoothing.

Among 4 spam emails: 3 have "free", 3 have "money", 0 have "meeting".

Among 6 ham emails: 1 has "free", 1 has "money", 4 have "meeting".

FeatureP(Xi=1 | spam), LaplaceP(Xi=1 | ham), Laplace
X1: "free"(3+1)/(4+2) = 4/6 = 0.667(1+1)/(6+2) = 2/8 = 0.250
X2: "money"(3+1)/(4+2) = 4/6 = 0.667(1+1)/(6+2) = 2/8 = 0.250
X3: "meeting"(0+1)/(4+2) = 1/6 = 0.167(4+1)/(6+2) = 5/8 = 0.625

Step 3: Classify a new email with x = [1, 1, 0] ("free" and "money", no "meeting").

Worked arithmetic — log scores:

Spam score: log(0.4) + log(0.667) + log(0.667) + log(1 − 0.167)
= −0.916 + (−0.405) + (−0.405) + (−0.182) = −1.908

Ham score: log(0.6) + log(0.250) + log(0.250) + log(1 − 0.625)
= −0.511 + (−1.386) + (−1.386) + (−0.981) = −4.264

Since −1.908 > −4.264, we predict spam. The words "free" and "money" strongly outweigh the absence of "meeting."

Notice how the log scores are just additions — no multiplication of tiny numbers needed. Also notice that Laplace smoothing saved us: without it, P("meeting"=0 | spam) would require P("meeting"=1 | spam) = 0/4, and the complement would be 4/4 = 1.0, which is fine. But if the new email did contain "meeting", the spam probability would be zero, killing the entire product. Laplace smoothing prevents any probability from being exactly 0.

Naive Bayes Training Visualizer

Shows the learned probability parameters for each feature given each class. Toggle Laplace smoothing on/off to see its effect.

With Laplace smoothing (+1/+2)
Check: In the worked example, what would P("free"=1 | spam) be WITHOUT Laplace smoothing?

Chapter 4: Logistic Regression

Logistic Regression takes the opposite approach from Naive Bayes. Instead of modeling P(X|Y) and using Bayes' theorem, it models P(Y|X) directly. The central assumption is beautifully simple:

P(Y = 1 | X = x) = σ(θTx) = σ(θ0 + θ1x1 + ... + θmxm)

where σ(z) = 1 / (1 + e−z) is the sigmoid function. This says: take a weighted sum of the features, push it through the sigmoid, and out comes a probability between 0 and 1.

The name is misleading — this is classification, not regression. "Logistic Classification" would have been a better name, but we are stuck with history.

The key insight: The "intelligence" of logistic regression lives entirely in the parameter vector θ. With good values of θ, the model produces accurate probabilities. With random values, it is useless. All of machine learning for logistic regression reduces to one question: how do we find good θ?

The answer: Maximum Likelihood Estimation — the exact same MLE you learned in Chapter 14. We find the θ that makes the training data most probable.

Notation: θTx is the dot product (weighted sum):

θTx = θ0x0 + θ1x1 + ... + θmxm

We always set x0 = 1 so that θ0 acts as a bias term (intercept). This lets the model shift its decision boundary without any features being present.

Here is a concrete example. Suppose we have two features and θ = [−3, 2, 1.5]. For an input x = [1, 1, 0] (bias=1, "free"=1, "money"=0):

Worked example:
z = θTx = (−3)(1) + (2)(1) + (1.5)(0) = −3 + 2 + 0 = −1
σ(−1) = 1 / (1 + e1) = 1 / (1 + 2.718) = 1 / 3.718 = 0.269

So P(spam | this email) = 0.269, and we predict not spam (since 0.269 < 0.5). The negative bias θ0 = −3 expresses a prior belief that most emails are not spam.
Sigmoid Function Explorer

Drag the slider to change z. The sigmoid squashes any real number into (0, 1). Notice: σ(0) = 0.5 exactly, large positive z → 1, large negative z → 0.

z0.0
Check: What does logistic regression assume about P(Y=1|X)?

Chapter 5: Sigmoid & Cross-Entropy

Each label y(i) is either 0 or 1. We model it as a Bernoulli: Y ~ Bern(p) where p = σ(θTx). A slick way to write the probability of one datapoint:

P(Y = y | X = x) = σ(θTx)y · [1 − σ(θTx)](1−y)

Check it: when y = 1, this gives σ(θTx). When y = 0, it gives 1 − σ(θTx). Perfect.

The likelihood of all N training examples (assuming independence):

L(θ) = ∏i=1N σ(θTx(i))y(i) · [1 − σ(θTx(i))](1−y(i))

Taking the log (products become sums, much nicer):

LL(θ) = ∑i=1N [ y(i) log σ(θTx(i)) + (1 − y(i)) log(1 − σ(θTx(i))) ]

This is the log-likelihood, also called the negative cross-entropy loss (up to a sign). Maximizing log-likelihood = minimizing cross-entropy. This is the standard loss function for classification in modern deep learning.

Worked example — log-likelihood for 3 datapoints:
Suppose θ = [0, 1] (no bias, one feature), and our data is:
(x=1, y=1), (x=0, y=0), (x=2, y=1)

σ(1) = 0.731, σ(0) = 0.500, σ(2) = 0.881

LL = log(0.731) + log(1 − 0.500) + log(0.881)
= −0.313 + (−0.693) + (−0.127) = −1.133

A perfect model would have LL = 0 (all predictions are 1.0 for correct labels). Our LL of −1.133 says we are decent but not perfect. The second datapoint hurts the most (σ(0) = 0.5 is maximally uncertain).

An important property of the sigmoid: its derivative is remarkably clean.

dσ/dz = σ(z) · (1 − σ(z))

This means the sigmoid is its own derivative (almost). This elegance is why the sigmoid was so popular in early neural networks — backpropagation becomes simple when your activation function has a clean derivative.

The derivative peaks at z = 0 (where σ = 0.5) with value 0.25, and approaches 0 as z → ±∞. This is the vanishing gradient problem — when the model is very confident (large |z|), learning slows to a crawl.

Check: What is dσ/dz evaluated at z = 0?

Chapter 6: Gradient Descent for LR

We have the log-likelihood function. We want to find θ that maximizes it. Unlike MLE for simple distributions, there is no closed-form solution — we cannot solve ∂LL/∂θ = 0 analytically. Instead, we use an iterative optimization algorithm: gradient ascent.

First, the gradient. The partial derivative of log-likelihood with respect to each parameter θj turns out to be beautifully simple:

∂LL / ∂θj = ∑i=1N [ y(i) − σ(θTx(i)) ] · xj(i)

Read this carefully: for each training example, compute the error (true label minus predicted probability), multiply by the feature value, and sum over all examples. That is the gradient.

The gradient ascent update (ascending to maximize LL):

θjnew = θjold + η · ∑i=1N [ y(i) − σ(θTx(i)) ] · xj(i)

where η (eta) is the learning rate — how big a step we take. Too large and we overshoot; too small and we take forever.

Worked gradient step:
Data: (x=[1,1], y=1) and (x=[1,0], y=0). Current θ = [0, 0]. η = 0.1.

Predictions: σ(0·1 + 0·1) = σ(0) = 0.5, σ(0·1 + 0·0) = σ(0) = 0.5.

Errors: (1 − 0.5) = 0.5 for point 1, (0 − 0.5) = −0.5 for point 2.

Gradient for θ0: 0.5 · 1 + (−0.5) · 1 = 0
Gradient for θ1: 0.5 · 1 + (−0.5) · 0 = 0.5

θ0new = 0 + 0.1 · 0 = 0
θ1new = 0 + 0.1 · 0.5 = 0.05

After one step, θ1 increased. This makes sense: x1 = 1 in the positive example, so we increase its weight.
pseudocode
# Gradient Ascent for Logistic Regression
Initialize θ = [0, 0, ..., 0]
Set learning rate η

repeat until convergence:
  for each θj:
    gradient = 0
    for each training example (x(i), y(i)):
      gradient += (y(i) − σ(θTx(i))) · xj(i)
    θj = θj + η · gradient
Gradient Descent for Logistic Regression

Watch θ converge as gradient ascent finds the best decision boundary. Blue dots are class 0, orange dots are class 1. The curve shows P(Y=1|x). Click Step to advance one gradient update, or Auto to animate.

step 0 | θ=[0.00, 0.00]
η0.20
Check: In the gradient formula, (y(i) − σ(θTx(i))) represents:

Chapter 7: Diffusion Models

Let's end with something stunning. You want to generate realistic pictures of trees. You have a dataset of tree photos. How do you teach a computer to create new tree photos that look real but never existed?

The idea behind diffusion models is elegant and deeply probabilistic. It has two phases:

Forward Process
Gradually add Gaussian noise to images until they become pure noise
↓ train a neural net to reverse each step
Reverse Process
Start from pure noise, iteratively denoise to generate a new image

Forward process: At each timestep t, add Gaussian noise to the pixel values:

xt+1 = xt + nt    where nt ~ N(0, σ2)

After enough steps (say T = 10 with 10% noise each), the original image is completely destroyed — every pixel is just Gaussian noise. The image of a tree becomes static on a TV screen.

Reverse process: Here is the beautiful part. If σ2 is small enough, the conditional distribution xt−1 | xt is approximately Gaussian:

xt−1 | xt ~ N(μt−1(xt), σ2)

The variance σ2 is known (it is the same noise we added). We only need to estimate the mean μt−1(xt). This is where the neural network comes in.

The probabilistic magic: Since both the forward and reverse distributions are Gaussian, the KL divergence between the predicted Gaussian and the true Gaussian simplifies to mean squared error. Training a diffusion model = training a denoiser with MSE loss. All the fancy generative modeling reduces to regression.

Why is xt−1|xt Gaussian? By Bayes' theorem:

P(xt−1|xt) = P(xt|xt−1) · P(xt−1) / P(xt)

Taking the log and using the fact that the forward process P(xt|xt−1) is Gaussian, plus a Taylor expansion of log P(xt−1) around xt (valid when noise is small), we can complete the square and show the result is Gaussian with mean:

μt−1 = xt + σ2 · ∇x log P(xt)

The term ∇x log P(xt) is called the score function — it points toward regions of higher probability. The neural network learns to estimate this score, guiding noisy images back toward realistic ones.

Worked numerical example:
Suppose a single pixel has value xt = 0.5, σ2 = 0.01, and the neural net predicts μt−1 = 0.48.

The denoised pixel: xt−1 ~ N(0.48, 0.01).
MSE loss if the true clean pixel was 0.47: (0.48 − 0.47)2 = 0.0001.

After 10 such denoising steps, starting from pure noise xT ~ N(0, 1), the model reconstructs a plausible image pixel by pixel.
ComponentWhat it does
Forward processAdd known Gaussian noise (destroy information)
Neural networkPredict μt−1(xt) (estimate denoised mean)
Loss functionMSE between predicted and actual denoised pixels
GenerationSample noise, denoise T times to get a new image
Check: Why does the diffusion model training loss reduce to MSE?

Chapter 8: Showcase — Interactive Spam Classifier

Time to put it all together. Below is a complete Naive Bayes spam classifier. You can compose an email by toggling features on/off, and watch the classifier compute log-scores for spam and ham in real time.

The model is pre-trained on the 10-email dataset from Chapter 3. Try toggling different feature combinations and see how the classification changes. Pay attention to which features have the strongest pull toward spam.

Interactive Naive Bayes Spam Classifier

Toggle email features and watch the Naive Bayes score update live. The bar chart shows log P(class | features) for spam and ham. The classifier picks the higher score.

Toggle features to classify
Things to try:
• Turn on "free" and "money" together — strong spam signal.
• Turn on "meeting" and "report" — strong ham signal.
• Turn on all five — watch the competing signals.
• Notice the log-score bars: each feature adds or subtracts a fixed amount (that is the conditional independence assumption in action).

Notice that each feature contributes additively to the log-score — this is the direct consequence of the Naive Bayes independence assumption. In reality, "free" and "money" appearing together is much more suspicious than either alone, but our model treats them independently. This is the price of the naive assumption, and it is why logistic regression (which can implicitly learn feature interactions via its weights) often outperforms Naive Bayes on large datasets.

Check: In the interactive demo, why does each feature contribute a fixed additive amount to the log-score regardless of other features?

Chapter 9: Connections

This chapter is the capstone of the book. Everything you have learned — counting, probability, conditional probability, Bayes' theorem, random variables, expectation, distributions, joint distributions, estimation — all of it converges into machine learning. Here is the map:

Earlier conceptHow it appears in ML
Bayes' theorem (Ch 3)The entire foundation of Naive Bayes classification
Conditional independence (Ch 3)The "naive" assumption that makes NB tractable
MLE (Ch 14)Training Naive Bayes (counting) and Logistic Regression (gradient ascent)
MAP estimation (Ch 14)Laplace smoothing in Naive Bayes
Bernoulli distribution (Ch 7)Each label in logistic regression is a Bernoulli
Gaussian distribution (Ch 8)Forward and reverse processes in diffusion models
KL divergenceTraining loss for diffusion models reduces to MSE for Gaussians
The big picture: Naive Bayes and Logistic Regression are not just two algorithms — they represent two fundamental philosophies (generative vs. discriminative) that pervade all of machine learning. Naive Bayes scales to neural generative models (VAEs, diffusion). Logistic regression is literally the building block of every neural network — each neuron is a logistic regression unit. When you stack many of them, you get deep learning.
Naive Bayes
Model P(X|Y) → generative models → VAEs → diffusion → Stable Diffusion, DALL-E
Logistic Regression
Model P(Y|X) → neural networks → deep learning → GPT, AlphaFold
Both use MLE
Find parameters that maximize the probability of training data

What to study next:

Neural networks — stack logistic regression units with nonlinearities. CS229 or CS231n at Stanford.

Support vector machines — a different discriminative approach based on margin maximization.

Bayesian deep learning — combining the generative philosophy with modern neural networks.

Diffusion models in depth — the full DDPM framework, score matching, and classifier-free guidance.

You now have the probabilistic foundation for all of machine learning. Everything from logistic regression to GPT-4 builds on the ideas in this book. The math does not change — only the scale.

Check: Logistic regression is the building block of neural networks because: