From probability to prediction — how Bayes' theorem and MLE become classifiers that learn from data.
You receive an email. Is it spam or not? You have a dataset of 1,000 labeled emails — some spam, some legitimate. Each email has features: does it contain the word "free"? Does it contain "meeting"? Does it have an attachment? Your job is to build a function that takes these features and predicts the label.
This is classification — using training data with feature/label pairs (x, y) to learn a function ŷ = g(x) that predicts labels for new, unseen examples. It is the bread and butter of machine learning.
But here is the key insight of this chapter: classification is just applied probability. The optimal classifier picks the label with the highest posterior probability:
That formula should look familiar. It is Bayes' theorem in action — the same Bayes' theorem from Chapter 3. The entire machinery of probability that we have built — conditional probability, independence, MLE, MAP estimation — now becomes the foundation for making machines learn.
In this chapter, we assume binary classification: labels y ∈ {0, 1} and binary features xj ∈ {0, 1}. This keeps the arithmetic clean while preserving all the important ideas. The extension to more classes and continuous features is straightforward.
| Setup | Details |
|---|---|
| Training data | N pairs (x(i), y(i)) where x is a vector of m binary features |
| Labels | y ∈ {0, 1} (binary classification) |
| Features | xj ∈ {0, 1} for j = 1, ..., m |
| Goal | Learn g(x) = argmaxy P(Y = y | X = x) |
The two algorithms we cover take fundamentally different approaches. Naive Bayes models P(X | Y) and uses Bayes' theorem to flip it around (generative). Logistic Regression models P(Y | X) directly (discriminative). Both are powerful, both are rooted in probability, and both will prepare you for neural networks.
There are two fundamentally different philosophies for building a classifier. To see the difference, consider predicting whether an animal is a dog or a cat based on features like weight and ear shape.
A generative classifier learns what dogs look like and what cats look like separately. It models the joint distribution — P(features | dog) and P(features | cat) — then uses Bayes' theorem to classify. It could even generate synthetic examples of dogs and cats, hence the name.
A discriminative classifier does not bother learning what each class looks like. It directly learns the decision boundary between dogs and cats: P(dog | features). It only cares about the dividing line, not the full picture.
Let's make this concrete with numbers. Suppose we have 100 training emails: 40 spam, 60 not-spam. Each email has two features: contains "free" (X1) and contains "money" (X2).
Generative approach: Estimate P(X1=1 | spam) = 30/40 = 0.75, P(X1=1 | not-spam) = 6/60 = 0.10, etc. Then for a new email with "free" and "money", compute P(spam | features) via Bayes' theorem.
Discriminative approach: Directly find weights θ such that σ(θ0 + θ1x1 + θ2x2) ≈ P(spam | features). No need to model each class separately.
| Property | Generative (Naive Bayes) | Discriminative (Logistic Reg) |
|---|---|---|
| Models | P(X|Y) and P(Y) | P(Y|X) directly |
| Training | Count frequencies (closed form) | Gradient descent (iterative) |
| With little data | Often better (strong prior) | May overfit |
| With lots of data | Limited by independence assumption | Often better |
| Can generate samples? | Yes | No |
Naive Bayes has a beautiful derivation that follows directly from Bayes' theorem. We want the label y that maximizes the posterior:
By Bayes' theorem, this equals:
Since P(X = x) is the same for all values of y, we can drop it:
Now here is the problem: P(X = x | Y = y) involves the joint probability of all m features together. If we have m binary features, there are 2m possible feature vectors. With m = 50 features, that is over 1015 combinations — far more than any training set could cover.
With the independence assumption, P(X | Y) factors beautifully:
So the full prediction rule becomes:
For numerical stability (products of many small probabilities underflow to zero), we take the log:
Training is just counting. Using MLE:
With Laplace smoothing (MAP with uniform prior) to avoid zero probabilities:
That is the entire algorithm. Training is O(N · m) — one pass through the data. Prediction is O(m) — one sum over features. Fast, simple, and surprisingly effective.
Let's build a spam filter from scratch. We have 10 training emails with 3 binary features:
| X1: "free" | X2: "money" | X3: "meeting" | Y: spam? | |
|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 1 (spam) |
| 2 | 1 | 0 | 0 | 1 |
| 3 | 1 | 1 | 0 | 1 |
| 4 | 0 | 1 | 0 | 1 |
| 5 | 0 | 0 | 0 | 0 (ham) |
| 6 | 0 | 0 | 1 | 0 |
| 7 | 1 | 0 | 1 | 0 |
| 8 | 0 | 0 | 1 | 0 |
| 9 | 0 | 0 | 1 | 0 |
| 10 | 0 | 1 | 0 | 0 |
Step 1: Compute priors. Of 10 emails, 4 are spam: P(Y=1) = 4/10 = 0.4. P(Y=0) = 6/10 = 0.6.
Step 2: Compute likelihoods with Laplace smoothing.
Among 4 spam emails: 3 have "free", 3 have "money", 0 have "meeting".
Among 6 ham emails: 1 has "free", 1 has "money", 4 have "meeting".
| Feature | P(Xi=1 | spam), Laplace | P(Xi=1 | ham), Laplace |
|---|---|---|
| X1: "free" | (3+1)/(4+2) = 4/6 = 0.667 | (1+1)/(6+2) = 2/8 = 0.250 |
| X2: "money" | (3+1)/(4+2) = 4/6 = 0.667 | (1+1)/(6+2) = 2/8 = 0.250 |
| X3: "meeting" | (0+1)/(4+2) = 1/6 = 0.167 | (4+1)/(6+2) = 5/8 = 0.625 |
Step 3: Classify a new email with x = [1, 1, 0] ("free" and "money", no "meeting").
Notice how the log scores are just additions — no multiplication of tiny numbers needed. Also notice that Laplace smoothing saved us: without it, P("meeting"=0 | spam) would require P("meeting"=1 | spam) = 0/4, and the complement would be 4/4 = 1.0, which is fine. But if the new email did contain "meeting", the spam probability would be zero, killing the entire product. Laplace smoothing prevents any probability from being exactly 0.
Shows the learned probability parameters for each feature given each class. Toggle Laplace smoothing on/off to see its effect.
Logistic Regression takes the opposite approach from Naive Bayes. Instead of modeling P(X|Y) and using Bayes' theorem, it models P(Y|X) directly. The central assumption is beautifully simple:
where σ(z) = 1 / (1 + e−z) is the sigmoid function. This says: take a weighted sum of the features, push it through the sigmoid, and out comes a probability between 0 and 1.
The name is misleading — this is classification, not regression. "Logistic Classification" would have been a better name, but we are stuck with history.
The answer: Maximum Likelihood Estimation — the exact same MLE you learned in Chapter 14. We find the θ that makes the training data most probable.
Notation: θTx is the dot product (weighted sum):
We always set x0 = 1 so that θ0 acts as a bias term (intercept). This lets the model shift its decision boundary without any features being present.
Here is a concrete example. Suppose we have two features and θ = [−3, 2, 1.5]. For an input x = [1, 1, 0] (bias=1, "free"=1, "money"=0):
Drag the slider to change z. The sigmoid squashes any real number into (0, 1). Notice: σ(0) = 0.5 exactly, large positive z → 1, large negative z → 0.
Each label y(i) is either 0 or 1. We model it as a Bernoulli: Y ~ Bern(p) where p = σ(θTx). A slick way to write the probability of one datapoint:
Check it: when y = 1, this gives σ(θTx). When y = 0, it gives 1 − σ(θTx). Perfect.
The likelihood of all N training examples (assuming independence):
Taking the log (products become sums, much nicer):
This is the log-likelihood, also called the negative cross-entropy loss (up to a sign). Maximizing log-likelihood = minimizing cross-entropy. This is the standard loss function for classification in modern deep learning.
An important property of the sigmoid: its derivative is remarkably clean.
This means the sigmoid is its own derivative (almost). This elegance is why the sigmoid was so popular in early neural networks — backpropagation becomes simple when your activation function has a clean derivative.
The derivative peaks at z = 0 (where σ = 0.5) with value 0.25, and approaches 0 as z → ±∞. This is the vanishing gradient problem — when the model is very confident (large |z|), learning slows to a crawl.
We have the log-likelihood function. We want to find θ that maximizes it. Unlike MLE for simple distributions, there is no closed-form solution — we cannot solve ∂LL/∂θ = 0 analytically. Instead, we use an iterative optimization algorithm: gradient ascent.
First, the gradient. The partial derivative of log-likelihood with respect to each parameter θj turns out to be beautifully simple:
Read this carefully: for each training example, compute the error (true label minus predicted probability), multiply by the feature value, and sum over all examples. That is the gradient.
The gradient ascent update (ascending to maximize LL):
where η (eta) is the learning rate — how big a step we take. Too large and we overshoot; too small and we take forever.
pseudocode # Gradient Ascent for Logistic Regression Initialize θ = [0, 0, ..., 0] Set learning rate η repeat until convergence: for each θj: gradient = 0 for each training example (x(i), y(i)): gradient += (y(i) − σ(θTx(i))) · xj(i) θj = θj + η · gradient
Watch θ converge as gradient ascent finds the best decision boundary. Blue dots are class 0, orange dots are class 1. The curve shows P(Y=1|x). Click Step to advance one gradient update, or Auto to animate.
Let's end with something stunning. You want to generate realistic pictures of trees. You have a dataset of tree photos. How do you teach a computer to create new tree photos that look real but never existed?
The idea behind diffusion models is elegant and deeply probabilistic. It has two phases:
Forward process: At each timestep t, add Gaussian noise to the pixel values:
After enough steps (say T = 10 with 10% noise each), the original image is completely destroyed — every pixel is just Gaussian noise. The image of a tree becomes static on a TV screen.
Reverse process: Here is the beautiful part. If σ2 is small enough, the conditional distribution xt−1 | xt is approximately Gaussian:
The variance σ2 is known (it is the same noise we added). We only need to estimate the mean μt−1(xt). This is where the neural network comes in.
Why is xt−1|xt Gaussian? By Bayes' theorem:
Taking the log and using the fact that the forward process P(xt|xt−1) is Gaussian, plus a Taylor expansion of log P(xt−1) around xt (valid when noise is small), we can complete the square and show the result is Gaussian with mean:
The term ∇x log P(xt) is called the score function — it points toward regions of higher probability. The neural network learns to estimate this score, guiding noisy images back toward realistic ones.
| Component | What it does |
|---|---|
| Forward process | Add known Gaussian noise (destroy information) |
| Neural network | Predict μt−1(xt) (estimate denoised mean) |
| Loss function | MSE between predicted and actual denoised pixels |
| Generation | Sample noise, denoise T times to get a new image |
Time to put it all together. Below is a complete Naive Bayes spam classifier. You can compose an email by toggling features on/off, and watch the classifier compute log-scores for spam and ham in real time.
The model is pre-trained on the 10-email dataset from Chapter 3. Try toggling different feature combinations and see how the classification changes. Pay attention to which features have the strongest pull toward spam.
Toggle email features and watch the Naive Bayes score update live. The bar chart shows log P(class | features) for spam and ham. The classifier picks the higher score.
Notice that each feature contributes additively to the log-score — this is the direct consequence of the Naive Bayes independence assumption. In reality, "free" and "money" appearing together is much more suspicious than either alone, but our model treats them independently. This is the price of the naive assumption, and it is why logistic regression (which can implicitly learn feature interactions via its weights) often outperforms Naive Bayes on large datasets.
This chapter is the capstone of the book. Everything you have learned — counting, probability, conditional probability, Bayes' theorem, random variables, expectation, distributions, joint distributions, estimation — all of it converges into machine learning. Here is the map:
| Earlier concept | How it appears in ML |
|---|---|
| Bayes' theorem (Ch 3) | The entire foundation of Naive Bayes classification |
| Conditional independence (Ch 3) | The "naive" assumption that makes NB tractable |
| MLE (Ch 14) | Training Naive Bayes (counting) and Logistic Regression (gradient ascent) |
| MAP estimation (Ch 14) | Laplace smoothing in Naive Bayes |
| Bernoulli distribution (Ch 7) | Each label in logistic regression is a Bernoulli |
| Gaussian distribution (Ch 8) | Forward and reverse processes in diffusion models |
| KL divergence | Training loss for diffusion models reduces to MSE for Gaussians |
What to study next:
• Neural networks — stack logistic regression units with nonlinearities. CS229 or CS231n at Stanford.
• Support vector machines — a different discriminative approach based on margin maximization.
• Bayesian deep learning — combining the generative philosophy with modern neural networks.
• Diffusion models in depth — the full DDPM framework, score matching, and classifier-free guidance.
You now have the probabilistic foundation for all of machine learning. Everything from logistic regression to GPT-4 builds on the ideas in this book. The math does not change — only the scale.