EE269 Lecture 14 — Mert Pilanci, Stanford

Bayes Classifiers & Detection

The provably optimal classifier — and the likelihood ratio test that powers radar, medicine, and spam filters.

Prerequisites: EE269 Lecture 9 (classification basics) + Probability (Bayes' rule). That's it.
8
Chapters
6+
Simulations
0
Assumed Knowledge

Chapter 0: The Classification Problem

A radar station stares into the sky. Every second it receives a signal x — a vector of N samples. Sometimes the signal is pure noise (no aircraft). Sometimes there's a faint echo from a plane buried in that noise. The station must decide: target present or noise only?

Get it wrong in one direction: you scramble jets for nothing (false alarm). Get it wrong in the other: you miss an incoming threat (missed detection). The cost of each mistake is wildly different.

This is the binary classification problem. We have a feature vector x ∈ ℝN and must assign it to one of K classes. A classifier is a function f: ℝN → {1, 2, ..., K}. The question is: what's the best possible classifier?

The central question: Among ALL possible classifiers — every conceivable function from ℝN to {1,...,K} — is there one that makes the fewest mistakes? Yes. It's called the Bayes classifier, and it's provably optimal. No other classifier, no matter how sophisticated, can beat it.

Formal Setup

We model the world probabilistically. A random label y ∈ {1, 2, ..., K} is drawn from a prior distribution πk = P(y = k). Then, given y = k, a random feature vector x is drawn from a class-conditional density gk(x) = p(x | y = k).

The classifier sees x but NOT y. It must guess y. Its risk (probability of error) is:

R(f) = P(f(x) ≠ y)

The Bayes risk R* is the smallest achievable risk over all classifiers:

R* = inff R(f)

No classifier — not neural networks, not SVMs, nothing — can achieve risk lower than R*. It's the fundamental limit set by the overlap between the class distributions.

Two Classes with Overlapping Distributions

Class 1 (teal) and Class 2 (orange) have overlapping densities. The overlap region is where errors are unavoidable. Drag the means apart to reduce the Bayes risk.

Class 2 mean μ2 2.0
Why can't any classifier achieve zero risk when two class distributions overlap?

Chapter 1: Bayes Risk

Let's derive the optimal classifier from scratch. We need to find f*(x) that minimizes R(f) = P(f(x) ≠ y). Start by writing out the risk using the law of total probability.

The Posterior

Given that we observe x, the posterior probability that the true class is k is:

ηk(x) = P(y = k | x) = πk · gk(x) / p(x)

where p(x) = ∑j πj gj(x) is the marginal density of x. This is just Bayes' rule applied to classification.

The three ingredients are:

Prior πk = P(y = k) — how likely is class k before seeing any data?

Class-conditional gk(x) = p(x | y = k) — if the class IS k, how likely is this x?

Posterior ηk(x) = P(y = k | x) — given we saw x, how likely is class k?

Deriving the Bayes Classifier

The probability of error for a classifier f at a specific point x is:

P(error | x) = P(f(x) ≠ y | x) = 1 − ηf(x)(x)

Because if we assign class f(x), the probability that's correct is ηf(x)(x), and the error probability is 1 minus that. To minimize the total risk R(f) = Ex[P(error | x)], we minimize the integrand pointwise. At each x, we should choose the class with the highest posterior:

f*(x) = argmaxk ηk(x) = argmaxk πk · gk(x)

The denominator p(x) is the same for all k, so it doesn't affect the argmax. We only need the numerator πk · gk(x).

The Bayes classifier: At every point x, pick the class k that maximizes πk · gk(x). This is optimal — no other classifier has lower risk. The Bayes risk is R* = Ex[1 − maxk ηk(x)].

Worked Example

Two classes, equal priors (π1 = π2 = 0.5). Class 1: g1(x) = N(0, 1). Class 2: g2(x) = N(2, 1).

At x = 0.5:

• π1 g1(0.5) = 0.5 × 0.352 = 0.176

• π2 g2(0.5) = 0.5 × 0.130 = 0.065

• Class 1 wins. f*(0.5) = 1.

At x = 1.5:

• π1 g1(1.5) = 0.5 × 0.130 = 0.065

• π2 g2(1.5) = 0.5 × 0.352 = 0.176

• Class 2 wins. f*(1.5) = 2.

The boundary is at x = 1.0 (the midpoint of the means, since priors are equal and variances are equal). The Bayes risk here is R* = Φ(-1) ≈ 0.159, where Φ is the standard normal CDF.

Effect of Priors

If priors are unequal, the boundary shifts. With π1 = 0.9 and π2 = 0.1, class 1 gets a "head start" — the boundary moves right (toward μ2), so more of the space is assigned to the more likely class. This makes intuitive sense: if 90% of signals are noise, you should be reluctant to declare a detection.

Bayes Classifier: Posterior vs. Decision Boundary

Two Gaussian classes. The Bayes classifier picks the class with the higher posterior at each x. The decision boundary is where the posteriors cross. Adjust the prior to see the boundary shift.

Prior π1 0.50
Class 2 mean μ2 2.5
If π1 = 0.9 and π2 = 0.1 with equal Gaussian variances, the Bayes decision boundary:

Chapter 2: Signal Detection

Now let's apply the Bayes framework to the radar problem from Chapter 0. This is the signal detection problem — the Gaussian special case that shows up everywhere from radar to medical imaging to communications.

The Model

You receive a vector x = [x[0], x[1], ..., x[N-1]]T ∈ ℝN.

Hypothesis H1 (noise only): x ~ N(0, σ2I). Each sample x[n] is i.i.d. Gaussian noise with mean 0 and variance σ2.

Hypothesis H2 (signal + noise): x ~ N(μ, σ2I). The same noise, but now a known signal μ = [μ[0], ..., μ[N-1]]T is added.

Under H1 (noise only):

g1(x) = (2πσ2)-N/2 exp(−||x||2 / (2σ2))

Under H2 (signal + noise):

g2(x) = (2πσ2)-N/2 exp(−||x − μ||2 / (2σ2))

Worked Example: N = 4 Samples

Let μ = [1, 1, 1, 1]T and σ2 = 1. You observe x = [0.3, 0.8, 1.2, 0.5].

Compute the exponents:

• ||x||2 = 0.09 + 0.64 + 1.44 + 0.25 = 2.42

• ||x − μ||2 = 0.49 + 0.04 + 0.04 + 0.25 = 0.82

The normalization constants are the same for both, so comparing g1(x) vs g2(x) reduces to comparing:

• exp(−2.42/2) = exp(−1.21) = 0.298 (noise hypothesis)

• exp(−0.82/2) = exp(−0.41) = 0.664 (signal hypothesis)

Signal hypothesis wins — the data is much closer to μ than to 0. This makes intuitive sense: the observed values [0.3, 0.8, 1.2, 0.5] are clearly shifted positive, consistent with a signal of μ = [1,1,1,1].

Key insight: The Bayes classifier for Gaussian signal detection is comparing distances: ||x||2 vs ||x − μ||2. Is x closer to the origin (noise) or closer to μ (signal)? The decision boundary is the perpendicular bisector of the line from 0 to μ.

Simplifying the Comparison

Let's expand ||x − μ||2:

||x − μ||2 = ||x||2 − 2xTμ + ||μ||2

Compare g2(x) ≥ g1(x), which means ||x||2 ≥ ||x − μ||2:

||x||2 ≥ ||x||2 − 2xTμ + ||μ||2

The ||x||2 terms cancel!

2xTμ ≥ ||μ||2
xTμ ≥ ||μ||2 / 2

For our constant-signal case μ = [A, A, ..., A]T, this becomes:

A · ∑n x[n] ≥ N · A2 / 2
x̄ = (1/N) ∑n x[n] ≥ A/2
The sufficient statistic: For Gaussian detection with a constant signal μ = A·1, the optimal test only depends on the sample mean x̄. You don't need to look at each individual sample — just their average! The sample mean is a sufficient statistic for this problem. It compresses N numbers into one number with zero loss of information (for this decision).

For our worked example: x̄ = (0.3 + 0.8 + 1.2 + 0.5)/4 = 0.7. Threshold = A/2 = 0.5. Since 0.7 ≥ 0.5, we declare "signal present." Matches our earlier calculation.

Signal Detection in Noise

Received signal (white) vs known signal template μ (orange dashed). The test statistic xTμ compares how well x matches the template. Click "New Sample" to generate random observations.

Signal amplitude A 1.5
Noise σ 1.0
For Gaussian signal detection with μ = [A, A, ..., A]T in N(0, σ2I) noise, the sufficient statistic is:

Chapter 3: Likelihood Ratio Test

The Bayes classifier for two classes reduces to a beautifully simple form: the likelihood ratio test (LRT). Let's derive it.

Derivation

For K = 2 classes, the Bayes classifier assigns x to class 2 when π2 g2(x) > π1 g1(x). Rearranging:

Λ(x) = g2(x) / g1(x) ≥ π1 / π2 = τ

The left side Λ(x) is the likelihood ratio — how many times more likely is x under H2 compared to H1. The right side τ is the threshold, determined by the priors.

The Likelihood Ratio Test: Compute the ratio of how likely x is under each hypothesis. If this ratio exceeds the prior-determined threshold, declare H2. That's it. Every Bayes-optimal binary classifier is a likelihood ratio test.

Log-Likelihood Ratio

In practice, we take the logarithm (since log is monotonic, it doesn't change the decision):

log Λ(x) = log g2(x) − log g1(x) ≥ log τ

This is numerically more stable (avoids multiplying many small probabilities) and turns products into sums.

Gaussian LRT

For our signal detection problem (H1: N(0, σ2I), H2: N(μ, σ2I)):

log Λ(x) = log g2(x) − log g1(x)
= −||x − μ||2/(2σ2) + ||x||2/(2σ2)
= (2xTμ − ||μ||2) / (2σ2)
= (xTμ − ||μ||2/2) / σ2

Since σ2 > 0, dividing by it preserves the inequality. The test becomes:

xTμ ≥ ||μ||2/2 + σ2 log(π12)

With equal priors (π1 = π2), the log term vanishes and we recover xTμ ≥ ||μ||2/2 from Chapter 2.

Worked Example

μ = [2, 0]T, σ2 = 1, π1 = 0.7, π2 = 0.3.

• ||μ||2/2 = 4/2 = 2

• σ2 log(π12) = 1 × log(0.7/0.3) = log(2.33) ≈ 0.847

• Threshold on xTμ: 2 + 0.847 = 2.847

For x = [1.5, 0.3]T: xTμ = 1.5 × 2 + 0.3 × 0 = 3.0 ≥ 2.847. Declare H2 (signal present).

For x = [1.2, −0.5]T: xTμ = 1.2 × 2 + (−0.5) × 0 = 2.4 < 2.847. Declare H1 (noise).

The role of priors: When π1 = 0.7 (noise is more common), the threshold shifts up — you need stronger evidence (higher xTμ) to declare "signal present." The prior encodes your default belief. Unequal priors make the test conservative toward the more likely hypothesis.
Likelihood Ratio Test Visualized

The two Gaussian densities and the likelihood ratio Λ(x). The vertical line shows the threshold τ. Regions left/right of the threshold are decision regions for H1/H2.

Prior π1 0.50
μ2 2.5
The likelihood ratio Λ(x) = g2(x)/g1(x) measures:

Chapter 4: ROC Curves

The Bayes classifier uses a threshold determined by the priors. But what if you don't know the priors? Or what if the costs of different errors are unequal? The ROC curve (Receiver Operating Characteristic) shows the full trade-off between the two types of errors as you sweep the threshold.

Two Types of Errors

In binary detection (H1 = noise, H2 = signal):

Decide H1Decide H2
Truth: H1Correct rejectionFalse alarm (Type I)
Truth: H2Missed detection (Type II)Detection (hit)

Define the key rates:

PFA = P(decide H2 | H1 is true) = probability of false alarm

PD = P(decide H2 | H2 is true) = probability of detection

We want PD high (detect real signals) and PFA low (don't trigger on noise). But they're in tension: lowering the threshold detects more signals but also creates more false alarms.

Constructing the ROC

For the Gaussian LRT with test statistic T(x) = xTμ and threshold γ:

• Under H1: T ~ N(0, σ2||μ||2), so PFA = P(T ≥ γ | H1) = Q(γ / (σ||μ||))

• Under H2: T ~ N(||μ||2, σ2||μ||2), so PD = P(T ≥ γ | H2) = Q((γ − ||μ||2) / (σ||μ||))

where Q(z) = P(Z ≥ z) for Z ~ N(0,1) is the Q-function (complementary CDF).

As γ sweeps from −∞ to +∞:

• γ = −∞: always declare H2, so PFA = PD = 1 (top-right corner)

• γ = +∞: always declare H1, so PFA = PD = 0 (origin)

• Intermediate γ: a curve in the (PFA, PD) plane

Reading an ROC curve: The curve goes from (0,0) to (1,1). A perfect detector hugs the top-left corner (PD = 1, PFA = 0). A random coin flip gives the diagonal line PD = PFA. Any useful detector lies above the diagonal. The area under the ROC curve (AUC) summarizes performance: AUC = 1 is perfect, AUC = 0.5 is random guessing.

SNR and the ROC

The shape of the ROC depends on the signal-to-noise ratio (SNR):

SNR = ||μ||2 / σ2

Higher SNR pushes the ROC curve toward the top-left corner. At infinite SNR, the signal is so strong that you can achieve PD = 1 with PFA = 0. At zero SNR (μ = 0), the curve is the diagonal — the signal is indistinguishable from noise.

For N samples of a constant signal with amplitude A in noise with variance σ2:

SNR = N · A2 / σ2

More samples improve detection linearly! Doubling N is equivalent to doubling A2. This is the processing gain of matched filtering.

Interactive ROC Curve

Drag the threshold slider to move along the ROC curve. Watch how PFA and PD change simultaneously. Adjust SNR to see the curve shape change.

Threshold γ 1.50
SNR (dB) 6.0
On an ROC curve, a random-guessing classifier (coin flip) lies along:

Chapter 5: Neyman-Pearson

The Bayes framework assumes you know the priors π1, π2. But in many applications, the priors are unknown or irrelevant. In radar, you don't care what fraction of time slots contain targets — you care about not missing a target while keeping false alarms manageable.

The Neyman-Pearson approach takes a different perspective: maximize the probability of detection PD subject to a constraint on the false alarm rate PFA ≤ α.

The Neyman-Pearson Lemma

Neyman-Pearson Lemma: Among all tests with PFA ≤ α, the test that maximizes PD is the likelihood ratio test: declare H2 when Λ(x) ≥ τ, where τ is chosen so that PFA = α exactly. In other words: the LRT is not just Bayes-optimal — it's also Neyman-Pearson optimal!

This is a profound result. It says the same test structure (compare the likelihood ratio to a threshold) is optimal under two completely different criteria:

1. Minimizing overall error probability (Bayes, threshold set by priors)

2. Maximizing detection for a fixed false alarm rate (Neyman-Pearson, threshold set by α)

Only the threshold value changes. The sufficient statistic is the same.

Setting the Threshold

For the Gaussian case, the test statistic T = xTμ has distribution N(0, σ2||μ||2) under H1. We need:

PFA = P(T ≥ γ | H1) = Q(γ / (σ||μ||)) = α

So the threshold is:

γ = σ ||μ|| · Q−1(α)

where Q−1 is the inverse Q-function.

Worked Example

α = 0.01 (1% false alarm rate), σ = 1, ||μ|| = 2.

• Q−1(0.01) = 2.326

• γ = 1 × 2 × 2.326 = 4.652

• Resulting PD = Q((4.652 − 4)/(1 × 2)) = Q(0.326) ≈ 0.372

At 1% false alarm rate with SNR = 4, we only detect 37% of targets. Harsh. To improve: increase SNR (more power, more samples, or less noise).

If we relax to α = 0.05:

• γ = 2 × 1.645 = 3.290

• PD = Q((3.290 − 4)/2) = Q(−0.355) = 1 − Q(0.355) ≈ 0.639

Allowing 5% false alarms nearly doubles our detection rate.

The fundamental trade-off: On the ROC curve, the Neyman-Pearson test picks the point where PFA = α. Moving left on the ROC (lower α) decreases both PFA and PD. There's no free lunch — reducing one type of error increases the other. The ROC curve IS the set of achievable (PFA, PD) pairs.

Applications

ApplicationH1H2Typical α
RadarNo targetTarget present10−6
Medical screeningHealthyDisease0.05
Spam filterLegitimateSpam0.01
Particle physicsBackgroundNew particle2.87 × 10−7 (5σ)
Neyman-Pearson: Fix PFA, Maximize PD

Set the desired false alarm rate α. The threshold γ is computed automatically. Watch how PD depends on SNR.

False alarm rate α 0.050
SNR (dB) 6.0
The Neyman-Pearson lemma states that the test maximizing PD for a given PFA ≤ α is:

Chapter 6: Multiple Classes

So far we've focused on binary detection (K = 2). The Bayes classifier generalizes naturally to K > 2 classes. At each point x, pick the class k that maximizes the posterior ηk(x) = πk gk(x) / p(x), or equivalently, the numerator πk gk(x).

Decision Regions

The Bayes classifier partitions the feature space ℝN into K decision regions R1, R2, ..., RK:

Rk = {x : πk gk(x) ≥ πj gj(x) for all j ≠ k}

The boundaries between regions are where two classes tie: πk gk(x) = πj gj(x).

Gaussian Case: K Classes

If each class is Gaussian with mean μk and shared covariance Σ = σ2I:

gk(x) ∝ exp(−||x − μk||2 / (2σ2))

Taking the log of πk gk(x):

log(πk gk(x)) = log πk − ||x − μk||2 / (2σ2)

Expanding the quadratic:

= log πk − ||x||2/(2σ2) + xTμk2 − ||μk||2/(2σ2)

The ||x||2 term is the same for all k, so the classifier becomes:

f*(x) = argmaxk [xTμk2 − ||μk||2/(2σ2) + log πk]

This is linear in x! The decision boundaries between classes are hyperplanes. We'll see in Lecture 15 that this is exactly Linear Discriminant Analysis (LDA).

Key insight: For Gaussian classes with shared covariance, the Bayes classifier has linear decision boundaries. Each boundary is the perpendicular bisector of the line connecting two means (adjusted for priors). No optimization needed — just plug in μk, σ2, and πk.

Worked Example: 3 Classes in 2D

Three classes with equal priors πk = 1/3 and σ2 = 1:

• μ1 = [0, 0]T

• μ2 = [3, 0]T

• μ3 = [1.5, 2.5]T

Boundary between classes 1 and 2: xT1 − μ2) = (||μ1||2 − ||μ2||2)/2

• xT[−3, 0]T = (0 − 9)/2 → −3x1 = −4.5 → x1 = 1.5

A vertical line at x1 = 1.5. Makes sense — it's the midpoint.

Multi-Class Bayes Decision Regions

Three Gaussian classes in 2D with shared covariance σ2I. The decision boundaries are straight lines (hyperplanes in higher dimensions). Click to place a test point and see which class it's assigned to.

Spread σ 1.0
For K Gaussian classes with shared covariance σ2I and equal priors, the Bayes decision boundaries are:

Chapter 7: Mastery

Let's step back and see the full picture of what we've built.

The Bayes Classification Framework

Model
Prior πk, class-conditional gk(x)
Posterior
ηk(x) = πk gk(x) / p(x) via Bayes' rule
Bayes Classifier
f*(x) = argmaxk ηk(x) — minimum risk
Binary: LRT
Λ(x) = g2/g1 vs τ = π12
ROC & Neyman-Pearson
Sweep τ → ROC curve. Fix PFA = α → N-P optimal test

Key Results

ConceptFormulaKey Takeaway
Bayes riskR* = E[1 − maxk ηk(x)]Fundamental limit; no classifier can beat it
LRTΛ(x) ≷ π12All optimal binary tests are LRTs
Gaussian LRTxTμ ≷ ||μ||2/2 + σ2log(π12)Sufficient statistic = matched filter output
ROC(PFA, PD) as γ variesTrade-off curve between error types
Neyman-Pearsonmax PD s.t. PFA ≤ αLRT is optimal; just change γ

What Comes Next

This lecture assumed we KNOW the distributions gk(x). In practice, we estimate them from data. That leads to:

Lecture 15: Stationary signals, autocorrelation, and LDA/QDA — the parametric Gaussian classifiers

Lecture 16: Fisher's Linear Discriminant — finding the best projection without assuming Gaussianity

Limitations

Requires known distributions. Real-world distributions are unknown. LDA/QDA estimate them; non-parametric methods (k-NN, kernels) avoid distributional assumptions altogether.

Curse of dimensionality. In high dimensions, density estimation becomes exponentially hard. This motivates dimensionality reduction (PCA, Fisher) before classification.

0-1 loss only. We derived for equal-cost errors. Weighted losses lead to cost-sensitive Bayes classifiers with modified thresholds.

"All models are wrong, but some are useful." — George Box. The Bayes classifier is optimal given the model. If your model (priors + class-conditionals) is wrong, the Bayes classifier for that model won't be the true Bayes classifier. The art of classification is choosing a model that's wrong in the right ways.
The Bayes classifier and the Neyman-Pearson optimal test are both likelihood ratio tests. What differs?