The provably optimal classifier — and the likelihood ratio test that powers radar, medicine, and spam filters.
A radar station stares into the sky. Every second it receives a signal x — a vector of N samples. Sometimes the signal is pure noise (no aircraft). Sometimes there's a faint echo from a plane buried in that noise. The station must decide: target present or noise only?
Get it wrong in one direction: you scramble jets for nothing (false alarm). Get it wrong in the other: you miss an incoming threat (missed detection). The cost of each mistake is wildly different.
This is the binary classification problem. We have a feature vector x ∈ ℝN and must assign it to one of K classes. A classifier is a function f: ℝN → {1, 2, ..., K}. The question is: what's the best possible classifier?
We model the world probabilistically. A random label y ∈ {1, 2, ..., K} is drawn from a prior distribution πk = P(y = k). Then, given y = k, a random feature vector x is drawn from a class-conditional density gk(x) = p(x | y = k).
The classifier sees x but NOT y. It must guess y. Its risk (probability of error) is:
The Bayes risk R* is the smallest achievable risk over all classifiers:
No classifier — not neural networks, not SVMs, nothing — can achieve risk lower than R*. It's the fundamental limit set by the overlap between the class distributions.
Class 1 (teal) and Class 2 (orange) have overlapping densities. The overlap region is where errors are unavoidable. Drag the means apart to reduce the Bayes risk.
Let's derive the optimal classifier from scratch. We need to find f*(x) that minimizes R(f) = P(f(x) ≠ y). Start by writing out the risk using the law of total probability.
Given that we observe x, the posterior probability that the true class is k is:
where p(x) = ∑j πj gj(x) is the marginal density of x. This is just Bayes' rule applied to classification.
The three ingredients are:
• Prior πk = P(y = k) — how likely is class k before seeing any data?
• Class-conditional gk(x) = p(x | y = k) — if the class IS k, how likely is this x?
• Posterior ηk(x) = P(y = k | x) — given we saw x, how likely is class k?
The probability of error for a classifier f at a specific point x is:
Because if we assign class f(x), the probability that's correct is ηf(x)(x), and the error probability is 1 minus that. To minimize the total risk R(f) = Ex[P(error | x)], we minimize the integrand pointwise. At each x, we should choose the class with the highest posterior:
The denominator p(x) is the same for all k, so it doesn't affect the argmax. We only need the numerator πk · gk(x).
Two classes, equal priors (π1 = π2 = 0.5). Class 1: g1(x) = N(0, 1). Class 2: g2(x) = N(2, 1).
At x = 0.5:
• π1 g1(0.5) = 0.5 × 0.352 = 0.176
• π2 g2(0.5) = 0.5 × 0.130 = 0.065
• Class 1 wins. f*(0.5) = 1.
At x = 1.5:
• π1 g1(1.5) = 0.5 × 0.130 = 0.065
• π2 g2(1.5) = 0.5 × 0.352 = 0.176
• Class 2 wins. f*(1.5) = 2.
The boundary is at x = 1.0 (the midpoint of the means, since priors are equal and variances are equal). The Bayes risk here is R* = Φ(-1) ≈ 0.159, where Φ is the standard normal CDF.
If priors are unequal, the boundary shifts. With π1 = 0.9 and π2 = 0.1, class 1 gets a "head start" — the boundary moves right (toward μ2), so more of the space is assigned to the more likely class. This makes intuitive sense: if 90% of signals are noise, you should be reluctant to declare a detection.
Two Gaussian classes. The Bayes classifier picks the class with the higher posterior at each x. The decision boundary is where the posteriors cross. Adjust the prior to see the boundary shift.
Now let's apply the Bayes framework to the radar problem from Chapter 0. This is the signal detection problem — the Gaussian special case that shows up everywhere from radar to medical imaging to communications.
You receive a vector x = [x[0], x[1], ..., x[N-1]]T ∈ ℝN.
Hypothesis H1 (noise only): x ~ N(0, σ2I). Each sample x[n] is i.i.d. Gaussian noise with mean 0 and variance σ2.
Hypothesis H2 (signal + noise): x ~ N(μ, σ2I). The same noise, but now a known signal μ = [μ[0], ..., μ[N-1]]T is added.
Under H1 (noise only):
Under H2 (signal + noise):
Let μ = [1, 1, 1, 1]T and σ2 = 1. You observe x = [0.3, 0.8, 1.2, 0.5].
Compute the exponents:
• ||x||2 = 0.09 + 0.64 + 1.44 + 0.25 = 2.42
• ||x − μ||2 = 0.49 + 0.04 + 0.04 + 0.25 = 0.82
The normalization constants are the same for both, so comparing g1(x) vs g2(x) reduces to comparing:
• exp(−2.42/2) = exp(−1.21) = 0.298 (noise hypothesis)
• exp(−0.82/2) = exp(−0.41) = 0.664 (signal hypothesis)
Signal hypothesis wins — the data is much closer to μ than to 0. This makes intuitive sense: the observed values [0.3, 0.8, 1.2, 0.5] are clearly shifted positive, consistent with a signal of μ = [1,1,1,1].
Let's expand ||x − μ||2:
Compare g2(x) ≥ g1(x), which means ||x||2 ≥ ||x − μ||2:
The ||x||2 terms cancel!
For our constant-signal case μ = [A, A, ..., A]T, this becomes:
For our worked example: x̄ = (0.3 + 0.8 + 1.2 + 0.5)/4 = 0.7. Threshold = A/2 = 0.5. Since 0.7 ≥ 0.5, we declare "signal present." Matches our earlier calculation.
Received signal (white) vs known signal template μ (orange dashed). The test statistic xTμ compares how well x matches the template. Click "New Sample" to generate random observations.
The Bayes classifier for two classes reduces to a beautifully simple form: the likelihood ratio test (LRT). Let's derive it.
For K = 2 classes, the Bayes classifier assigns x to class 2 when π2 g2(x) > π1 g1(x). Rearranging:
The left side Λ(x) is the likelihood ratio — how many times more likely is x under H2 compared to H1. The right side τ is the threshold, determined by the priors.
In practice, we take the logarithm (since log is monotonic, it doesn't change the decision):
This is numerically more stable (avoids multiplying many small probabilities) and turns products into sums.
For our signal detection problem (H1: N(0, σ2I), H2: N(μ, σ2I)):
Since σ2 > 0, dividing by it preserves the inequality. The test becomes:
With equal priors (π1 = π2), the log term vanishes and we recover xTμ ≥ ||μ||2/2 from Chapter 2.
μ = [2, 0]T, σ2 = 1, π1 = 0.7, π2 = 0.3.
• ||μ||2/2 = 4/2 = 2
• σ2 log(π1/π2) = 1 × log(0.7/0.3) = log(2.33) ≈ 0.847
• Threshold on xTμ: 2 + 0.847 = 2.847
For x = [1.5, 0.3]T: xTμ = 1.5 × 2 + 0.3 × 0 = 3.0 ≥ 2.847. Declare H2 (signal present).
For x = [1.2, −0.5]T: xTμ = 1.2 × 2 + (−0.5) × 0 = 2.4 < 2.847. Declare H1 (noise).
The two Gaussian densities and the likelihood ratio Λ(x). The vertical line shows the threshold τ. Regions left/right of the threshold are decision regions for H1/H2.
The Bayes classifier uses a threshold determined by the priors. But what if you don't know the priors? Or what if the costs of different errors are unequal? The ROC curve (Receiver Operating Characteristic) shows the full trade-off between the two types of errors as you sweep the threshold.
In binary detection (H1 = noise, H2 = signal):
| Decide H1 | Decide H2 | |
|---|---|---|
| Truth: H1 | Correct rejection | False alarm (Type I) |
| Truth: H2 | Missed detection (Type II) | Detection (hit) |
Define the key rates:
• PFA = P(decide H2 | H1 is true) = probability of false alarm
• PD = P(decide H2 | H2 is true) = probability of detection
We want PD high (detect real signals) and PFA low (don't trigger on noise). But they're in tension: lowering the threshold detects more signals but also creates more false alarms.
For the Gaussian LRT with test statistic T(x) = xTμ and threshold γ:
• Under H1: T ~ N(0, σ2||μ||2), so PFA = P(T ≥ γ | H1) = Q(γ / (σ||μ||))
• Under H2: T ~ N(||μ||2, σ2||μ||2), so PD = P(T ≥ γ | H2) = Q((γ − ||μ||2) / (σ||μ||))
where Q(z) = P(Z ≥ z) for Z ~ N(0,1) is the Q-function (complementary CDF).
As γ sweeps from −∞ to +∞:
• γ = −∞: always declare H2, so PFA = PD = 1 (top-right corner)
• γ = +∞: always declare H1, so PFA = PD = 0 (origin)
• Intermediate γ: a curve in the (PFA, PD) plane
The shape of the ROC depends on the signal-to-noise ratio (SNR):
Higher SNR pushes the ROC curve toward the top-left corner. At infinite SNR, the signal is so strong that you can achieve PD = 1 with PFA = 0. At zero SNR (μ = 0), the curve is the diagonal — the signal is indistinguishable from noise.
For N samples of a constant signal with amplitude A in noise with variance σ2:
More samples improve detection linearly! Doubling N is equivalent to doubling A2. This is the processing gain of matched filtering.
Drag the threshold slider to move along the ROC curve. Watch how PFA and PD change simultaneously. Adjust SNR to see the curve shape change.
The Bayes framework assumes you know the priors π1, π2. But in many applications, the priors are unknown or irrelevant. In radar, you don't care what fraction of time slots contain targets — you care about not missing a target while keeping false alarms manageable.
The Neyman-Pearson approach takes a different perspective: maximize the probability of detection PD subject to a constraint on the false alarm rate PFA ≤ α.
This is a profound result. It says the same test structure (compare the likelihood ratio to a threshold) is optimal under two completely different criteria:
1. Minimizing overall error probability (Bayes, threshold set by priors)
2. Maximizing detection for a fixed false alarm rate (Neyman-Pearson, threshold set by α)
Only the threshold value changes. The sufficient statistic is the same.
For the Gaussian case, the test statistic T = xTμ has distribution N(0, σ2||μ||2) under H1. We need:
So the threshold is:
where Q−1 is the inverse Q-function.
α = 0.01 (1% false alarm rate), σ = 1, ||μ|| = 2.
• Q−1(0.01) = 2.326
• γ = 1 × 2 × 2.326 = 4.652
• Resulting PD = Q((4.652 − 4)/(1 × 2)) = Q(0.326) ≈ 0.372
At 1% false alarm rate with SNR = 4, we only detect 37% of targets. Harsh. To improve: increase SNR (more power, more samples, or less noise).
If we relax to α = 0.05:
• γ = 2 × 1.645 = 3.290
• PD = Q((3.290 − 4)/2) = Q(−0.355) = 1 − Q(0.355) ≈ 0.639
Allowing 5% false alarms nearly doubles our detection rate.
| Application | H1 | H2 | Typical α |
|---|---|---|---|
| Radar | No target | Target present | 10−6 |
| Medical screening | Healthy | Disease | 0.05 |
| Spam filter | Legitimate | Spam | 0.01 |
| Particle physics | Background | New particle | 2.87 × 10−7 (5σ) |
Set the desired false alarm rate α. The threshold γ is computed automatically. Watch how PD depends on SNR.
So far we've focused on binary detection (K = 2). The Bayes classifier generalizes naturally to K > 2 classes. At each point x, pick the class k that maximizes the posterior ηk(x) = πk gk(x) / p(x), or equivalently, the numerator πk gk(x).
The Bayes classifier partitions the feature space ℝN into K decision regions R1, R2, ..., RK:
The boundaries between regions are where two classes tie: πk gk(x) = πj gj(x).
If each class is Gaussian with mean μk and shared covariance Σ = σ2I:
Taking the log of πk gk(x):
Expanding the quadratic:
The ||x||2 term is the same for all k, so the classifier becomes:
This is linear in x! The decision boundaries between classes are hyperplanes. We'll see in Lecture 15 that this is exactly Linear Discriminant Analysis (LDA).
Three classes with equal priors πk = 1/3 and σ2 = 1:
• μ1 = [0, 0]T
• μ2 = [3, 0]T
• μ3 = [1.5, 2.5]T
Boundary between classes 1 and 2: xT(μ1 − μ2) = (||μ1||2 − ||μ2||2)/2
• xT[−3, 0]T = (0 − 9)/2 → −3x1 = −4.5 → x1 = 1.5
A vertical line at x1 = 1.5. Makes sense — it's the midpoint.
Three Gaussian classes in 2D with shared covariance σ2I. The decision boundaries are straight lines (hyperplanes in higher dimensions). Click to place a test point and see which class it's assigned to.
Let's step back and see the full picture of what we've built.
| Concept | Formula | Key Takeaway |
|---|---|---|
| Bayes risk | R* = E[1 − maxk ηk(x)] | Fundamental limit; no classifier can beat it |
| LRT | Λ(x) ≷ π1/π2 | All optimal binary tests are LRTs |
| Gaussian LRT | xTμ ≷ ||μ||2/2 + σ2log(π1/π2) | Sufficient statistic = matched filter output |
| ROC | (PFA, PD) as γ varies | Trade-off curve between error types |
| Neyman-Pearson | max PD s.t. PFA ≤ α | LRT is optimal; just change γ |
This lecture assumed we KNOW the distributions gk(x). In practice, we estimate them from data. That leads to:
• Lecture 15: Stationary signals, autocorrelation, and LDA/QDA — the parametric Gaussian classifiers
• Lecture 16: Fisher's Linear Discriminant — finding the best projection without assuming Gaussianity
• Requires known distributions. Real-world distributions are unknown. LDA/QDA estimate them; non-parametric methods (k-NN, kernels) avoid distributional assumptions altogether.
• Curse of dimensionality. In high dimensions, density estimation becomes exponentially hard. This motivates dimensionality reduction (PCA, Fisher) before classification.
• 0-1 loss only. We derived for equal-cost errors. Weighted losses lead to cost-sensitive Bayes classifiers with modified thresholds.