EE269 Lecture 14 — Bayes Classifiers & Detection

Chapter 0: The Classification Problem

A radar station stares into the sky. Every second it receives a signal x — a vector of N samples. Sometimes the signal is pure noise (no aircraft). Sometimes there's a faint echo from a plane buried in that noise. The station must decide: target present or noise only?

Get it wrong in one direction: you scramble jets for nothing (false alarm). Get it wrong in the other: you miss an incoming threat (missed detection). The cost of each mistake is wildly different.

This is a classification problem; the radar decision is the binary case (K = 2). In general, we have a feature vector x ∈ ℝ^N and must assign it to one of K classes. A classifier is a function f: ℝ^N → {1, 2, ..., K}. The question is: what's the best possible classifier?

The central question: Among ALL possible classifiers — every conceivable function from ℝ^N to {1,...,K} — is there one that makes the fewest mistakes? Yes. It's called the Bayes classifier, and it's provably optimal. No other classifier, no matter how sophisticated, can beat it.

Formal Setup

We model the world probabilistically. A random label y ∈ {1, 2, ..., K} is drawn from a prior distribution π_k = P(y = k). Then, given y = k, a random feature vector x is drawn from a class-conditional density g_k(x) = p(x | y = k).

The classifier sees x but NOT y. It must guess y. Its risk (probability of error) is:

R(f) = P(f(x) ≠ y)

The Bayes risk R* is the smallest achievable risk over all classifiers:

R* = inf_f R(f)

No classifier — not neural networks, not SVMs, nothing — can achieve risk lower than R*. It's the fundamental limit set by the overlap between the class distributions.

Two Classes with Overlapping Distributions

Figure (interactive): Class 1 (teal) and Class 2 (orange) have overlapping densities. The overlap region is where errors are unavoidable; moving the means apart reduces the Bayes risk. Adjustable parameter: class 2 mean μ₂.

Why can't any classifier achieve zero risk when two class distributions overlap?

• Because classifiers are limited to linear boundaries

• Because in the overlap region, both classes can produce the same x, so any decision will sometimes be wrong

• Because we don't have enough training data

Chapter 1: Bayes Risk

Let's derive the optimal classifier from scratch. We need to find f*(x) that minimizes R(f) = P(f(x) ≠ y). Start by writing out the risk using the law of total probability.

The Posterior

Given that we observe x, the posterior probability that the true class is k is:

η_k(x) = P(y = k | x) = π_k · g_k(x) / p(x)

where p(x) = ∑_j π_j g_j(x) is the marginal density of x. This is just Bayes' rule applied to classification.

The three ingredients are:

• Prior π_k = P(y = k) — how likely is class k before seeing any data?

• Class-conditional g_k(x) = p(x | y = k) — if the class IS k, how likely is this x?

• Posterior η_k(x) = P(y = k | x) — given we saw x, how likely is class k?
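As a minimal sketch of how the three ingredients combine, the following computes the posteriors η_k(x) from priors and class-conditional densities. The helper name `posteriors` and the two unit-variance Gaussian classes are illustrative assumptions, not part of the lecture's setup.

```python
from statistics import NormalDist

def posteriors(x, priors, class_dists):
    """Return [eta_1(x), ..., eta_K(x)] via Bayes' rule."""
    # Numerators pi_k * g_k(x), one per class
    joint = [p * d.pdf(x) for p, d in zip(priors, class_dists)]
    # Marginal density p(x) = sum_j pi_j * g_j(x)
    marginal = sum(joint)
    return [j / marginal for j in joint]

# Assumed example: two unit-variance Gaussian classes with equal priors
priors = [0.5, 0.5]
class_dists = [NormalDist(mu=0.0, sigma=1.0), NormalDist(mu=2.0, sigma=1.0)]

eta = posteriors(0.5, priors, class_dists)
print(eta)       # posterior of class 1 and class 2 at x = 0.5
print(sum(eta))  # the posteriors sum to 1 (up to rounding)
```

Dividing by the marginal only rescales the numerators, which is why the class ranking is already fixed by π_k · g_k(x).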

Deriving the Bayes Classifier

The probability of error for a classifier f at a specific point x is:

P(error | x) = P(f(x) ≠ y | x) = 1 − η_{f(x)}(x)

If we assign class f(x), the probability that this choice is correct is η_{f(x)}(x), so the error probability is 1 minus that. To minimize the total risk R(f) = E_x[P(error | x)], we minimize the integrand pointwise: at each x, choose the class with the highest posterior:

f*(x) = argmax_k η_k(x) = argmax_k π_k · g_k(x)

The denominator p(x) is the same for all k, so it doesn't affect the argmax. We only need the numerator π_k · g_k(x).

The Bayes classifier: At every point x, pick the class k that maximizes π_k · g_k(x). This is optimal — no other classifier has lower risk. The Bayes risk is R* = E_x[1 − max_k η_k(x)].

Worked Example

Two classes, equal priors (π₁ = π₂ = 0.5). Class 1: g₁(x) = N(0, 1). Class 2: g₂(x) = N(2, 1).

At x = 0.5:

• π₁ g₁(0.5) = 0.5 × 0.352 = 0.176

• π₂ g₂(0.5) = 0.5 × 0.130 = 0.065

• Class 1 wins. f*(0.5) = 1.

At x = 1.5:

• π₁ g₁(1.5) = 0.5 × 0.130 = 0.065

• π₂ g₂(1.5) = 0.5 × 0.352 = 0.176

• Class 2 wins. f*(1.5) = 2.

The boundary is at x = 1.0 (the midpoint of the means, since priors are equal and variances are equal). The Bayes risk here is R* = Φ(-1) ≈ 0.159, where Φ is the standard normal CDF.
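The numbers in this worked example can be reproduced with a short script. This is a sketch under the example's own assumptions (equal priors, N(0, 1) vs N(2, 1)); the function name `bayes_decision` and the integration grid are illustrative choices. For two classes, R* = E_x[1 − max_k η_k(x)] equals the integral of min(π₁g₁(x), π₂g₂(x)), which is what the numerical check uses.

```python
from statistics import NormalDist

g1 = NormalDist(mu=0.0, sigma=1.0)   # class 1 density
g2 = NormalDist(mu=2.0, sigma=1.0)   # class 2 density
pi1 = pi2 = 0.5                      # equal priors

def bayes_decision(x):
    """Return 1 or 2, whichever class has the larger pi_k * g_k(x)."""
    return 1 if pi1 * g1.pdf(x) >= pi2 * g2.pdf(x) else 2

for x in (0.5, 1.5):
    print(x, round(pi1 * g1.pdf(x), 3), round(pi2 * g2.pdf(x), 3), bayes_decision(x))

# Closed form: with the boundary at x = 1, each class errs with probability Phi(-1)
closed_form = NormalDist().cdf(-1.0)

# Numerical check: midpoint-rule integral of min(pi1*g1, pi2*g2) over [-8, 10]
step = 0.001
n_steps = 18000
numeric = 0.0
for i in range(n_steps):
    x = -8.0 + (i + 0.5) * step
    numeric += min(pi1 * g1.pdf(x), pi2 * g2.pdf(x)) * step

print(closed_form, numeric)
```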

Effect of Priors

If priors are unequal, the boundary shifts. With π₁ = 0.9 and π₂ = 0.1, class 1 gets a "head start" — the boundary moves right (toward μ₂), so more of the space is assigned to the more likely class. This makes intuitive sense: if 90% of signals are noise, you should be reluctant to declare a detection.
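To see how far the boundary moves, set π₁g₁(x) = π₂g₂(x) for two Gaussians with a common variance σ² and solve for x, which gives x* = (μ₁ + μ₂)/2 + σ² log(π₁/π₂)/(μ₂ − μ₁). The sketch below evaluates this; the function name `boundary` and the choice μ₁ = 0, μ₂ = 2, σ = 1 are assumptions carried over from the worked example.

```python
import math
from statistics import NormalDist

def boundary(mu1, mu2, sigma, pi1):
    """Bayes decision boundary for two equal-variance 1-D Gaussians (mu2 > mu1)."""
    pi2 = 1.0 - pi1
    return 0.5 * (mu1 + mu2) + sigma**2 * math.log(pi1 / pi2) / (mu2 - mu1)

x_equal = boundary(0.0, 2.0, 1.0, 0.5)    # equal priors: the midpoint
x_skewed = boundary(0.0, 2.0, 1.0, 0.9)   # class 1 is nine times more likely

# Sanity check: the weighted densities should tie at the boundary
g1, g2 = NormalDist(0.0, 1.0), NormalDist(2.0, 1.0)
tie_gap = 0.9 * g1.pdf(x_skewed) - 0.1 * g2.pdf(x_skewed)

print(x_equal, x_skewed, tie_gap)
```

With these assumed values the boundary moves from 1.0 to roughly 2.1, past μ₂ itself, so class 2 is declared only for observations above its own mean.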

Bayes Classifier: Posterior vs. Decision Boundary

Figure (interactive): Two Gaussian classes. The Bayes classifier picks the class with the higher posterior at each x, and the decision boundary is where the posteriors cross; changing the prior shifts the boundary. Adjustable parameters: prior π₁, class 2 mean μ₂.

If π₁ = 0.9 and π₂ = 0.1 with equal Gaussian variances, the Bayes decision boundary:

• Stays at the midpoint of the means

• Shifts toward μ₁ (the more likely class)

• Shifts toward μ₂ (the less likely class), requiring stronger evidence to declare class 2

Chapter 2: Signal Detection

Now let's apply the Bayes framework to the radar problem from Chapter 0. This is the signal detection problem — the Gaussian special case that shows up everywhere from radar to medical imaging to communications.

The Model

You receive a vector x = [x[0], x[1], ..., x[N-1]]^T ∈ ℝ^N.

Hypothesis H₁ (noise only): x ~ N(0, σ²I). Each sample x[n] is i.i.d. Gaussian noise with mean 0 and variance σ².

Hypothesis H₂ (signal + noise): x ~ N(μ, σ²I). The same noise, but now a known signal μ = [μ[0], ..., μ[N-1]]^T is added.

Under H₁ (noise only):

g₁(x) = (2πσ²)^(−N/2) exp(−||x||² / (2σ²))

Under H₂ (signal + noise):

g₂(x) = (2πσ²)^(−N/2) exp(−||x − μ||² / (2σ²))

Worked Example: N = 4 Samples

Let μ = [1, 1, 1, 1]^T and σ² = 1. You observe x = [0.3, 0.8, 1.2, 0.5].

Compute the exponents:

• ||x||² = 0.09 + 0.64 + 1.44 + 0.25 = 2.42

• ||x − μ||² = 0.49 + 0.04 + 0.04 + 0.25 = 0.82

The normalization constants are the same for both, so comparing g₁(x) vs g₂(x) reduces to comparing:

• exp(−2.42/2) = exp(−1.21) = 0.298 (noise hypothesis)

• exp(−0.82/2) = exp(−0.41) = 0.664 (signal hypothesis)

Signal hypothesis wins — the data is much closer to μ than to 0. This makes intuitive sense: the observed values [0.3, 0.8, 1.2, 0.5] are clearly shifted positive, consistent with a signal of μ = [1,1,1,1].
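The arithmetic above is easy to check by machine. The following sketch recomputes both squared distances and the unnormalized likelihoods for the same x, μ, and σ²; the helper `sq_norm` and the variable names are illustrative.

```python
import math

mu = [1.0, 1.0, 1.0, 1.0]       # known signal
sigma2 = 1.0                    # noise variance
x = [0.3, 0.8, 1.2, 0.5]        # observation

def sq_norm(v):
    """Squared Euclidean norm of a list of numbers."""
    return sum(c * c for c in v)

d_noise = sq_norm(x)                                     # ||x||^2
d_signal = sq_norm([xi - mi for xi, mi in zip(x, mu)])   # ||x - mu||^2

# The shared factor (2*pi*sigma2)^(-N/2) cancels, so compare the exponentials only
score_noise = math.exp(-d_noise / (2 * sigma2))
score_signal = math.exp(-d_signal / (2 * sigma2))

decision = "signal" if score_signal >= score_noise else "noise"
print(d_noise, d_signal, score_noise, score_signal, decision)
```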

Key insight: With equal priors, the Bayes classifier for Gaussian signal detection compares distances: ||x||² vs ||x − μ||². Is x closer to the origin (noise) or closer to μ (signal)? The decision boundary is the perpendicular bisector of the line from 0 to μ.

Simplifying the Comparison

Let's expand ||x − μ||²:

||x − μ||² = ||x||² − 2x^Tμ + ||μ||²

Compare g₂(x) ≥ g₁(x), which means ||x||² ≥ ||x − μ||²:

||x||² ≥ ||x||² − 2x^Tμ + ||μ||²

The ||x||² terms cancel!

2x^Tμ ≥ ||μ||²

x^Tμ ≥ ||μ||² / 2

For our constant-signal case μ = [A, A, ..., A]^T with A > 0 (so that dividing by A keeps the direction of the inequality), this becomes:

A · ∑_n x[n] ≥ N · A² / 2

x̄ = (1/N) ∑_n x[n] ≥ A/2

The sufficient statistic: For Gaussian detection with a constant signal μ = A·1, the optimal test only depends on the sample mean x̄. You don't need to look at each individual sample — just their average! The sample mean is a sufficient statistic for this problem. It compresses N numbers into one number with zero loss of information (for this decision).

For our worked example: x̄ = (0.3 + 0.8 + 1.2 + 0.5)/4 = 0.7. Threshold = A/2 = 0.5. Since 0.7 ≥ 0.5, we declare "signal present." Matches our earlier calculation.
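As a quick check that the sample-mean test and the distance test really are the same rule, the sketch below applies both to the worked example and to random observations. The function names, the seed, and the distribution used to draw test vectors are illustrative assumptions; the equivalence relies on A > 0.

```python
import random

def decide_by_distance(x, A):
    """Declare signal (True) when x is at least as close to mu = A*1 as to 0."""
    d0 = sum(v * v for v in x)           # ||x||^2
    d1 = sum((v - A) ** 2 for v in x)    # ||x - mu||^2
    return d0 >= d1

def decide_by_mean(x, A):
    """Declare signal (True) when the sample mean reaches A/2 (assumes A > 0)."""
    return sum(x) / len(x) >= A / 2

x = [0.3, 0.8, 1.2, 0.5]
print(decide_by_distance(x, 1.0), decide_by_mean(x, 1.0))

# Compare the two rules on 1000 random length-4 observations
rng = random.Random(0)
trials = [[rng.gauss(0.5, 1.0) for _ in range(4)] for _ in range(1000)]
agree = all(decide_by_distance(v, 1.0) == decide_by_mean(v, 1.0) for v in trials)
print(agree)
```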

Signal Detection in Noise

Figure (interactive): Received signal (white) vs. known signal template μ (orange dashed). The test statistic x^Tμ measures how well x matches the template. Adjustable parameters: signal amplitude A, noise σ.

For Gaussian signal detection with μ = [A, A, ..., A]^T in N(0, σ²I) noise, the sufficient statistic is:

• The maximum sample value max(x[n])

• The sample mean x̄ = (1/N)∑x[n]

• The sample variance

Chapter 3: Likelihood Ratio Test

The Bayes classifier for two classes reduces to a beautifully simple form: the likelihood ratio test (LRT). Let's derive it.

Derivation

For K = 2 classes, the Bayes classifier assigns x to class 2 when π₂ g₂(x) ≥ π₁ g₁(x) (a tie can be broken either way without changing the risk). Dividing both sides by π₂ g₁(x):

Λ(x) = g₂(x) / g₁(x) ≥ π₁ / π₂ = τ

The left side Λ(x) is the likelihood ratio — how many times more likely is x under H₂ compared to H₁. The right side τ is the threshold, determined by the priors.

The Likelihood Ratio Test: Compute the ratio of how likely x is under each hypothesis. If this ratio exceeds the prior-determined threshold, declare H₂. That's it. Every Bayes-optimal binary classifier is a likelihood ratio test.

Log-Likelihood Ratio

In practice, we take the logarithm (since log is monotonic, it doesn't change the decision):

log Λ(x) = log g₂(x) − log g₁(x) ≥ log τ

This is numerically more stable (avoids multiplying many small probabilities) and turns products into sums.
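A small experiment makes the stability point concrete. In this sketch the sample size N = 2000, the weak signal level 0.1, and the seed are all assumptions chosen for illustration. With that many i.i.d. samples, the product of densities underflows to zero under both hypotheses, while the sum of log-density differences stays well behaved.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(0)
N = 2000
mu = 0.1                                   # assumed weak constant signal
x = [rng.gauss(mu, 1.0) for _ in range(N)]

g1 = NormalDist(0.0, 1.0)                  # H1: noise only
g2 = NormalDist(mu, 1.0)                   # H2: signal + noise

# Direct likelihoods: products of 2000 numbers below 0.4 underflow to 0.0,
# so the ratio like2 / like1 would be 0/0 and cannot be evaluated.
like1 = math.prod(g1.pdf(v) for v in x)
like2 = math.prod(g2.pdf(v) for v in x)

# Log-likelihood ratio: a sum of per-sample log differences
log_lr = sum(math.log(g2.pdf(v)) - math.log(g1.pdf(v)) for v in x)

print(like1, like2, log_lr)
```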

Gaussian LRT

For our signal detection problem (H₁: N(0, σ²I), H₂: N(μ, σ²I)):

log Λ(x) = log g₂(x) − log g₁(x)

= −||x − μ||²/(2σ²) + ||x||²/(2σ²)

= (2x^Tμ − ||μ||²) / (2σ²)

= (x^Tμ − ||μ||²/2) / σ²

Since σ² > 0, multiplying both sides of log Λ(x) ≥ log(π₁/π₂) by σ² preserves the inequality. The test becomes:

x^Tμ ≥ ||μ||²/2 + σ² log(π₁/π₂)

With equal priors (π₁ = π₂), the log term vanishes and we recover x^Tμ ≥ ||μ||²/2 from Chapter 2.

Worked Example

μ = [2, 0]^T, σ² = 1, π₁ = 0.7, π₂ = 0.3.

• ||μ||²/2 = 4/2 = 2

• σ² log(π₁/π₂) = 1 × log(0.7/0.3) = log(2.33) ≈ 0.847

• Threshold on x^Tμ: 2 + 0.847 = 2.847

For x = [1.5, 0.3]^T: x^Tμ = 1.5 × 2 + 0.3 × 0 = 3.0 ≥ 2.847. Declare H₂ (signal present).

For x = [1.2, −0.5]^T: x^Tμ = 1.2 × 2 + (−0.5) × 0 = 2.4 < 2.847. Declare H₁ (noise).
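The test x^Tμ ≥ ||μ||²/2 + σ² log(π₁/π₂) takes only a few lines to implement. The sketch below reruns this worked example; the function name `gaussian_lrt` and its return format are illustrative assumptions.

```python
import math

def gaussian_lrt(x, mu, sigma2, pi1, pi2):
    """LRT for H1: N(0, sigma2*I) vs H2: N(mu, sigma2*I).

    Returns (statistic, threshold, decision).
    """
    stat = sum(xi * mi for xi, mi in zip(x, mu))        # x^T mu
    mu_sq = sum(mi * mi for mi in mu)                   # ||mu||^2
    thresh = mu_sq / 2 + sigma2 * math.log(pi1 / pi2)   # prior-adjusted threshold
    return stat, thresh, ("H2" if stat >= thresh else "H1")

mu = [2.0, 0.0]
result_a = gaussian_lrt([1.5, 0.3], mu, 1.0, 0.7, 0.3)
result_b = gaussian_lrt([1.2, -0.5], mu, 1.0, 0.7, 0.3)
print(result_a)
print(result_b)
```

Only the component of x along μ enters the statistic, so the second coordinate of each observation has no effect on the decision here.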

The role of priors: When π₁ = 0.7 (noise is more common), the threshold shifts up — you need stronger evidence (higher x^Tμ) to declare "signal present." The prior encodes your default belief. Unequal priors make the test conservative toward the more likely hypothesis.

Likelihood Ratio Test Visualized

Figure (interactive): The two Gaussian densities and the likelihood ratio Λ(x). The vertical line marks the threshold τ; the regions to its left and right are the decision regions for H₁ and H₂. Adjustable parameters: prior π₁, class 2 mean μ₂.

The likelihood ratio Λ(x) = g₂(x)/g₁(x) measures:

• How many times more likely x is under H₂ compared to H₁

• The probability that H₂ is true

• The distance between x and the decision boundary

Chapter 4: ROC Curves

The Bayes classifier uses a threshold determined by the priors. But what if you don't know the priors? Or what if the costs of different errors are unequal? The ROC curve (Receiver Operating Characteristic) shows the full trade-off between the two types of errors as you sweep the threshold.

Two Types of Errors

In binary detection (H₁ = noise, H₂ = signal):

	Decide H₁	Decide H₂
Truth: H₁	Correct rejection	False alarm (Type I)
Truth: H₂	Missed detection (Type II)	Detection (hit)

Define the key rates:

• P_FA = P(decide H₂ | H₁ is true) = probability of false alarm

• P_D = P(decide H₂ | H₂ is true) = probability of detection

We want P_D high (detect real signals) and P_FA low (don't trigger on noise). But they're in tension: lowering the threshold detects more signals but also creates more false alarms.

Constructing the ROC

For the Gaussian LRT with test statistic T(x) = x^Tμ and threshold γ:

• Under H₁: T ~ N(0, σ²||μ||²), so P_FA = P(T ≥ γ | H₁) = Q(γ / (σ||μ||))

• Under H₂: T ~ N(||μ||², σ²||μ||²), so P_D = P(T ≥ γ | H₂) = Q((γ − ||μ||²) / (σ||μ||))

where Q(z) = P(Z ≥ z) for Z ~ N(0,1) is the Q-function (complementary CDF).

As γ sweeps from −∞ to +∞:

• γ = −∞: always declare H₂, so P_FA = P_D = 1 (top-right corner)

• γ = +∞: always declare H₁, so P_FA = P_D = 0 (origin)

• Intermediate γ: a curve in the (P_FA, P_D) plane
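The sweep over γ can be sketched directly from the two Q-function expressions. In the code below, ||μ|| = 2, σ = 1, and the grid of thresholds are illustrative assumptions, and `Q` and `roc_point` are hypothetical helper names built on the standard library's normal CDF.

```python
from statistics import NormalDist

_std = NormalDist()   # standard normal N(0, 1)

def Q(z):
    """Standard normal tail probability P(Z >= z)."""
    return 1.0 - _std.cdf(z)

def roc_point(gamma, mu_norm, sigma):
    """Return (P_FA, P_D) for the test x^T mu >= gamma."""
    p_fa = Q(gamma / (sigma * mu_norm))
    p_d = Q((gamma - mu_norm**2) / (sigma * mu_norm))
    return p_fa, p_d

mu_norm, sigma = 2.0, 1.0                       # assumed ||mu|| and noise std
gammas = [-6.0 + 0.5 * i for i in range(41)]    # thresholds from -6 to 14
roc = [roc_point(g, mu_norm, sigma) for g in gammas]

for g, (p_fa, p_d) in zip(gammas[::8], roc[::8]):
    print(f"gamma={g:5.1f}  P_FA={p_fa:.4f}  P_D={p_d:.4f}")
```

Plotting the pairs in `roc` with P_FA on the horizontal axis traces the curve from near (1, 1) at low thresholds down to near (0, 0) at high ones.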

Reading an ROC curve: The curve goes from (0,0) to (1,1). A perfect detector hugs the top-left corner (P_D = 1, P_FA = 0). A random coin flip gives the diagonal line P_D = P_FA. Any useful detector lies above the diagonal. The area under the ROC curve (AUC) summarizes performance: AUC = 1 is perfect, AUC = 0.5 is random guessing.

SNR and the ROC

The shape of the ROC depends on the signal-to-noise ratio (SNR):

SNR = ||μ||² / σ²

Higher SNR pushes the ROC curve toward the top-left corner. At infinite SNR, the signal is so strong that you can achieve P_D = 1 with P_FA = 0. At zero SNR (μ = 0), the curve is the diagonal — the signal is indistinguishable from noise.

For N samples of a constant signal with amplitude A in noise with variance σ²:

SNR = N · A² / σ²

SNR grows linearly with the number of samples: doubling N is equivalent to doubling A². This is the processing gain of matched filtering.
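Eliminating γ from the two ROC expressions gives P_D = Q(Q⁻¹(P_FA) − √SNR), so the detection probability at a fixed false alarm rate depends on the signal and noise only through the SNR. The sketch below uses that relation to compare doubling N with doubling A²; the function names and the numerical values are illustrative assumptions.

```python
import math
from statistics import NormalDist

_std = NormalDist()   # standard normal N(0, 1)

def snr_constant_signal(n_samples, amplitude, sigma):
    """SNR = N * A^2 / sigma^2 for a constant signal of amplitude A."""
    return n_samples * amplitude**2 / sigma**2

def detection_prob(snr, p_fa):
    """P_D of the threshold test on x^T mu at false alarm rate p_fa."""
    z_alpha = _std.inv_cdf(1.0 - p_fa)              # Q^{-1}(p_fa)
    return 1.0 - _std.cdf(z_alpha - math.sqrt(snr))

base = snr_constant_signal(4, 1.0, 1.0)               # N = 4, A = 1
more_samples = snr_constant_signal(8, 1.0, 1.0)       # double N
more_power = snr_constant_signal(4, math.sqrt(2), 1.0)  # double A^2

for label, snr in [("base", base), ("2x N", more_samples), ("2x A^2", more_power)]:
    print(label, round(snr, 3), round(detection_prob(snr, 0.01), 3))
```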

Interactive ROC Curve

Figure (interactive): ROC curve with the operating point set by the threshold γ. Moving γ changes P_FA and P_D simultaneously, and changing the SNR changes the shape of the curve. Adjustable parameters: threshold γ, SNR (dB).

On an ROC curve, a random-guessing classifier (coin flip) lies along:

• The top edge (P_D = 1 for all P_FA)

• The diagonal from (0,0) to (1,1)

• The bottom edge (P_D = 0 for all P_FA)

Chapter 5: Neyman-Pearson

The Bayes framework assumes you know the priors π₁, π₂. But in many applications, the priors are unknown or irrelevant. In radar, you don't care what fraction of time slots contain targets — you care about not missing a target while keeping false alarms manageable.

The Neyman-Pearson approach takes a different perspective: maximize the probability of detection P_D subject to a constraint on the false alarm rate P_FA ≤ α.

The Neyman-Pearson Lemma

Neyman-Pearson Lemma: Among all tests with P_FA ≤ α, the test that maximizes P_D is the likelihood ratio test: declare H₂ when Λ(x) ≥ τ, where τ is chosen so that P_FA = α exactly. In other words: the LRT is not just Bayes-optimal — it's also Neyman-Pearson optimal!

This is a profound result. It says the same test structure (compare the likelihood ratio to a threshold) is optimal under two completely different criteria:

1. Minimizing overall error probability (Bayes, threshold set by priors)

2. Maximizing detection for a fixed false alarm rate (Neyman-Pearson, threshold set by α)

Only the threshold value changes. The sufficient statistic is the same.

Setting the Threshold

For the Gaussian case, the test statistic T = x^Tμ has distribution N(0, σ²||μ||²) under H₁. We need:

P_FA = P(T ≥ γ | H₁) = Q(γ / (σ||μ||)) = α

So the threshold is:

γ = σ ||μ|| · Q⁻¹(α)

where Q⁻¹ is the inverse Q-function.

Worked Example

α = 0.01 (1% false alarm rate), σ = 1, ||μ|| = 2.

• Q⁻¹(0.01) = 2.326

• γ = 1 × 2 × 2.326 = 4.652

• Resulting P_D = Q((4.652 − 4)/(1 × 2)) = Q(0.326) ≈ 0.372

At 1% false alarm rate with SNR = 4, we only detect 37% of targets. Harsh. To improve: increase SNR (more power, more samples, or less noise).

If we relax to α = 0.05:

• γ = 2 × 1.645 = 3.290

• P_D = Q((3.290 − 4)/2) = Q(−0.355) = 1 − Q(0.355) ≈ 0.639

Allowing 5% false alarms raises the detection rate from about 37% to about 64%, a gain of roughly 1.7×.
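Both operating points in this example follow from γ = σ||μ|| · Q⁻¹(α) and the P_D expression from the ROC chapter. The sketch below recomputes them with the standard library's inverse normal CDF; the helper names `q_inv`, `Q`, and `neyman_pearson` are illustrative assumptions.

```python
from statistics import NormalDist

_std = NormalDist()   # standard normal N(0, 1)

def q_inv(p):
    """Inverse Q-function: the z with P(Z >= z) = p."""
    return _std.inv_cdf(1.0 - p)

def Q(z):
    """Standard normal tail probability P(Z >= z)."""
    return 1.0 - _std.cdf(z)

def neyman_pearson(alpha, mu_norm, sigma):
    """Return (gamma, P_D) for the test x^T mu >= gamma with P_FA = alpha."""
    gamma = sigma * mu_norm * q_inv(alpha)
    p_d = Q((gamma - mu_norm**2) / (sigma * mu_norm))
    return gamma, p_d

gamma_01, pd_01 = neyman_pearson(0.01, mu_norm=2.0, sigma=1.0)
gamma_05, pd_05 = neyman_pearson(0.05, mu_norm=2.0, sigma=1.0)
print(round(gamma_01, 3), round(pd_01, 3))
print(round(gamma_05, 3), round(pd_05, 3))
```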

The fundamental trade-off: On the ROC curve, the Neyman-Pearson test picks the point where P_FA = α. Moving left on the ROC (lower α) decreases both P_FA and P_D. There's no free lunch: for the optimal test, reducing one type of error increases the other. The ROC curve of the LRT is the upper boundary of the achievable (P_FA, P_D) pairs.

Applications

Application	H₁	H₂	Typical α
Radar	No target	Target present	10⁻⁶
Medical screening	Healthy	Disease	0.05
Spam filter	Legitimate	Spam	0.01
Particle physics	Background	New particle	2.87 × 10⁻⁷ (5σ)

Neyman-Pearson: Fix P_FA, Maximize P_D

Figure (interactive): Neyman-Pearson operating point. Given the desired false alarm rate α, the threshold γ follows automatically, and P_D depends on SNR. Adjustable parameters: false alarm rate α, SNR (dB).

The Neyman-Pearson lemma states that the test maximizing P_D for a given P_FA ≤ α is:

• The likelihood ratio test with threshold chosen so P_FA = α

• A nearest-neighbor classifier

• A test that always declares H₂

Chapter 6: Multiple Classes

So far we've focused on binary detection (K = 2). The Bayes classifier generalizes naturally to K > 2 classes. At each point x, pick the class k that maximizes the posterior η_k(x) = π_k g_k(x) / p(x), or equivalently, the numerator π_k g_k(x).

Decision Regions

The Bayes classifier partitions the feature space ℝ^N into K decision regions R₁, R₂, ..., R_K:

R_k = {x : π_k g_k(x) ≥ π_j g_j(x) for all j ≠ k}

The boundaries between regions are where two classes tie: π_k g_k(x) = π_j g_j(x).

Gaussian Case: K Classes

If each class is Gaussian with mean μ_k and shared covariance Σ = σ²I:

g_k(x) ∝ exp(−||x − μ_k||² / (2σ²))

Taking the log of π_k g_k(x):

log(π_k g_k(x)) = log π_k − ||x − μ_k||² / (2σ²)

Expanding the quadratic:

= log π_k − ||x||²/(2σ²) + x^Tμ_k/σ² − ||μ_k||²/(2σ²)

The ||x||² term is the same for all k, so the classifier becomes:

f*(x) = argmax_k [x^Tμ_k/σ² − ||μ_k||²/(2σ²) + log π_k]

This is linear in x! The decision boundaries between classes are hyperplanes. We'll see in Lecture 15 that this is exactly Linear Discriminant Analysis (LDA).

Key insight: For Gaussian classes with shared covariance σ²I, the Bayes classifier has linear decision boundaries. Each boundary is perpendicular to the line connecting two means: with equal priors it is the perpendicular bisector, and unequal priors shift it toward the less likely class. No optimization needed: just plug in μ_k, σ², and π_k.
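To make the pairwise boundary explicit, set the discriminants of classes k and j equal and solve for x. The derivation below is a sketch using only the quantities already defined (μ_k, σ², π_k):

```latex
\begin{aligned}
\frac{x^{T}\mu_k}{\sigma^2} - \frac{\|\mu_k\|^2}{2\sigma^2} + \log \pi_k
  &= \frac{x^{T}\mu_j}{\sigma^2} - \frac{\|\mu_j\|^2}{2\sigma^2} + \log \pi_j \\
x^{T}(\mu_k - \mu_j)
  &= \frac{\|\mu_k\|^2 - \|\mu_j\|^2}{2} - \sigma^2 \log\frac{\pi_k}{\pi_j}
\end{aligned}
```

With equal priors the log term vanishes. Taking k = 2 with μ₂ = μ and j = 1 with μ₁ = 0 recovers the binary test x^Tμ ≥ ||μ||²/2 + σ² log(π₁/π₂) from Chapter 3.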

Worked Example: 3 Classes in 2D

Three classes with equal priors π_k = 1/3 and σ² = 1:

• μ₁ = [0, 0]^T

• μ₂ = [3, 0]^T

• μ₃ = [1.5, 2.5]^T

Boundary between classes 1 and 2: x^T(μ₁ − μ₂) = (||μ₁||² − ||μ₂||²)/2

• x^T[−3, 0]^T = (0 − 9)/2 → −3x₁ = −4.5 → x₁ = 1.5

A vertical line at x₁ = 1.5. Makes sense — it's the midpoint.
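The full three-class rule can be sketched by evaluating the linear discriminant for each class and taking the argmax. The function names `discriminant` and `classify` and the three test points are illustrative assumptions; the means, priors, and σ² are those of this example.

```python
import math

def discriminant(x, mu, sigma2, prior):
    """Linear score x^T mu / s2 - ||mu||^2 / (2 s2) + log(prior)."""
    dot = sum(a * b for a, b in zip(x, mu))
    return dot / sigma2 - sum(m * m for m in mu) / (2 * sigma2) + math.log(prior)

def classify(x, means, sigma2, priors):
    """Return the 1-based index of the class with the largest score."""
    scores = [discriminant(x, m, sigma2, p) for m, p in zip(means, priors)]
    return 1 + max(range(len(scores)), key=scores.__getitem__)

means = [[0.0, 0.0], [3.0, 0.0], [1.5, 2.5]]
priors = [1 / 3, 1 / 3, 1 / 3]
sigma2 = 1.0

label_left = classify([1.0, 0.0], means, sigma2, priors)    # left of x1 = 1.5
label_right = classify([2.0, 0.0], means, sigma2, priors)   # right of x1 = 1.5
label_top = classify([1.5, 3.0], means, sigma2, priors)     # near mu_3
print(label_left, label_right, label_top)
```

The first two test points sit on either side of the vertical line x₁ = 1.5, so they land in classes 1 and 2 respectively.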

Multi-Class Bayes Decision Regions

Figure (interactive): Three Gaussian classes in 2D with shared covariance σ²I. The decision boundaries are straight lines (hyperplanes in higher dimensions). Adjustable parameter: spread σ.

For K Gaussian classes with shared covariance σ²I and equal priors, the Bayes decision boundaries are:

• Linear (hyperplanes): perpendicular bisectors between class means

• Quadratic (ellipses or parabolas)

• Arbitrary nonlinear curves

Chapter 7: Mastery

Let's step back and see the full picture of what we've built.

The Bayes Classification Framework

Model

Prior π_k, class-conditional g_k(x)

↓

Posterior

η_k(x) = π_k g_k(x) / p(x) via Bayes' rule

↓

Bayes Classifier

f*(x) = argmax_k η_k(x) — minimum risk

↓

Binary: LRT

Λ(x) = g₂/g₁ vs τ = π₁/π₂

↓

ROC & Neyman-Pearson

Sweep τ → ROC curve. Fix P_FA = α → N-P optimal test

Key Results

Concept	Formula	Key Takeaway
Bayes risk	R* = E[1 − max_k η_k(x)]	Fundamental limit; no classifier can beat it
LRT	Λ(x) ≷ π₁/π₂	All optimal binary tests are LRTs
Gaussian LRT	x^Tμ ≷ ||μ||²/2 + σ² log(π₁/π₂)	Sufficient statistic = matched filter output
ROC	(P_FA, P_D) as γ varies	Trade-off curve between error types
Neyman-Pearson	max P_D s.t. P_FA ≤ α	LRT is optimal; just change γ

What Comes Next

This lecture assumed we KNOW the distributions g_k(x). In practice, we estimate them from data. That leads to:

• Lecture 15: Stationary signals, autocorrelation, and LDA/QDA — the parametric Gaussian classifiers

• Lecture 16: Fisher's Linear Discriminant — finding the best projection without assuming Gaussianity

Limitations

• Requires known distributions. Real-world distributions are unknown. LDA/QDA estimate them; non-parametric methods (k-NN, kernels) avoid distributional assumptions altogether.

• Curse of dimensionality. In high dimensions, density estimation becomes exponentially hard. This motivates dimensionality reduction (PCA, Fisher) before classification.

• 0-1 loss only. We derived the classifier assuming both error types cost the same. Weighted losses lead to cost-sensitive Bayes classifiers with modified thresholds.

"All models are wrong, but some are useful." — George Box. The Bayes classifier is optimal given the model. If your model (priors + class-conditionals) is wrong, the Bayes classifier for that model won't be the true Bayes classifier. The art of classification is choosing a model that's wrong in the right ways.

The Bayes classifier and the Neyman-Pearson optimal test are both likelihood ratio tests. What differs?

• The test statistic (they use different functions of x)

• The Bayes test uses priors; Neyman-Pearson doesn't apply to Gaussians

• Only the threshold: Bayes sets it from priors, Neyman-Pearson from the allowed false alarm rate α