Ch 11: General Inference — Piech Probability CS

Chapter 0: Why General Inference?

You have built a Bayesian network with dozens of random variables — diseases, symptoms, demographics, lab results. The graph encodes the full joint distribution compactly. But now someone asks: "Given that the patient has a fever and is a university student, what is the probability they have the flu?" How do you actually compute that answer?

Analytic computation — writing out the conditional probability formula and summing over all assignments — is possible in theory but explodes combinatorially. With 20 binary variables, the joint table has 2²⁰ ≈ 1 million cells. With continuous variables, the sum becomes an integral over infinite values. We need a better way.

General inference is the art of answering arbitrary probability queries on a Bayesian network. The key insight: instead of computing exact probabilities, we can sample thousands of particles from the joint distribution and then count. This is called rejection sampling, and it is one of the great practical ideas in probabilistic AI.

The core idea: Generate millions of joint samples (particles) from your Bayesian network. To answer any conditional probability question, simply filter the particles to those consistent with your condition, then count how many also match your query. Sampling turns hard math into easy counting.

But sampling is not the end of the story. Once you can answer probability questions, a deeper question emerges: are your answers fair? If an algorithm uses this network to decide who gets a loan or who gets flagged for extra screening, does it treat different demographic groups equitably?

This chapter covers two intertwined topics. First, we develop rejection sampling and understand its limits. Then we introduce three formal definitions of fairness — parity, calibration, and equality of odds — and discover a startling result: you cannot satisfy all three at once.

Topic	Key Question
Rejection Sampling	How do I compute P(X\|Y) on a large Bayes net?
Likelihood Weighting	What if the condition is too rare for rejection sampling?
Parity	Does the algorithm predict positive at the same rate for all groups?
Calibration	Is the algorithm equally accurate for all groups?
Equality of Odds	Among truly positive cases, does it catch them equally across groups?
Impossibility Theorem	Can any algorithm satisfy all three?

Real-world impact: General inference runs inside thousands of hospital diagnostic systems, spam filters, and recommendation engines right now. And fairness is not academic — in 2018, Buolamwini and Gebru showed that production facial recognition systems at IBM, Microsoft, and Face++ had accuracy gaps of over 34 percentage points between lighter-skinned men and darker-skinned women.

Check: Why can't we just compute conditional probabilities analytically on a large Bayesian network?

The Bayes net is too hard to draw Summing over all variable assignments grows exponentially in the number of variables Conditional probabilities are undefined for Bayes nets

Chapter 1: The Inference Framework

Let's start with the simplest inference technique: rejection sampling. The idea is almost embarrassingly simple. You generate a huge number of particles — joint samples from all the variables in your Bayes net — and then answer questions by counting.

To generate one particle, sample each variable in topological order (parents before children). For each variable, sample its value from the conditional distribution given its parents' already-sampled values. This is called ancestral sampling.

Here is a concrete example using a Simple Disease Model with four variables: Uni (is a university student), Influenza, Fever, and Tired.

Step 1: Sample Uni

P(Uni = 1) = 0.8. Draw from Bernoulli(0.8). Got Uni = 1.

↓

Step 2: Sample Influenza

P(Flu = 1 | Uni = 1) = 0.2. Draw from Bernoulli(0.2). Got Flu = 0.

↓

Step 3: Sample Fever

P(Fever = 1 | Flu = 0) = 0.05. Draw from Bernoulli(0.05). Got Fever = 0.

↓

Step 4: Sample Tired

P(Tired = 1 | Uni = 1, Flu = 0) = 0.8. Draw from Bernoulli(0.8). Got Tired = 0.

This produces one particle: [Uni=1, Flu=0, Fever=0, Tired=0]. Each particle's probability equals the product of the conditionals used to generate it: P(Uni=1) × P(Flu=0|Uni=1) × P(Fever=0|Flu=0) × P(Tired=0|Uni=1,Flu=0) = 0.8 × 0.8 × 0.95 × 0.2 = 0.1216. But we don't need to compute this — the beauty of sampling is that particles appear with frequency proportional to their probability automatically.

Repeat N = 100,000 times and you have a collection of particles that approximates the joint distribution.

Now the magic: to compute any probability, just count.

P(X = k) ≈ N(X = k) / N

Where N(X = k) is the number of particles where variables X take values k. For conditional probabilities, use the definition:

P(X = a | Y = b) = P(X = a, Y = b) / P(Y = b) = N(X = a, Y = b) / N(Y = b)

In words: filter your particles to those where Y = b (the condition). Among those, count how many also have X = a. That fraction is your answer.

Key insight: Rejection sampling turns any conditional probability question into a counting exercise. Filter, then count. The "rejection" part is the filtering — you reject (throw away) every particle that is inconsistent with your condition.

python
def rejection_sampling(query, condition, N=100000):
    # Generate N particles from the Bayes net
    particles = [sample_particle() for _ in range(N)]
    # Reject particles inconsistent with condition
    consistent = [p for p in particles if matches(p, condition)]
    # Count those also matching query
    hits = sum(1 for p in consistent if matches(p, query))
    return hits / len(consistent)

Check: In rejection sampling, what does "rejection" refer to?

Rejecting the Bayesian network if it has too many variables Discarding sampled particles that are inconsistent with the condition Rejecting conditional probabilities less than 0.5

Chapter 2: Prior & Posterior

Before we see any evidence, we have a prior belief about the world. In our disease model, the prior probability of having the flu might be P(Flu = 1) = 0.2 for a university student. This is what we believe before taking any measurements.

After observing evidence — say, the patient has a fever of 101 — our belief updates. The updated belief is called the posterior: P(Flu = 1 | Fever = 101). This is what we believe after (post) incorporating evidence.

P(Flu = 1 | Fever = 101) ≠ P(Flu = 1)

The posterior is almost always different from the prior, because evidence changes what we believe. The whole point of inference is computing this posterior.

Prior vs. posterior: Think of the prior as your initial guess before the exam results come back. The posterior is your updated belief after seeing the grade. Evidence moves you from prior to posterior. Bayesian inference is this update, repeated as many times as you have pieces of evidence.

In a Bayesian network, the prior for any variable is what you get by marginalizing the joint distribution over all other variables. The posterior is what you get after conditioning on observed evidence. Rejection sampling gives us both: the raw particle counts approximate the prior, and filtered counts approximate the posterior.

Mathematically, the update from prior to posterior follows Bayes' theorem:

P(H | E) = P(E | H) · P(H) / P(E)

Where P(H) is the prior, P(E|H) is the likelihood (how probable the evidence is if H is true), and P(E) is the marginal probability of the evidence (a normalizing constant). The posterior P(H|E) is the updated belief. Rejection sampling computes this ratio implicitly through counting.

Worked example: Suppose we run 10,000 particles through the Simple Disease Model and get:

Flu	Fever	Count
0	0	6,080
0	1	320
1	0	1,440
1	1	2,160

The prior on flu: P(Flu=1) = (1,440 + 2,160) / 10,000 = 0.36.

The posterior given fever: P(Flu=1 | Fever=1) = 2,160 / (320 + 2,160) = 2,160 / 2,480 ≈ 0.87.

Observing a fever dramatically increased our belief in the flu — from 36% to 87%. That is the power of conditioning.

Think of it this way: The prior is the "base rate" — how common flu is in the general population. The posterior is the "updated rate" once you have specific evidence about this patient. A good diagnostic tool should move the posterior far from the prior when the evidence is strong.

Check: If you observe no evidence at all, what is the posterior equal to?

The prior (no evidence means no update) Zero (no evidence means no probability) 0.5 (maximum uncertainty)

Chapter 3: Likelihood

There is a subtlety that becomes critical when the condition involves a continuous variable. Suppose we change Fever from binary to continuous, with a normal distribution:

If Flu = 0: Fever ~ N(μ = 98.3, σ = 0.7)
If Flu = 1: Fever ~ N(μ = 100.0, σ = 1.8)

The likelihood is the probability (or density) of the observed evidence given a hypothesis. If we observe Fever = 101, the likelihood under each hypothesis is:

L(Flu=0) = f(101 | μ=98.3, σ=0.7) ≈ 0.0000003
L(Flu=1) = f(101 | μ=100.0, σ=1.8) ≈ 0.199

The likelihood under Flu=1 is about 660,000 times larger than under Flu=0. A fever of 101 is vastly more consistent with having the flu than being healthy.

Key insight: The likelihood is not a probability of the hypothesis. It is the probability of the evidence given the hypothesis. L(Flu=0) and L(Flu=1) do not need to sum to 1. They measure how well each hypothesis explains the data.

The rejection sampling problem: When Fever is continuous, no particle will have exactly Fever = 101. Every single particle gets rejected. Our denominator is zero — a divide-by-zero error.

One quick fix: bucket rounding. Round all sampled fever values to the nearest integer, then condition on the rounded value. This introduces approximation error but makes the method work.

A better fix is likelihood weighting: instead of rejecting particles, assign each particle a weight proportional to how consistent it is with the evidence. Particles that match the evidence well get high weight; particles that don't get low weight. This avoids waste and handles continuous conditioning.

There is also MCMC (Markov Chain Monte Carlo), specifically a variant called Gibbs sampling. Instead of sampling from the joint and rejecting, MCMC samples from the posterior directly by constructing a Markov chain whose stationary distribution is the posterior. Each sample depends on the previous one (hence "chain"), but after enough steps, you get samples from the correct distribution. Gibbs sampling is especially elegant: it updates one variable at a time, conditioned on the current values of all other variables. This requires knowing each variable's conditional distribution given its Markov blanket (parents, children, and co-parents). CS221 and CS228 cover MCMC in depth.

Worked example: We want P(Flu=1 | Fever=101). Using Bayes' theorem and our likelihoods:

P(Flu=1 | Fever=101) = L(Flu=1) · P(Flu=1) / [ L(Flu=1) · P(Flu=1) + L(Flu=0) · P(Flu=0) ]

With P(Flu=1) = 0.2 and P(Flu=0) = 0.8:

Numerator: 0.199 × 0.2 = 0.0398

Denominator: 0.0398 + 0.0000003 × 0.8 ≈ 0.0398

Result: P(Flu=1 | Fever=101) ≈ 0.9999. A fever of 101 makes flu nearly certain.

Continuous variables break rejection sampling because the probability of any exact value is zero. Solutions include bucketing (approximate but simple), likelihood weighting (exact up to sampling noise), and MCMC methods like Gibbs sampling (powerful but more complex). CS221 and CS228 cover these advanced methods.

Check: Why does rejection sampling fail when conditioning on a continuous variable?

No particle will ever exactly match a specific continuous value, so all particles get rejected Continuous variables cannot be part of Bayesian networks The likelihood is always zero for continuous variables

Chapter 4: Computing Posteriors

Let's put it all together and watch rejection sampling compute a posterior in real time. The simulation below models a simple disease network. You set the prior probability of flu and the evidence (fever observed or not), and the algorithm generates particles one by one, rejecting those inconsistent with your evidence.

Rejection Sampling — Live

Watch particles get sampled and either accepted (green) or rejected (red). The posterior estimate converges as more particles accumulate.

Particles: 0 | Accepted: 0

Prior P(Flu)20%

Speed10/frame

What to watch for: Early on, the posterior estimate (orange line) bounces wildly. As more accepted particles accumulate, it converges. With N ≈ 1,000 accepted particles, the estimate is typically within a few percentage points of the true value. This is the law of large numbers in action.

Worked example: Suppose we run 10,000 particles with P(Flu) = 0.2, P(Fever|Flu) = 0.6, P(Fever|Healthy) = 0.05. We condition on Fever = 1.

Expected particles with Fever=1: 0.2 × 0.6 + 0.8 × 0.05 = 0.12 + 0.04 = 0.16 → about 1,600 accepted.

Among those, fraction with Flu=1: 0.12 / 0.16 = 0.75.

So P(Flu=1 | Fever=1) ≈ 0.75. The prior was 0.2, the posterior is 0.75 — fever is strong evidence for flu.

Check: If only 0.01% of particles match the condition, what practical problem arises?

The posterior estimate will be very wrong due to rounding Very few accepted particles means high variance and possibly divide-by-zero The Bayesian network is incorrectly specified

Chapter 5: Fairness Definitions

Now that we can compute conditional probabilities, let's use them to ask an uncomfortable question: is our algorithm fair?

Consider a loan prediction algorithm. It outputs a binary guess G (approve or deny) for each applicant. Each person has a true outcome T (will they actually repay?) and belongs to a demographic group D. We have a joint probability table from historical predictions:

D = 0 (majority group)
	G = 0	G = 1
T = 0	0.21	0.32
T = 1	0.07	0.28

D = 1 (minority group)
	G = 0	G = 1
T = 0	0.01	0.01
T = 1	0.02	0.08

Each cell is P(D=i, G=j, T=k). All 8 cells sum to 1. Let's marginalize: P(D=0) = 0.21 + 0.32 + 0.07 + 0.28 = 0.88. P(D=1) = 0.01 + 0.01 + 0.02 + 0.08 = 0.12.

Definition 1 — Parity: An algorithm satisfies parity if P(G=1|D=0) = P(G=1|D=1). In words: the algorithm approves at the same rate regardless of group membership.

Testing parity:

P(G=1 | D=1) = (0.01 + 0.08) / 0.12 = 0.09 / 0.12 = 0.75

P(G=1 | D=0) = (0.32 + 0.28) / 0.88 = 0.60 / 0.88 ≈ 0.682

Since 0.75 ≠ 0.682, parity is violated. The algorithm approves group D=1 at a higher rate.

Definition 2 — Calibration: An algorithm satisfies calibration if P(G=T|D=0) = P(G=T|D=1). In words: the algorithm is equally accurate for both groups.

Testing calibration:

P(G=T | D=0) = P(G=0,T=0|D=0) + P(G=1,T=1|D=0) = (0.21 + 0.28) / 0.88 ≈ 0.557

P(G=T | D=1) = (0.01 + 0.08) / 0.12 = 0.75

Since 0.557 ≠ 0.75, calibration is violated. The algorithm is more accurate for the minority group.

Definition 3 — Equality of Odds: An algorithm satisfies equality of odds if P(G=1|D=0, T=1) = P(G=1|D=1, T=1). Among people who truly deserve approval, the algorithm approves them at the same rate regardless of group.

Testing equality of odds:

P(G=1 | D=1, T=1) = 0.08 / (0.02 + 0.08) = 0.08 / 0.10 = 0.80

P(G=1 | D=0, T=1) = 0.28 / (0.07 + 0.28) = 0.28 / 0.35 = 0.80

Since 0.80 = 0.80, equality of odds is satisfied. Among truly deserving people, both groups are approved equally.

Check: Which fairness definition does this algorithm satisfy?

Parity Calibration Equality of Odds

Chapter 6: Independence & Separation

The three fairness definitions — parity, calibration, and equality of odds — sound like they should all hold simultaneously. A fair algorithm should approve at the same rate, be equally accurate, and catch truly deserving people equally. Shouldn't these be compatible?

They are not. This is the Impossibility Theorem of Machine Fairness: except in trivial cases, no algorithm can satisfy all three definitions at once.

The Impossibility Theorem: Parity, calibration, and equality of odds cannot all hold simultaneously unless (1) the algorithm is perfect (never makes mistakes) or (2) the base rates P(T=1|D=0) and P(T=1|D=1) are identical across groups. In the real world, neither condition typically holds.

Why does this happen? Consider the relationship between the three definitions. Parity concerns G and D unconditionally. Calibration concerns the joint (G, T) and D. Equality of odds conditions on T. Each definition "uses" the variables T, G, and D in a different way. When base rates differ between groups — as they almost always do in real data — satisfying one definition forces a violation of another.

The Gender Shades study: In 2018, Joy Buolamwini and Timnit Gebru examined facial recognition classifiers from IBM, Microsoft, and Face++. They measured accuracy (calibration) across demographic intersections:

IBM Classifier	Accuracy
Lighter Men	99.7%
Lighter Women	92.9%
Darker Men	88.0%
Darker Women	65.3%

The gap between the best-served group (lighter men: 99.7%) and the worst-served group (darker women: 65.3%) is 34.4 percentage points. This is a massive calibration failure. The algorithm is not equally accurate for different demographic groups.

Why intersectionality matters: Looking at gender alone (women: ~79%, men: ~94%) or skin tone alone (lighter: ~95%, darker: ~77%) understates the problem. The worst-off group is the intersection — darker-skinned women — who suffer both penalties compounded. This is why fairness audits must examine intersectional subgroups.

The deeper lesson: these systems were trained on datasets dominated by lighter-skinned male faces. When you optimize for average accuracy, the majority group gets most of the weight, and minority intersections suffer disproportionately. Fairness requires intentional, demographic-aware evaluation.

Worked example: Suppose an algorithm has overall accuracy 90%. Group A (70% of users) has accuracy 95%. Group B (30% of users) has accuracy 78%. The weighted average: 0.7 × 95 + 0.3 × 78 = 66.5 + 23.4 = 89.9% ≈ 90%. The high overall number hides a 17-point gap. This is why average accuracy is a misleading metric when groups have different base rates or representation.

Ways forward: Wadsworth et al. (2018) proposed adversarial debiasing, where you train an adversary to detect the demographic from the algorithm's predictions. If the adversary succeeds, the main model is leaking demographic information and you penalize it. Other approaches include re-sampling training data, post-hoc calibration per group, and using fairness constraints as regularizers during training.

Check: The Impossibility Theorem says parity, calibration, and equality of odds can all hold simultaneously only if:

The base rates are equal across groups or the algorithm is perfect The algorithm uses deep learning The number of samples is large enough

Chapter 7: Showcase — Inference & Fairness Calculator

This interactive simulation brings together both topics. On the left, you see a live rejection sampling engine: set a prior and likelihood, and watch the posterior converge as particles accumulate. On the right, a fairness dashboard lets you set up a joint table and instantly see which fairness definitions hold.

Bayesian Inference Calculator

Adjust the prior P(H) and the likelihoods P(E|H) and P(E|¬H). The posterior P(H|E) is computed via Bayes' theorem and visualized as a bar chart. The bottom chart shows how the rejection-sampling estimate converges.

Samples: 0

Prior P(H)20%

P(E|H)60%

P(E|¬H)5%

Fairness Dashboard

This visualization computes all three fairness metrics from the Piech loan-prediction example. Bars show the metric values for each group; green means the definition is satisfied, red means violated.

Try this: In the Bayesian calculator, set P(E|H) = 90% and P(E|¬H) = 90%. The posterior equals the prior — the evidence is useless because it is equally likely under both hypotheses. Now set P(E|¬H) = 1%. The posterior shoots up — the evidence strongly distinguishes the hypotheses.

Check: If P(E|H) = P(E|¬H), what happens to the posterior P(H|E)?

It equals the prior — the evidence is uninformative It goes to 1 It goes to 0

Chapter 8: Worked Problems

Problem 1: Rejection sampling efficiency. You have a Bayes net with 5 binary variables. You want to compute P(X₁=1 | X₅=1), and P(X₅=1) = 0.001. If you generate 1,000,000 particles, approximately how many will be accepted?

Solution: The acceptance rate equals P(condition) = P(X₅=1) = 0.001. Of 1,000,000 particles, about 0.001 × 1,000,000 = 1,000 will be accepted. With 1,000 accepted particles, the posterior estimate will have a standard error of roughly 1/√1000 ≈ 3.2%. Not great for a million samples.

Problem 2: Fairness computation. Given this joint probability table for a binary classifier:

D = 0
	G = 0	G = 1
T = 0	0.30	0.10
T = 1	0.05	0.15

D = 1
	G = 0	G = 1
T = 0	0.15	0.05
T = 1	0.05	0.15

Solution:
P(D=0) = 0.30 + 0.10 + 0.05 + 0.15 = 0.60. P(D=1) = 0.15 + 0.05 + 0.05 + 0.15 = 0.40.
Parity: P(G=1|D=0) = (0.10+0.15)/0.60 = 0.417. P(G=1|D=1) = (0.05+0.15)/0.40 = 0.50. Not equal → violated.
Calibration: P(G=T|D=0) = (0.30+0.15)/0.60 = 0.75. P(G=T|D=1) = (0.15+0.15)/0.40 = 0.75. Equal → satisfied.
Equality of Odds: P(G=1|D=0,T=1) = 0.15/(0.05+0.15) = 0.75. P(G=1|D=1,T=1) = 0.15/(0.05+0.15) = 0.75. Equal → satisfied.

Problem 3: Bayes' theorem from counts. You run rejection sampling and get 50,000 accepted particles (condition: Fever=1). Of these, 38,000 have Flu=1 and 12,000 have Flu=0. What is the approximate posterior P(Flu=1 | Fever=1)?

Solution: P(Flu=1 | Fever=1) ≈ 38,000 / 50,000 = 0.76. That's it — rejection sampling reduces Bayesian inference to counting.

Problem 4: Likelihood ratio. A healthy person has temperature N(98.6, 0.5). A sick person has temperature N(101.0, 1.5). You observe temperature 100.5. Which hypothesis does this favor, and by what factor?

Solution:
L(healthy) = f(100.5 | 98.6, 0.5): z = (100.5−98.6)/0.5 = 3.8. f(3.8) is tiny ≈ 0.000292.
L(sick) = f(100.5 | 101.0, 1.5): z = (100.5−101.0)/1.5 = −0.333. f(−0.333) ≈ 0.378/1.5 ≈ 0.252.
Likelihood ratio: 0.252 / 0.000292 ≈ 863. The observation favors "sick" by a factor of ~863.

Check: In Problem 2, which fairness definition is violated?

Parity Calibration Equality of Odds

Chapter 9: Connections

This chapter connects general inference and fairness to the broader landscape of probability and machine learning.

From This Chapter	Connects To
Rejection sampling	Chapter 10 (Bayes Nets) — provides the graph structure that makes ancestral sampling possible
Prior & posterior	Chapter 12 (Beta distributions) — the Beta is the conjugate prior for Bernoulli, giving closed-form posteriors
Likelihood	Chapter 14 (MLE) — maximum likelihood estimation finds parameters by maximizing the likelihood function
Fairness definitions	Ethics in AI — the impossibility theorem drives active research in algorithmic fairness (Barocas & Hardt, NeurIPS 2017)
MCMC / Gibbs sampling	CS221, CS228 — more powerful inference algorithms for complex models with continuous variables
Gender Shades	FAccT conference — Buolamwini & Gebru (2018) launched a field of fairness auditing

The big picture: General inference is the computational engine of probabilistic AI. You specify a model (Bayes net), you ask a question (conditional probability), and the engine answers. Rejection sampling is the simplest engine. But with power comes responsibility: once you deploy these models to make decisions about people, fairness is not optional. The impossibility theorem guarantees that tradeoffs exist, and choosing which fairness criterion to prioritize is fundamentally a human decision, not a technical one.

What comes next: Chapter 12 introduces the Beta distribution, which gives us a beautiful closed-form way to update priors — no sampling needed. The Beta-Bernoulli model is the first "conjugate" pair you'll learn, and it connects directly to the prior-posterior framework we built here.

References:

[1] Buolamwini, J. & Gebru, T. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." FAccT, 2018.
[2] Barocas, S. & Hardt, M. "Fairness in Machine Learning." NeurIPS Tutorial, 2017.
[3] Pessach, D. & Shmueli, E. "A Review on Fairness in Machine Learning." ACM Computing Surveys, 2022.
[4] Piech, C. "Probability for Computer Scientists." Stanford CS109, 2023.

Check: The impossibility theorem tells us that choosing a fairness criterion is fundamentally a ___ decision.

Technical Human / ethical Statistical