Answering any probability question on a Bayesian network — and asking whether those answers are fair.
You have built a Bayesian network with dozens of random variables — diseases, symptoms, demographics, lab results. The graph encodes the full joint distribution compactly. But now someone asks: "Given that the patient has a fever and is a university student, what is the probability they have the flu?" How do you actually compute that answer?
Analytic computation — writing out the conditional probability formula and summing over all assignments — is possible in theory but explodes combinatorially. With 20 binary variables, the joint table has 220 ≈ 1 million cells. With continuous variables, the sum becomes an integral over infinite values. We need a better way.
General inference is the art of answering arbitrary probability queries on a Bayesian network. The key insight: instead of computing exact probabilities, we can sample thousands of particles from the joint distribution and then count. This is called rejection sampling, and it is one of the great practical ideas in probabilistic AI.
But sampling is not the end of the story. Once you can answer probability questions, a deeper question emerges: are your answers fair? If an algorithm uses this network to decide who gets a loan or who gets flagged for extra screening, does it treat different demographic groups equitably?
This chapter covers two intertwined topics. First, we develop rejection sampling and understand its limits. Then we introduce three formal definitions of fairness — parity, calibration, and equality of odds — and discover a startling result: you cannot satisfy all three at once.
| Topic | Key Question |
|---|---|
| Rejection Sampling | How do I compute P(X|Y) on a large Bayes net? |
| Likelihood Weighting | What if the condition is too rare for rejection sampling? |
| Parity | Does the algorithm predict positive at the same rate for all groups? |
| Calibration | Is the algorithm equally accurate for all groups? |
| Equality of Odds | Among truly positive cases, does it catch them equally across groups? |
| Impossibility Theorem | Can any algorithm satisfy all three? |
Let's start with the simplest inference technique: rejection sampling. The idea is almost embarrassingly simple. You generate a huge number of particles — joint samples from all the variables in your Bayes net — and then answer questions by counting.
To generate one particle, sample each variable in topological order (parents before children). For each variable, sample its value from the conditional distribution given its parents' already-sampled values. This is called ancestral sampling.
Here is a concrete example using a Simple Disease Model with four variables: Uni (is a university student), Influenza, Fever, and Tired.
This produces one particle: [Uni=1, Flu=0, Fever=0, Tired=0]. Each particle's probability equals the product of the conditionals used to generate it: P(Uni=1) × P(Flu=0|Uni=1) × P(Fever=0|Flu=0) × P(Tired=0|Uni=1,Flu=0) = 0.8 × 0.8 × 0.95 × 0.2 = 0.1216. But we don't need to compute this — the beauty of sampling is that particles appear with frequency proportional to their probability automatically.
Repeat N = 100,000 times and you have a collection of particles that approximates the joint distribution.
Now the magic: to compute any probability, just count.
Where N(X = k) is the number of particles where variables X take values k. For conditional probabilities, use the definition:
In words: filter your particles to those where Y = b (the condition). Among those, count how many also have X = a. That fraction is your answer.
python def rejection_sampling(query, condition, N=100000): # Generate N particles from the Bayes net particles = [sample_particle() for _ in range(N)] # Reject particles inconsistent with condition consistent = [p for p in particles if matches(p, condition)] # Count those also matching query hits = sum(1 for p in consistent if matches(p, query)) return hits / len(consistent)
Before we see any evidence, we have a prior belief about the world. In our disease model, the prior probability of having the flu might be P(Flu = 1) = 0.2 for a university student. This is what we believe before taking any measurements.
After observing evidence — say, the patient has a fever of 101 — our belief updates. The updated belief is called the posterior: P(Flu = 1 | Fever = 101). This is what we believe after (post) incorporating evidence.
The posterior is almost always different from the prior, because evidence changes what we believe. The whole point of inference is computing this posterior.
In a Bayesian network, the prior for any variable is what you get by marginalizing the joint distribution over all other variables. The posterior is what you get after conditioning on observed evidence. Rejection sampling gives us both: the raw particle counts approximate the prior, and filtered counts approximate the posterior.
Mathematically, the update from prior to posterior follows Bayes' theorem:
Where P(H) is the prior, P(E|H) is the likelihood (how probable the evidence is if H is true), and P(E) is the marginal probability of the evidence (a normalizing constant). The posterior P(H|E) is the updated belief. Rejection sampling computes this ratio implicitly through counting.
Worked example: Suppose we run 10,000 particles through the Simple Disease Model and get:
| Flu | Fever | Count |
|---|---|---|
| 0 | 0 | 6,080 |
| 0 | 1 | 320 |
| 1 | 0 | 1,440 |
| 1 | 1 | 2,160 |
The prior on flu: P(Flu=1) = (1,440 + 2,160) / 10,000 = 0.36.
The posterior given fever: P(Flu=1 | Fever=1) = 2,160 / (320 + 2,160) = 2,160 / 2,480 ≈ 0.87.
Observing a fever dramatically increased our belief in the flu — from 36% to 87%. That is the power of conditioning.
There is a subtlety that becomes critical when the condition involves a continuous variable. Suppose we change Fever from binary to continuous, with a normal distribution:
The likelihood is the probability (or density) of the observed evidence given a hypothesis. If we observe Fever = 101, the likelihood under each hypothesis is:
The likelihood under Flu=1 is about 660,000 times larger than under Flu=0. A fever of 101 is vastly more consistent with having the flu than being healthy.
The rejection sampling problem: When Fever is continuous, no particle will have exactly Fever = 101. Every single particle gets rejected. Our denominator is zero — a divide-by-zero error.
One quick fix: bucket rounding. Round all sampled fever values to the nearest integer, then condition on the rounded value. This introduces approximation error but makes the method work.
A better fix is likelihood weighting: instead of rejecting particles, assign each particle a weight proportional to how consistent it is with the evidence. Particles that match the evidence well get high weight; particles that don't get low weight. This avoids waste and handles continuous conditioning.
There is also MCMC (Markov Chain Monte Carlo), specifically a variant called Gibbs sampling. Instead of sampling from the joint and rejecting, MCMC samples from the posterior directly by constructing a Markov chain whose stationary distribution is the posterior. Each sample depends on the previous one (hence "chain"), but after enough steps, you get samples from the correct distribution. Gibbs sampling is especially elegant: it updates one variable at a time, conditioned on the current values of all other variables. This requires knowing each variable's conditional distribution given its Markov blanket (parents, children, and co-parents). CS221 and CS228 cover MCMC in depth.
Worked example: We want P(Flu=1 | Fever=101). Using Bayes' theorem and our likelihoods:
With P(Flu=1) = 0.2 and P(Flu=0) = 0.8:
Numerator: 0.199 × 0.2 = 0.0398
Denominator: 0.0398 + 0.0000003 × 0.8 ≈ 0.0398
Result: P(Flu=1 | Fever=101) ≈ 0.9999. A fever of 101 makes flu nearly certain.
Let's put it all together and watch rejection sampling compute a posterior in real time. The simulation below models a simple disease network. You set the prior probability of flu and the evidence (fever observed or not), and the algorithm generates particles one by one, rejecting those inconsistent with your evidence.
Watch particles get sampled and either accepted (green) or rejected (red). The posterior estimate converges as more particles accumulate.
Worked example: Suppose we run 10,000 particles with P(Flu) = 0.2, P(Fever|Flu) = 0.6, P(Fever|Healthy) = 0.05. We condition on Fever = 1.
Expected particles with Fever=1: 0.2 × 0.6 + 0.8 × 0.05 = 0.12 + 0.04 = 0.16 → about 1,600 accepted.
Among those, fraction with Flu=1: 0.12 / 0.16 = 0.75.
So P(Flu=1 | Fever=1) ≈ 0.75. The prior was 0.2, the posterior is 0.75 — fever is strong evidence for flu.
Now that we can compute conditional probabilities, let's use them to ask an uncomfortable question: is our algorithm fair?
Consider a loan prediction algorithm. It outputs a binary guess G (approve or deny) for each applicant. Each person has a true outcome T (will they actually repay?) and belongs to a demographic group D. We have a joint probability table from historical predictions:
| D = 0 (majority group) | ||
|---|---|---|
| G = 0 | G = 1 | |
| T = 0 | 0.21 | 0.32 |
| T = 1 | 0.07 | 0.28 |
| D = 1 (minority group) | ||
|---|---|---|
| G = 0 | G = 1 | |
| T = 0 | 0.01 | 0.01 |
| T = 1 | 0.02 | 0.08 |
Each cell is P(D=i, G=j, T=k). All 8 cells sum to 1. Let's marginalize: P(D=0) = 0.21 + 0.32 + 0.07 + 0.28 = 0.88. P(D=1) = 0.01 + 0.01 + 0.02 + 0.08 = 0.12.
Testing parity:
P(G=1 | D=1) = (0.01 + 0.08) / 0.12 = 0.09 / 0.12 = 0.75
P(G=1 | D=0) = (0.32 + 0.28) / 0.88 = 0.60 / 0.88 ≈ 0.682
Since 0.75 ≠ 0.682, parity is violated. The algorithm approves group D=1 at a higher rate.
Testing calibration:
P(G=T | D=0) = P(G=0,T=0|D=0) + P(G=1,T=1|D=0) = (0.21 + 0.28) / 0.88 ≈ 0.557
P(G=T | D=1) = (0.01 + 0.08) / 0.12 = 0.75
Since 0.557 ≠ 0.75, calibration is violated. The algorithm is more accurate for the minority group.
Testing equality of odds:
P(G=1 | D=1, T=1) = 0.08 / (0.02 + 0.08) = 0.08 / 0.10 = 0.80
P(G=1 | D=0, T=1) = 0.28 / (0.07 + 0.28) = 0.28 / 0.35 = 0.80
Since 0.80 = 0.80, equality of odds is satisfied. Among truly deserving people, both groups are approved equally.
The three fairness definitions — parity, calibration, and equality of odds — sound like they should all hold simultaneously. A fair algorithm should approve at the same rate, be equally accurate, and catch truly deserving people equally. Shouldn't these be compatible?
They are not. This is the Impossibility Theorem of Machine Fairness: except in trivial cases, no algorithm can satisfy all three definitions at once.
Why does this happen? Consider the relationship between the three definitions. Parity concerns G and D unconditionally. Calibration concerns the joint (G, T) and D. Equality of odds conditions on T. Each definition "uses" the variables T, G, and D in a different way. When base rates differ between groups — as they almost always do in real data — satisfying one definition forces a violation of another.
The Gender Shades study: In 2018, Joy Buolamwini and Timnit Gebru examined facial recognition classifiers from IBM, Microsoft, and Face++. They measured accuracy (calibration) across demographic intersections:
| IBM Classifier | Accuracy |
|---|---|
| Lighter Men | 99.7% |
| Lighter Women | 92.9% |
| Darker Men | 88.0% |
| Darker Women | 65.3% |
The gap between the best-served group (lighter men: 99.7%) and the worst-served group (darker women: 65.3%) is 34.4 percentage points. This is a massive calibration failure. The algorithm is not equally accurate for different demographic groups.
The deeper lesson: these systems were trained on datasets dominated by lighter-skinned male faces. When you optimize for average accuracy, the majority group gets most of the weight, and minority intersections suffer disproportionately. Fairness requires intentional, demographic-aware evaluation.
Worked example: Suppose an algorithm has overall accuracy 90%. Group A (70% of users) has accuracy 95%. Group B (30% of users) has accuracy 78%. The weighted average: 0.7 × 95 + 0.3 × 78 = 66.5 + 23.4 = 89.9% ≈ 90%. The high overall number hides a 17-point gap. This is why average accuracy is a misleading metric when groups have different base rates or representation.
This interactive simulation brings together both topics. On the left, you see a live rejection sampling engine: set a prior and likelihood, and watch the posterior converge as particles accumulate. On the right, a fairness dashboard lets you set up a joint table and instantly see which fairness definitions hold.
Adjust the prior P(H) and the likelihoods P(E|H) and P(E|¬H). The posterior P(H|E) is computed via Bayes' theorem and visualized as a bar chart. The bottom chart shows how the rejection-sampling estimate converges.
This visualization computes all three fairness metrics from the Piech loan-prediction example. Bars show the metric values for each group; green means the definition is satisfied, red means violated.
Problem 1: Rejection sampling efficiency. You have a Bayes net with 5 binary variables. You want to compute P(X1=1 | X5=1), and P(X5=1) = 0.001. If you generate 1,000,000 particles, approximately how many will be accepted?
Problem 2: Fairness computation. Given this joint probability table for a binary classifier:
| D = 0 | ||
|---|---|---|
| G = 0 | G = 1 | |
| T = 0 | 0.30 | 0.10 |
| T = 1 | 0.05 | 0.15 |
| D = 1 | ||
|---|---|---|
| G = 0 | G = 1 | |
| T = 0 | 0.15 | 0.05 |
| T = 1 | 0.05 | 0.15 |
Problem 3: Bayes' theorem from counts. You run rejection sampling and get 50,000 accepted particles (condition: Fever=1). Of these, 38,000 have Flu=1 and 12,000 have Flu=0. What is the approximate posterior P(Flu=1 | Fever=1)?
Problem 4: Likelihood ratio. A healthy person has temperature N(98.6, 0.5). A sick person has temperature N(101.0, 1.5). You observe temperature 100.5. Which hypothesis does this favor, and by what factor?
This chapter connects general inference and fairness to the broader landscape of probability and machine learning.
| From This Chapter | Connects To |
|---|---|
| Rejection sampling | Chapter 10 (Bayes Nets) — provides the graph structure that makes ancestral sampling possible |
| Prior & posterior | Chapter 12 (Beta distributions) — the Beta is the conjugate prior for Bernoulli, giving closed-form posteriors |
| Likelihood | Chapter 14 (MLE) — maximum likelihood estimation finds parameters by maximizing the likelihood function |
| Fairness definitions | Ethics in AI — the impossibility theorem drives active research in algorithmic fairness (Barocas & Hardt, NeurIPS 2017) |
| MCMC / Gibbs sampling | CS221, CS228 — more powerful inference algorithms for complex models with continuous variables |
| Gender Shades | FAccT conference — Buolamwini & Gebru (2018) launched a field of fairness auditing |
What comes next: Chapter 12 introduces the Beta distribution, which gives us a beautiful closed-form way to update priors — no sampling needed. The Beta-Bernoulli model is the first "conjugate" pair you'll learn, and it connects directly to the prior-posterior framework we built here.
References:
[1] Buolamwini, J. & Gebru, T. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." FAccT, 2018.
[2] Barocas, S. & Hardt, M. "Fairness in Machine Learning." NeurIPS Tutorial, 2017.
[3] Pessach, D. & Shmueli, E. "A Review on Fairness in Machine Learning." ACM Computing Surveys, 2022.
[4] Piech, C. "Probability for Computer Scientists." Stanford CS109, 2023.