The framework that lets machines reason about cause and effect under uncertainty — from medical diagnosis to spam filtering to gene regulatory networks.
It's raining outside. The grass is wet. Are these facts independent? Obviously not — rain causes wet grass. But the sprinkler could also cause wet grass. And maybe the season affects both rain and sprinkler usage. Events in the real world form a tangled web of dependencies.
A Bayesian network represents this web compactly. Instead of specifying a probability for every possible combination of events (which grows exponentially), we only specify local dependencies — how each variable relates to its direct causes.
Click nodes to toggle their state (true/false). Watch how evidence propagates through the network.
A Bayesian network is a DAG: a Directed Acyclic Graph. Each node is a random variable. Each directed edge (arrow) means "this variable directly influences that one." The "acyclic" part means no loops — you can't follow arrows and end up where you started.
The graph encodes the factorization of the joint probability. For variables X1, ..., Xn with parents Pa(Xi):
This factorization is why Bayes nets are efficient. Instead of one giant table with 2n entries, we have n small tables, each conditioned on only a few parents.
Hover over nodes to see their parents, children, and Markov blanket highlighted.
We stated that for a Bayesian network with variables X1, ..., Xn arranged in topological order:
Your task: Derive this factorization using only the chain rule of probability and the conditional independence assumptions encoded by the DAG. Show why removing edges (claiming independence) lets us drop variables from each conditional.
Full derivation:
Step 1. Chain rule (always valid):
P(X1,...,Xn) = Πi P(Xi | X1,...,Xi-1)
Step 2. Order variables topologically. For each Xi, the set {X1,...,Xi-1} contains all ancestors and some non-ancestors.
Step 3. The DAG encodes the local Markov property: each node is conditionally independent of its non-descendants given its parents. Since in topological order all predecessors are either parents or non-descendants:
P(Xi | X1,...,Xi-1) = P(Xi | Pa(Xi))
Step 4. Substitute into the chain rule:
P(X1,...,Xn) = Πi P(Xi | Pa(Xi)) ■
The key insight: The factorization isn't an approximation — it's exact IF the independence assumptions are true. Every missing edge is an independence claim. More missing edges = smaller CPTs = more compression. But claim a false independence and your model is wrong.
Each node stores a CPT (Conditional Probability Table). For a root node (no parents), it's just a prior: P(Rain) = 0.2. For a node with parents, it specifies the probability for each combination of parent states: P(Wet Grass | Rain=T, Sprinkler=T) = 0.99.
The CPT is where domain knowledge lives. A doctor building a medical Bayes net fills in CPTs like P(Cough | Flu=yes, Smoker=yes) = 0.85 based on clinical experience or data.
Click nodes to see their CPT. Adjust probabilities with the sliders below.
| Rain | Sprinkler | P(Wet=T) |
|---|---|---|
| F | F | 0.01 |
| F | T | 0.90 |
| T | F | 0.80 |
| T | T | 0.99 |
[2] = [0.2, 0.8]. P(Sprinkler|Rain) is [2,2] = [[0.4, 0.6], [0.01, 0.99]]. P(Wet|Sprinkler,Rain) is [2,2,2] = 8 entries. Total parameters: 2 + 4 + 8 = 14. Without the network, the full joint over 3 binary variables requires 2³ = 8 entries. Here the savings are modest, but with 10 variables (each with ≤2 parents) you store ~60 parameters instead of 210 = 1,024. With 50 variables, it's thousands vs. 250 ≈ 1015.python import numpy as np # CPTs are just arrays indexed by parent states P_rain = np.array([0.2, 0.8]) # [T, F] P_sprinkler_given_rain = np.array([ # [rain_state, sprinkler_state] [0.4, 0.6], # Rain=T: 40% sprinkler on [0.01, 0.99] # Rain=F: 1% sprinkler on ]) P_wet_given_sr = np.array([ # [sprinkler, rain, wet_state] [[0.99, 0.01], [0.80, 0.20]], # S=T [[0.90, 0.10], [0.01, 0.99]] # S=F ]) # Total params: 2 + 4 + 8 = 14 # Full joint would be 2^3 = 8 (modest savings) # With 20 sparse variables: ~100 vs 2^20 = 1,048,576
The whole point of a Bayes net is to encode conditional independencies. Two variables X and Y are conditionally independent given Z (written X ⊥ Y | Z) if knowing Z makes X irrelevant for predicting Y.
Example: Does knowing the season help you predict if the grass is wet? Yes. But if you already know whether it rained AND whether the sprinkler was on, then season adds nothing. Grass is conditionally independent of Season given {Rain, Sprinkler}.
Select two variables and a conditioning set. The display shows whether they're conditionally independent in this network.
In practice, this means conditional independence lets you drop variables from computations. If you know Rain and Sprinkler, you don't need to ask about Season to predict Wet Grass. Every dropped variable halves (for binary) the size of the table you need to compute. With 5 irrelevant variables dropped, that's a 2&sup5; = 32x speedup.
Without conditional independence, computing P(Xi | all other variables) requires a table exponential in the number of variables. Conditional independence lets us replace "all other variables" with just the parents — typically 2-3 nodes. Concrete example: In a 20-node network where each node has at most 3 parents, a CPT has at most 2³ = 8 rows. Without independence, you'd need one entry per combination of all other 19 variables: 219 = 524,288 entries. That's a 65,000x reduction per node. Multiply across 20 nodes and the full joint's 220 = 1M entries compress to ~160 parameters total.
The key: conditional independence isn't just "nice to have" — it's the entire reason graphical models exist. The graph structure IS the set of independence claims.
How do you read off conditional independencies from the graph structure? The answer is d-separation. It's a purely graphical criterion: check if information can flow between two nodes along a path, given what you've observed.
There are three fundamental structures. Each behaves differently when the middle node is observed vs. unobserved:
Click the middle node to toggle observing it. Watch how information flow (green = flows, red = blocked) changes for each structure.
Consider the collider structure: A → C ← B, where A and B are independent causes of C. Let A = "earthquake", B = "burglar", C = "alarm sounds".
P(A) = 0.01, P(B) = 0.05, P(C=T|A,B) = 0.99, P(C=T|A,¬B) = 0.95, P(C=T|¬A,B) = 0.90, P(C=T|¬A,¬B) = 0.01.
Your task: Compute P(A|C=T) and P(A|C=T, B=T). Show numerically that learning B=true decreases the probability of A — demonstrating explaining away.
Part 1: P(A=T | C=T)
P(A=T|C=T) = P(C=T|A=T)·P(A=T) / P(C=T) = 0.952 · 0.01 / 0.06347 = 0.150
The alarm went off, so earthquake probability jumps from 1% to 15%. Makes sense — something caused the alarm.
Part 2: P(A=T | C=T, B=T)
Now we know a burglar IS present. P(A=T|C=T,B=T) = P(C=T|A=T,B=T)·P(A=T) / P(C=T|B=T)
P(C=T|B=T) = P(C=T|A=T,B=T)·P(A=T) + P(C=T|A=F,B=T)·P(A=F) = 0.99·0.01 + 0.90·0.99 = 0.0099 + 0.891 = 0.9009
P(A=T|C=T,B=T) = 0.99·0.01 / 0.9009 = 0.011
The key insight: P(A|C=T) = 0.150, but P(A|C=T,B=T) = 0.011. Learning that a burglar triggered the alarm "explains away" the earthquake — we no longer need the earthquake hypothesis. This is why colliders open paths: once you observe the effect, its causes become coupled through competition for explanatory credit.
Path: A → B → C ← D. There's only one path connecting A to D.
Given B: At node B, the structure is a chain (A→B→C). Observing B blocks this chain segment. The path is blocked. A ⊥ D | B. ✓
Given C: At node B, the structure is a chain (A→B→C) — B is NOT observed, so this segment is open. At node C, the structure is a collider (B→C←D). Observing C OPENS this collider. Both segments are open, so the path is active. A is NOT independent of D given C. ✗
This is the critical insight: observing a collider's child creates a dependency between causes that were previously independent. You MUST understand this before inference, because variable elimination exploits these independence structures to determine which factors interact.
Given evidence (some variables observed), we want to compute the posterior probability of a query variable. The brute-force way: sum over all combinations of unobserved variables. But that's exponential. Variable elimination does it smarter by summing out variables one at a time, exploiting the factored structure.
Query: P(Rain | Wet=true). Click "Next Step" to eliminate variables one at a time. Watch the factors combine and shrink.
Let's trace through with actual numbers. Query: P(Rain=T | Wet=T). Our factors are P(R), P(S|R), P(W|R,S). We set W=T and eliminate S first, then normalize.
python # P(Rain=T | Wet=T) — worked example # Step 1: Set evidence W=T in P(W|R,S) # f(R=T,S=T) = P(W=T|R=T,S=T) = 0.99 # f(R=T,S=F) = P(W=T|R=T,S=F) = 0.80 # f(R=F,S=T) = P(W=T|R=F,S=T) = 0.90 # f(R=F,S=F) = P(W=T|R=F,S=F) = 0.01 # Step 2: Eliminate S — sum over S in P(S|R)*f(R,S) # g(R=T) = P(S=T|R=T)*f(T,T) + P(S=F|R=T)*f(T,F) # = 0.40*0.99 + 0.60*0.80 = 0.396 + 0.480 = 0.876 # g(R=F) = P(S=T|R=F)*f(F,T) + P(S=F|R=F)*f(F,F) # = 0.01*0.90 + 0.99*0.01 = 0.009 + 0.0099 = 0.0189 # Step 3: Multiply by P(R) and normalize # h(R=T) = P(R=T)*g(R=T) = 0.2 * 0.876 = 0.1752 # h(R=F) = P(R=F)*g(R=F) = 0.8 * 0.0189 = 0.01512 # P(R=T|W=T) = 0.1752 / (0.1752+0.01512) = 0.920
Consider a chain A → B → C → D → E with all binary variables. You want P(E | A=true).
Your task: (1) Count the operations needed to eliminate variables in order B, C, D. (2) Now try the "bad" order D, C, B. Show why, for a chain, both orders give the same cost — but explain intuitively why for a different graph structure (like a tree with branching factor 3), elimination order dramatically matters.
Chain A→B→C→D→E, query P(E|A=T):
Good order (B, C, D):
1. Eliminate B: combine P(B|A=T)[2] × P(C|B)[2×2], sum over B → f(C)[2]. Cost: 4 mult + 2 add = 6 ops.
2. Eliminate C: combine f(C)[2] × P(D|C)[2×2], sum over C → g(D)[2]. Cost: 4 mult + 2 add = 6 ops.
3. Eliminate D: combine g(D)[2] × P(E|D)[2×2], sum over D → h(E)[2]. Cost: 4 mult + 2 add = 6 ops.
Total: 18 operations. Max intermediate factor size: 4.
Reverse order (D, C, B):
1. Eliminate D: combine P(D|C)[2×2] × P(E|D)[2×2], sum over D → f(C,E)[2×2]. Cost: 8 mult + 4 add = 12.
2. Eliminate C: combine P(C|B)[2×2] × f(C,E)[2×2], sum over C → g(B,E)[2×2]. Cost: 8 mult + 4 add = 12.
3. Eliminate B: combine P(B|A=T)[2] × g(B,E)[2×2], sum over B → h(E)[2]. Cost: 4 mult + 2 add = 6.
Total: 30 operations. Max intermediate factor size: 4.
The key insight: For chains, both orders produce factors of size ≤ 4 — the difference is modest. But for a tree with branching factor k=10: eliminating the root first creates a 210=1024-entry factor, while eliminating leaves first never exceeds 4 entries. The max intermediate factor size = 2(treewidth+1), and elimination order determines the effective treewidth. Finding the optimal order is NP-hard in general.
python from itertools import product def variable_elimination(network, evidence, query, elim_order): # Step 1: Build initial factors from CPTs factors = [] for node, info in network.items(): scope = info['parents'] + [node] cpt = info['cpt'] table = {} for assignment in product([0,1], repeat=len(scope)): # Index into nested CPT val = cpt for idx in assignment: val = val[idx] table[assignment] = val factors.append((scope, table)) # Step 2: Restrict evidence for i, (scope, table) in enumerate(factors): new_scope = [v for v in scope if v not in evidence] if new_scope == scope: continue new_table = {} for assignment in product([0,1], repeat=len(new_scope)): full = list(assignment) for j, v in enumerate(scope): if v in evidence: full.insert(j, evidence[v]) new_table[assignment] = table[tuple(full)] factors[i] = (new_scope, new_table) # Step 3: Eliminate variables one by one for var in elim_order: # Collect factors mentioning var relevant = [(s,t) for s,t in factors if var in s] remaining = [(s,t) for s,t in factors if var not in s] # Multiply all relevant factors combined_scope = list(dict.fromkeys( v for s,_ in relevant for v in s)) combined = {} for assignment in product([0,1], repeat=len(combined_scope)): amap = dict(zip(combined_scope, assignment)) val = 1.0 for s, t in relevant: key = tuple(amap[v] for v in s) val *= t[key] combined[assignment] = val # Sum out var var_idx = combined_scope.index(var) new_scope = [v for v in combined_scope if v != var] new_table = {} for assignment in product([0,1], repeat=len(new_scope)): total = 0.0 for val in [0, 1]: full = list(assignment) full.insert(var_idx, val) total += combined[tuple(full)] new_table[assignment] = total remaining.append((new_scope, new_table)) factors = remaining # Step 4: Multiply remaining, normalize result = [0.0, 0.0] for qval in [0, 1]: p = 1.0 for scope, table in factors: key = tuple(qval if v == query else evidence[v] for v in scope) p *= table.get(key, table.get((qval,), 1.0)) result[qval] = p total = sum(result) return [r/total for r in result]
For tree-structured networks (no undirected loops), there's an elegant alternative: belief propagation (the sum-product algorithm). Each node sends messages to its neighbors summarizing what it knows. After two passes (leaves to root, root to leaves), every node has the exact marginal posterior.
A message from node i to node j says: "Based on everything I've seen on my side of the tree, here's what I believe about your state." It's local computation with global results.
Watch messages flow through a tree. Click "Send Messages" to see each round of propagation. After two passes, all beliefs are exact.
In implementation, each message is just an array of floats — one entry per state of the receiving node. For binary variables, every message is a [2] vector. A node computes its belief by multiplying all incoming messages element-wise and normalizing. For our 7-node tree, that's 12 messages (one per edge direction), each of size 2. Total computation: 24 multiplications + 7 normalizations. Compare that to the full joint: 2&sup7; = 128 entries.
The HMM forward-backward algorithm IS belief propagation on a chain. The "forward" pass sends messages left-to-right (leaf → root). The "backward" pass sends right-to-left (root → leaves). The "message" at each step is exactly the α or β vector. The transition matrix A plays the role of φ, and the emission probability b plays the role of the local evidence. If you understand BP, you already understand forward-backward — it's the same algorithm on a specific graph topology.
Can you see the analogy further? What would "loopy BP" mean for an HMM? (Hint: think about HMMs with skip connections or factorial HMMs.)
Build your own Bayesian network! Click to add nodes, drag between nodes to add edges. Set CPTs and run inference queries. This is the full SLAM experience — but for probability.
So far we've assumed the graph structure is given. But what if you have data and need to discover the dependencies? This is structure learning — arguably the hardest problem in Bayesian networks.
There are two main approaches: score-based (search over possible graphs, score each one) and constraint-based (test conditional independencies in the data, build the graph from the results).
Generate data from a hidden network. Click "Learn Structure" to watch the PC algorithm discover the DAG by testing conditional independencies.
Bayesian networks are everywhere. An HMM (Hidden Markov Model) is a special case — a chain-structured Bayes net where hidden states emit observations. A Kalman filter is a continuous Gaussian Bayes net with linear dynamics.
| Model | Structure | Variables | Key Algorithm |
|---|---|---|---|
| Bayes Net | General DAG | Discrete/continuous | Variable elimination |
| HMM | Chain | Discrete hidden | Forward-backward |
| Kalman Filter | Chain | Continuous Gaussian | Predict-update |
| MRF/CRF | Undirected | Any | Belief propagation |
| Causal Model | DAG + interventions | Any | do-calculus |
python # A Bayes net = DAG + CPTs. That's it. network = { 'Rain': { 'parents': [], 'cpt': [0.2, 0.8] }, 'Sprinkler': { 'parents': ['Rain'], 'cpt': [[0.4, 0.6], [0.01, 0.99]] }, 'Wet': { 'parents': ['Sprinkler', 'Rain'], 'cpt': [[[0.99,0.01],[0.8,0.2]], [[0.9,0.1],[0.01,0.99]]] } } # Inference = message passing on this data structure # To query P(Rain|Wet=T): variable elimination over this dict
This computational wall is a key reason neural networks replaced Bayes nets for perception tasks. A Bayes net for image classification would need one node per pixel, with hand-designed dependencies between neighboring pixels — thousands of nodes, each with CPTs you'd somehow need to specify. A neural network learns these dependencies automatically via gradient descent, scaling to millions of parameters without anyone specifying a single conditional probability. The tradeoff: neural networks are black boxes, while Bayes nets give you interpretable, auditable reasoning.
Real-world solution (QMR-DT, Internist-1/QMR):
Structure: Two-layer bipartite graph. Diseases as roots, symptoms as leaves. NO disease-disease edges (comorbidity handled by the model assuming multiple diseases can be present simultaneously). NO symptom-symptom edges (keeps treewidth low).
CPT compression via Noisy-OR: Instead of 2k entries per symptom, use the noisy-OR assumption: each disease independently has a probability of causing the symptom, and the symptom appears if ANY cause fires. P(S=F | D1,...,Dk) = Πi: Di=T (1 - qi). This reduces 1024 parameters to just 10 (one qi per disease). Total network parameters: 10 priors + 30×10 = 310 instead of 30×1024 = 30,720.
Inference: With noisy-OR and the bipartite structure, exact inference via variable elimination is feasible because the effective treewidth is bounded by the number of active diseases (typically 1-3). For the worst case (all 10 diseases), use likelihood weighting (a simple Monte Carlo method) with 1000 samples — takes ~50ms on a Pi 4.
Missing data: Unobserved symptoms are simply not set as evidence — variable elimination naturally marginalizes over them. This is one of Bayes nets' biggest advantages: missing data requires no special handling.
A Bayesian network encodes correlational structure: "these variables tend to co-occur." A causal model adds one crucial distinction: the difference between observing and intervening. Observing X=x means conditioning — all paths remain. Intervening (do(X=x)) means forcing X to a value, which severs X from its parents (you delete all incoming edges to X). Same graph, profoundly different semantics. Pearl's do-calculus provides rules for when P(Y|do(X)) can be computed from observational data alone — and it relies entirely on the d-separation machinery you just learned.
If you observe that people who carry lighters have higher lung cancer rates, the Bayes net says P(Cancer | Lighter=yes) > P(Cancer). But does GIVING someone a lighter cause cancer? What edges would you cut?
You now understand the language of probabilistic reasoning with graphs. From medical diagnosis to robot perception, Bayesian networks turn uncertain evidence into principled decisions.