The Complete Beginner's Path

Understand Bayesian
Networks

The framework that lets machines reason about cause and effect under uncertainty — from medical diagnosis to spam filtering to gene regulatory networks.

Prerequisites: Basic probability + Intuition for graphs. That's it.
10
Chapters
7+
Simulations
0
Assumed Knowledge

Chapter 0: Variables with Dependencies

It's raining outside. The grass is wet. Are these facts independent? Obviously not — rain causes wet grass. But the sprinkler could also cause wet grass. And maybe the season affects both rain and sprinkler usage. Events in the real world form a tangled web of dependencies.

A Bayesian network represents this web compactly. Instead of specifying a probability for every possible combination of events (which grows exponentially), we only specify local dependencies — how each variable relates to its direct causes.

The core idea: Don't specify the full joint distribution. Instead, factor it into small, local conditional probability tables. The graph structure tells you which variables depend on which. Everything else follows from the chain rule of probability.
The Rain-Sprinkler-Grass Network

Click nodes to toggle their state (true/false). Watch how evidence propagates through the network.

Check: Why don't we just list all combinations of variable states?

Chapter 1: DAGs — Directed Acyclic Graphs

A Bayesian network is a DAG: a Directed Acyclic Graph. Each node is a random variable. Each directed edge (arrow) means "this variable directly influences that one." The "acyclic" part means no loops — you can't follow arrows and end up where you started.

The graph encodes the factorization of the joint probability. For variables X1, ..., Xn with parents Pa(Xi):

P(X1, ..., Xn) = Πi P(Xi | Pa(Xi))

This factorization is why Bayes nets are efficient. Instead of one giant table with 2n entries, we have n small tables, each conditioned on only a few parents.

Savings in numbers: For the 4-node Cloudy/Rain/Sprinkler/Wet network, the full joint has 2&sup4; = 16 entries. The factored form needs: P(Cloudy) = 2, P(Rain|Cloudy) = 4, P(Sprinkler|Cloudy) = 4, P(Wet|Rain,Sprinkler) = 8. Total: 18. Not much savings here. But scale to 20 binary variables, each with at most 3 parents: factored = ~20 × 2³ = 160 parameters. Full joint = 220 = 1,048,576. That's a 6,500x compression. The graph structure is the compression.
DAG Terminology

Hover over nodes to see their parents, children, and Markov blanket highlighted.

Markov blanket: A node is conditionally independent of ALL other nodes given its Markov blanket (parents + children + children's other parents). This is the minimal set you need to predict a node.
🔨 Derivation Joint Distribution Factorization from DAG Structure ✓ ATTEMPTED

We stated that for a Bayesian network with variables X1, ..., Xn arranged in topological order:

P(X1, ..., Xn) = Πi=1n P(Xi | Pa(Xi))

Your task: Derive this factorization using only the chain rule of probability and the conditional independence assumptions encoded by the DAG. Show why removing edges (claiming independence) lets us drop variables from each conditional.

The chain rule (no assumptions) says: P(X1, ..., Xn) = P(X1) · P(X2|X1) · P(X3|X1,X2) · ... · P(Xn|X1,...,Xn-1). This is always true — it's not an approximation. The question is: can we simplify each term?
In a topological ordering, each Xi only depends on its parents, not all predecessors. If there's no edge from Xj to Xi, the DAG asserts Xi ⊥ Xj | Pa(Xi). So P(Xi | X1,...,Xi-1) simplifies to P(Xi | Pa(Xi)) because all non-parents are conditionally irrelevant.
In topological order, all parents of Xi appear before Xi. So Pa(Xi) ⊆ {X1, ..., Xi-1}. This means the conditional independence claim "given all predecessors, only parents matter" is well-defined — the parents are already in the conditioning set.

Full derivation:

Step 1. Chain rule (always valid):
P(X1,...,Xn) = Πi P(Xi | X1,...,Xi-1)

Step 2. Order variables topologically. For each Xi, the set {X1,...,Xi-1} contains all ancestors and some non-ancestors.

Step 3. The DAG encodes the local Markov property: each node is conditionally independent of its non-descendants given its parents. Since in topological order all predecessors are either parents or non-descendants:
P(Xi | X1,...,Xi-1) = P(Xi | Pa(Xi))

Step 4. Substitute into the chain rule:
P(X1,...,Xn) = Πi P(Xi | Pa(Xi))   ■

The key insight: The factorization isn't an approximation — it's exact IF the independence assumptions are true. Every missing edge is an independence claim. More missing edges = smaller CPTs = more compression. But claim a false independence and your model is wrong.

Check: What does the "acyclic" constraint mean?

Chapter 2: Conditional Probability Tables

Each node stores a CPT (Conditional Probability Table). For a root node (no parents), it's just a prior: P(Rain) = 0.2. For a node with parents, it specifies the probability for each combination of parent states: P(Wet Grass | Rain=T, Sprinkler=T) = 0.99.

The CPT is where domain knowledge lives. A doctor building a medical Bayes net fills in CPTs like P(Cough | Flu=yes, Smoker=yes) = 0.85 based on clinical experience or data.

Interactive CPT Explorer

Click nodes to see their CPT. Adjust probabilities with the sliders below.

Click a node to view its CPT
RainSprinklerP(Wet=T)
FF0.01
FT0.90
TF0.80
TT0.99
Size matters: A node with k binary parents needs a CPT with 2k rows. This is why Bayes nets work best when each node has few parents. The graph structure keeps CPTs small.
CPTs as arrays, concretely: For our 3-node Rain/Sprinkler/Wet network, the CPTs are just small arrays. P(Rain) is [2] = [0.2, 0.8]. P(Sprinkler|Rain) is [2,2] = [[0.4, 0.6], [0.01, 0.99]]. P(Wet|Sprinkler,Rain) is [2,2,2] = 8 entries. Total parameters: 2 + 4 + 8 = 14. Without the network, the full joint over 3 binary variables requires 2³ = 8 entries. Here the savings are modest, but with 10 variables (each with ≤2 parents) you store ~60 parameters instead of 210 = 1,024. With 50 variables, it's thousands vs. 250 ≈ 1015.
python
import numpy as np

# CPTs are just arrays indexed by parent states
P_rain = np.array([0.2, 0.8])           # [T, F]
P_sprinkler_given_rain = np.array([       # [rain_state, sprinkler_state]
    [0.4, 0.6],    # Rain=T: 40% sprinkler on
    [0.01, 0.99]   # Rain=F: 1% sprinkler on
])
P_wet_given_sr = np.array([              # [sprinkler, rain, wet_state]
    [[0.99, 0.01], [0.80, 0.20]],  # S=T
    [[0.90, 0.10], [0.01, 0.99]]   # S=F
])

# Total params: 2 + 4 + 8 = 14
# Full joint would be 2^3 = 8 (modest savings)
# With 20 sparse variables: ~100 vs 2^20 = 1,048,576
Check: How many rows does a CPT have for a binary node with 3 binary parents?

Chapter 3: Conditional Independence

The whole point of a Bayes net is to encode conditional independencies. Two variables X and Y are conditionally independent given Z (written X ⊥ Y | Z) if knowing Z makes X irrelevant for predicting Y.

Example: Does knowing the season help you predict if the grass is wet? Yes. But if you already know whether it rained AND whether the sprinkler was on, then season adds nothing. Grass is conditionally independent of Season given {Rain, Sprinkler}.

Why it matters: Conditional independence is what makes inference tractable. Without it, we'd need to consider all variables simultaneously. With it, we can reason locally and propagate information through the graph.
Independence Tester

Select two variables and a conditioning set. The display shows whether they're conditionally independent in this network.

X ⊥ Y | Z  ⇔  P(X | Y, Z) = P(X | Z)

In practice, this means conditional independence lets you drop variables from computations. If you know Rain and Sprinkler, you don't need to ask about Season to predict Wet Grass. Every dropped variable halves (for binary) the size of the table you need to compute. With 5 irrelevant variables dropped, that's a 2&sup5; = 32x speedup.

Check: X ⊥ Y | Z means...
Checkpoint — Before you move on
In your own words: why does conditional independence make Bayesian networks computationally tractable? Give a concrete example showing how knowing certain variables lets you ignore others.
✓ Gate cleared
Model Answer

Without conditional independence, computing P(Xi | all other variables) requires a table exponential in the number of variables. Conditional independence lets us replace "all other variables" with just the parents — typically 2-3 nodes. Concrete example: In a 20-node network where each node has at most 3 parents, a CPT has at most 2³ = 8 rows. Without independence, you'd need one entry per combination of all other 19 variables: 219 = 524,288 entries. That's a 65,000x reduction per node. Multiply across 20 nodes and the full joint's 220 = 1M entries compress to ~160 parameters total.

The key: conditional independence isn't just "nice to have" — it's the entire reason graphical models exist. The graph structure IS the set of independence claims.

Chapter 4: D-Separation

How do you read off conditional independencies from the graph structure? The answer is d-separation. It's a purely graphical criterion: check if information can flow between two nodes along a path, given what you've observed.

There are three fundamental structures. Each behaves differently when the middle node is observed vs. unobserved:

Chain: A → B → C
B blocks the path when observed. (Knowing the mediator screens off the cause from the effect.)
Fork: A ← B → C
B blocks when observed. (Knowing the common cause makes the effects independent.)
Collider: A → B ← C
B blocks when NOT observed. Observing B OPENS the path! (Explaining away.)
D-Separation Visualizer

Click the middle node to toggle observing it. Watch how information flow (green = flows, red = blocked) changes for each structure.

The collider is the tricky one. Unobserved, it blocks flow (the causes are independent). But once you observe the collider or any of its descendants, it opens the path. This is called "explaining away" — if you know the grass is wet and it didn't rain, the sprinkler probably was on.
D-separation as an algorithm. To check if X ⊥ Y | Z, examine every undirected path between X and Y. For each path, check every intermediate node B:
Chain (A→B→C) or Fork (A←B→C): B blocks the path if B ∈ Z.
Collider (A→B←C): B blocks the path if B ∉ Z and no descendant of B is in Z.
If every path is blocked, X and Y are d-separated (independent) given Z. If even one path is open, they are dependent.
Check: In A → B ← C (collider), when does observing B affect A-C independence?
🔨 Derivation Prove Why Observing a Collider Opens the Path ✓ ATTEMPTED

Consider the collider structure: A → C ← B, where A and B are independent causes of C. Let A = "earthquake", B = "burglar", C = "alarm sounds".

P(A) = 0.01, P(B) = 0.05, P(C=T|A,B) = 0.99, P(C=T|A,¬B) = 0.95, P(C=T|¬A,B) = 0.90, P(C=T|¬A,¬B) = 0.01.

Your task: Compute P(A|C=T) and P(A|C=T, B=T). Show numerically that learning B=true decreases the probability of A — demonstrating explaining away.

Use Bayes' rule: P(A=T|C=T) = P(C=T|A=T) · P(A=T) / P(C=T). You need P(C=T|A=T) = sum over B: P(C=T|A=T,B) · P(B). And P(C=T) = sum over A,B: P(C=T|A,B) · P(A) · P(B).
P(C=T|A=T) = P(C=T|A=T,B=T)·P(B=T) + P(C=T|A=T,B=F)·P(B=F) = 0.99·0.05 + 0.95·0.95 = 0.0495 + 0.9025 = 0.952.
P(C=T) = P(C=T|A=T)·P(A=T) + P(C=T|A=F)·P(A=F). You already have P(C=T|A=T)=0.952. For P(C=T|A=F): 0.90·0.05 + 0.01·0.95 = 0.045 + 0.0095 = 0.0545. So P(C=T) = 0.952·0.01 + 0.0545·0.99 = 0.00952 + 0.05395 = 0.06347.

Part 1: P(A=T | C=T)

P(A=T|C=T) = P(C=T|A=T)·P(A=T) / P(C=T) = 0.952 · 0.01 / 0.06347 = 0.150

The alarm went off, so earthquake probability jumps from 1% to 15%. Makes sense — something caused the alarm.

Part 2: P(A=T | C=T, B=T)

Now we know a burglar IS present. P(A=T|C=T,B=T) = P(C=T|A=T,B=T)·P(A=T) / P(C=T|B=T)
P(C=T|B=T) = P(C=T|A=T,B=T)·P(A=T) + P(C=T|A=F,B=T)·P(A=F) = 0.99·0.01 + 0.90·0.99 = 0.0099 + 0.891 = 0.9009
P(A=T|C=T,B=T) = 0.99·0.01 / 0.9009 = 0.011

The key insight: P(A|C=T) = 0.150, but P(A|C=T,B=T) = 0.011. Learning that a burglar triggered the alarm "explains away" the earthquake — we no longer need the earthquake hypothesis. This is why colliders open paths: once you observe the effect, its causes become coupled through competition for explanatory credit.

⚔ Adversarial: The Explaining Away Trap
A patient has symptom S. Two diseases D1 and D2 can independently cause S. You observe S=true. You then run a test and confirm D1=true. What happens to P(D2 | S=true, D1=true) compared to P(D2 | S=true)?
💥 Break-It Lab What Dies When You Add/Remove Edges? ✓ ATTEMPTED
A 4-node network: Season → Rain, Season → Sprinkler, Rain → Wet, Sprinkler → Wet. Toggle edges to see how independence structure breaks.
Add edge: Rain → Sprinkler OFF
Failure mode: This destroys the conditional independence Rain ⊥ Sprinkler | Season. Now even knowing the season, rain directly affects sprinkler — which is causally wrong. The network over-counts rain's influence on wet grass (once directly through Rain→Wet, and again through Rain→Sprinkler→Wet).
Remove edge: Season → Rain OFF
Failure mode: Rain is now a root node (no parents). Its probability can't depend on season. In winter, the model predicts the same rain probability as summer — missing a key dependency. The CPT P(Rain) becomes a single fixed number instead of a seasonal conditional.
Observe Wet (opens collider path) OFF
Failure mode: Wet is a collider for Rain and Sprinkler. Observing it creates a spurious dependency between Rain and Sprinkler — even though they have no causal connection. This is explaining away in action: if the grass is wet and it didn't rain, the sprinkler must have been on.
Checkpoint — Before you move on
Explain in your own words: given the network A → B → C ← D, is A independent of D given B? Given C? Walk through the d-separation algorithm for each case.
✓ Gate cleared
Model Answer

Path: A → B → C ← D. There's only one path connecting A to D.

Given B: At node B, the structure is a chain (A→B→C). Observing B blocks this chain segment. The path is blocked. A ⊥ D | B. ✓

Given C: At node B, the structure is a chain (A→B→C) — B is NOT observed, so this segment is open. At node C, the structure is a collider (B→C←D). Observing C OPENS this collider. Both segments are open, so the path is active. A is NOT independent of D given C. ✗

This is the critical insight: observing a collider's child creates a dependency between causes that were previously independent. You MUST understand this before inference, because variable elimination exploits these independence structures to determine which factors interact.

Chapter 5: Exact Inference — Variable Elimination

Given evidence (some variables observed), we want to compute the posterior probability of a query variable. The brute-force way: sum over all combinations of unobserved variables. But that's exponential. Variable elimination does it smarter by summing out variables one at a time, exploiting the factored structure.

1. Choose elimination order
Pick a variable to sum out
2. Multiply relevant factors
Combine all CPTs involving this variable
3. Sum out the variable
Marginalize: ΣX f(X, ...)
4. Repeat until only query remains
Normalize to get P(query | evidence)
Variable Elimination Step-Through

Query: P(Rain | Wet=true). Click "Next Step" to eliminate variables one at a time. Watch the factors combine and shrink.

Step 0: Initial factors
The elimination order matters! A bad order can create huge intermediate factors. Finding the optimal order is NP-hard, but good heuristics (min-degree, min-fill) work well in practice.

Let's trace through with actual numbers. Query: P(Rain=T | Wet=T). Our factors are P(R), P(S|R), P(W|R,S). We set W=T and eliminate S first, then normalize.

python
# P(Rain=T | Wet=T) — worked example

# Step 1: Set evidence W=T in P(W|R,S)
#   f(R=T,S=T) = P(W=T|R=T,S=T) = 0.99
#   f(R=T,S=F) = P(W=T|R=T,S=F) = 0.80
#   f(R=F,S=T) = P(W=T|R=F,S=T) = 0.90
#   f(R=F,S=F) = P(W=T|R=F,S=F) = 0.01

# Step 2: Eliminate S — sum over S in P(S|R)*f(R,S)
#   g(R=T) = P(S=T|R=T)*f(T,T) + P(S=F|R=T)*f(T,F)
#          = 0.40*0.99 + 0.60*0.80 = 0.396 + 0.480 = 0.876
#   g(R=F) = P(S=T|R=F)*f(F,T) + P(S=F|R=F)*f(F,F)
#          = 0.01*0.90 + 0.99*0.01 = 0.009 + 0.0099 = 0.0189

# Step 3: Multiply by P(R) and normalize
#   h(R=T) = P(R=T)*g(R=T) = 0.2 * 0.876  = 0.1752
#   h(R=F) = P(R=F)*g(R=F) = 0.8 * 0.0189 = 0.01512
#   P(R=T|W=T) = 0.1752 / (0.1752+0.01512) = 0.920
The payoff: Instead of summing over all 2³ = 8 entries of the joint, we summed over 2 values of S, then 2 values of R. The intermediate factor g(R) had only 2 entries. For larger networks, this difference is the gap between seconds and years of computation.
Check: What does variable elimination do differently from brute-force summation?
🔨 Derivation Variable Elimination Complexity — Why Order Matters ✓ ATTEMPTED

Consider a chain A → B → C → D → E with all binary variables. You want P(E | A=true).

Your task: (1) Count the operations needed to eliminate variables in order B, C, D. (2) Now try the "bad" order D, C, B. Show why, for a chain, both orders give the same cost — but explain intuitively why for a different graph structure (like a tree with branching factor 3), elimination order dramatically matters.

When you eliminate B, you multiply P(B|A) [size 2×2] with P(C|B) [size 2×2] and sum over B. The result is a factor f(A,C) of size 2×2 = 4 entries. Each entry requires 2 multiplications and 1 addition (summing over B's 2 states). Cost: 4 entries × 3 ops = 12 operations.
In a chain, each variable connects to at most 2 others. When you eliminate a variable, you combine its two adjacent factors into one factor over the neighbors. The resulting factor always has scope 2 (the two neighbors). So intermediate factors never grow larger than 2×2 = 4 entries. The key: maximum factor width = treewidth + 1 = 2 for a chain.
If you eliminate a leaf first, it just sends a message to its parent (factor size = 2). Good. But if you eliminate the ROOT first, you must combine it with all k children simultaneously, creating a factor of size 2k. For k=10 children, that's 1024 entries in one intermediate factor. The lesson: eliminate leaves before roots. This is the min-degree heuristic.

Chain A→B→C→D→E, query P(E|A=T):

Good order (B, C, D):
1. Eliminate B: combine P(B|A=T)[2] × P(C|B)[2×2], sum over B → f(C)[2]. Cost: 4 mult + 2 add = 6 ops.
2. Eliminate C: combine f(C)[2] × P(D|C)[2×2], sum over C → g(D)[2]. Cost: 4 mult + 2 add = 6 ops.
3. Eliminate D: combine g(D)[2] × P(E|D)[2×2], sum over D → h(E)[2]. Cost: 4 mult + 2 add = 6 ops.
Total: 18 operations. Max intermediate factor size: 4.

Reverse order (D, C, B):
1. Eliminate D: combine P(D|C)[2×2] × P(E|D)[2×2], sum over D → f(C,E)[2×2]. Cost: 8 mult + 4 add = 12.
2. Eliminate C: combine P(C|B)[2×2] × f(C,E)[2×2], sum over C → g(B,E)[2×2]. Cost: 8 mult + 4 add = 12.
3. Eliminate B: combine P(B|A=T)[2] × g(B,E)[2×2], sum over B → h(E)[2]. Cost: 4 mult + 2 add = 6.
Total: 30 operations. Max intermediate factor size: 4.

The key insight: For chains, both orders produce factors of size ≤ 4 — the difference is modest. But for a tree with branching factor k=10: eliminating the root first creates a 210=1024-entry factor, while eliminating leaves first never exceeds 4 entries. The max intermediate factor size = 2(treewidth+1), and elimination order determines the effective treewidth. Finding the optimal order is NP-hard in general.

💻 Build It Implement Variable Elimination from Scratch ✓ ATTEMPTED
Implement variable elimination for binary Bayesian networks. Given a network (dict of CPTs), evidence (observed variables), query variable, and elimination order — compute the posterior distribution of the query variable.
signature def variable_elimination(network, evidence, query, elim_order): """ Args: network: dict mapping node_name -> { 'parents': [parent_names], 'cpt': nested list indexed by [parent1_state][parent2_state]...[self_state] } evidence: dict mapping node_name -> 0 or 1 query: str, name of query variable elim_order: list of variable names to eliminate (excludes query and evidence) Returns: [P(query=0), P(query=1)] -- normalized posterior """
Test case
Rain/Sprinkler/Wet network. Query: P(Rain | Wet=T), elim_order=['Sprinkler'].
Expected: [P(Rain=T|Wet=T), P(Rain=F|Wet=T)] ≈ [0.708, 0.292] (with P(Rain)=0.2, P(Sprinkler)=0.4).
A factor is (scope, table) where scope is a list of variable names and table is a flat dict mapping tuples of assignments to values. E.g., factor P(W|R,S) with evidence W=T becomes scope=['R','S'], table={(0,0):0.99, (0,1):0.80, (1,0):0.90, (1,1):0.01}. To multiply two factors: take the union of their scopes, iterate over all joint assignments, multiply matching entries.
python
from itertools import product

def variable_elimination(network, evidence, query, elim_order):
    # Step 1: Build initial factors from CPTs
    factors = []
    for node, info in network.items():
        scope = info['parents'] + [node]
        cpt = info['cpt']
        table = {}
        for assignment in product([0,1], repeat=len(scope)):
            # Index into nested CPT
            val = cpt
            for idx in assignment:
                val = val[idx]
            table[assignment] = val
        factors.append((scope, table))

    # Step 2: Restrict evidence
    for i, (scope, table) in enumerate(factors):
        new_scope = [v for v in scope if v not in evidence]
        if new_scope == scope:
            continue
        new_table = {}
        for assignment in product([0,1], repeat=len(new_scope)):
            full = list(assignment)
            for j, v in enumerate(scope):
                if v in evidence:
                    full.insert(j, evidence[v])
            new_table[assignment] = table[tuple(full)]
        factors[i] = (new_scope, new_table)

    # Step 3: Eliminate variables one by one
    for var in elim_order:
        # Collect factors mentioning var
        relevant = [(s,t) for s,t in factors if var in s]
        remaining = [(s,t) for s,t in factors if var not in s]

        # Multiply all relevant factors
        combined_scope = list(dict.fromkeys(
            v for s,_ in relevant for v in s))
        combined = {}
        for assignment in product([0,1], repeat=len(combined_scope)):
            amap = dict(zip(combined_scope, assignment))
            val = 1.0
            for s, t in relevant:
                key = tuple(amap[v] for v in s)
                val *= t[key]
            combined[assignment] = val

        # Sum out var
        var_idx = combined_scope.index(var)
        new_scope = [v for v in combined_scope if v != var]
        new_table = {}
        for assignment in product([0,1], repeat=len(new_scope)):
            total = 0.0
            for val in [0, 1]:
                full = list(assignment)
                full.insert(var_idx, val)
                total += combined[tuple(full)]
            new_table[assignment] = total
        remaining.append((new_scope, new_table))
        factors = remaining

    # Step 4: Multiply remaining, normalize
    result = [0.0, 0.0]
    for qval in [0, 1]:
        p = 1.0
        for scope, table in factors:
            key = tuple(qval if v == query else evidence[v]
                        for v in scope)
            p *= table.get(key, table.get((qval,), 1.0))
        result[qval] = p
    total = sum(result)
    return [r/total for r in result]
Bonus challenge: Modify to return the optimal elimination order using the min-degree heuristic (always eliminate the variable that creates the smallest intermediate factor).

Chapter 6: Belief Propagation

For tree-structured networks (no undirected loops), there's an elegant alternative: belief propagation (the sum-product algorithm). Each node sends messages to its neighbors summarizing what it knows. After two passes (leaves to root, root to leaves), every node has the exact marginal posterior.

A message from node i to node j says: "Based on everything I've seen on my side of the tree, here's what I believe about your state." It's local computation with global results.

mi→j(xj) = Σxi φ(xi, xj) · Πk≠j mk→i(xi)
Message Passing Animation

Watch messages flow through a tree. Click "Send Messages" to see each round of propagation. After two passes, all beliefs are exact.

Round 0: No messages sent
Loopy BP: For networks with loops, we can still run message passing — it's not guaranteed to converge or be exact, but in practice it often works well. This is the basis of many practical inference systems.

In implementation, each message is just an array of floats — one entry per state of the receiving node. For binary variables, every message is a [2] vector. A node computes its belief by multiplying all incoming messages element-wise and normalizing. For our 7-node tree, that's 12 messages (one per edge direction), each of size 2. Total computation: 24 multiplications + 7 normalizations. Compare that to the full joint: 2&sup7; = 128 entries.

Check: When is belief propagation exact?
🔗 Pattern Recognition
HMMs Are Just Chain-Structured Bayes Nets
This Lesson (Bayes Net)
Belief propagation on a tree: messages flow from leaves to root and back. Each message mi→j(xj) = Σxi φ(xi,xj) Πk mk→i(xi)
HMM (Forward-Backward)
Forward variable: αt(j) = Σi αt-1(i) · aij · bj(ot). Backward: βt(i) = Σj aij · bj(ot+1) · βt+1(j) → HMM Lesson

The HMM forward-backward algorithm IS belief propagation on a chain. The "forward" pass sends messages left-to-right (leaf → root). The "backward" pass sends right-to-left (root → leaves). The "message" at each step is exactly the α or β vector. The transition matrix A plays the role of φ, and the emission probability b plays the role of the local evidence. If you understand BP, you already understand forward-backward — it's the same algorithm on a specific graph topology.

Can you see the analogy further? What would "loopy BP" mean for an HMM? (Hint: think about HMMs with skip connections or factorial HMMs.)

Chapter 7: Interactive DAG Builder

Build your own Bayesian network! Click to add nodes, drag between nodes to add edges. Set CPTs and run inference queries. This is the full SLAM experience — but for probability.

DAG Builder
Click canvas to add nodes. Switch to edge mode to connect them.
Tips: Keep it simple — 3-5 nodes is enough to see interesting behavior. Try building the classic "Asia" network (visit to Asia, tuberculosis, smoking, lung cancer, bronchitis, X-ray, dyspnea).

Chapter 8: Learning Structure

So far we've assumed the graph structure is given. But what if you have data and need to discover the dependencies? This is structure learning — arguably the hardest problem in Bayesian networks.

There are two main approaches: score-based (search over possible graphs, score each one) and constraint-based (test conditional independencies in the data, build the graph from the results).

Score-Based (e.g., BIC, BDeu)
Search over DAGs, score each by how well it fits the data while penalizing complexity. Like model selection.
vs.
Constraint-Based (e.g., PC algorithm)
Run independence tests on data. If X ⊥ Y | Z, remove the edge. Determines skeleton, then orient edges.
Structure Learning Demo

Generate data from a hidden network. Click "Learn Structure" to watch the PC algorithm discover the DAG by testing conditional independencies.

Click Learn Step to begin
Caution: Data alone can't distinguish between some graph structures (Markov equivalence classes). A → B and A ← B encode the same independencies! You need interventional data or prior knowledge to resolve these ambiguities.
Check: Why can't we always uniquely learn the DAG from observational data?

Chapter 9: Connections & Beyond

Bayesian networks are everywhere. An HMM (Hidden Markov Model) is a special case — a chain-structured Bayes net where hidden states emit observations. A Kalman filter is a continuous Gaussian Bayes net with linear dynamics.

ModelStructureVariablesKey Algorithm
Bayes NetGeneral DAGDiscrete/continuousVariable elimination
HMMChainDiscrete hiddenForward-backward
Kalman FilterChainContinuous GaussianPredict-update
MRF/CRFUndirectedAnyBelief propagation
Causal ModelDAG + interventionsAnydo-calculus
Related microLessons: Bayes Filter is the recursive inference algorithm for chain-structured Bayes nets. Bayes Estimation covers the continuous-variable case. HMMs are a special case studied in detail.
Implementation: A Bayes net is just a DAG + CPTs. In code, that's a dictionary of nodes, each storing its parents and its conditional probability table:
python
# A Bayes net = DAG + CPTs. That's it.
network = {
    'Rain':      { 'parents': [],
                   'cpt': [0.2, 0.8] },
    'Sprinkler': { 'parents': ['Rain'],
                   'cpt': [[0.4, 0.6], [0.01, 0.99]] },
    'Wet':       { 'parents': ['Sprinkler', 'Rain'],
                   'cpt': [[[0.99,0.01],[0.8,0.2]],
                           [[0.9,0.1],[0.01,0.99]]] }
}
# Inference = message passing on this data structure
# To query P(Rain|Wet=T): variable elimination over this dict
What breaks at scale: Exact inference in general Bayes nets is NP-hard (for multiply-connected / loopy networks, the intermediate factors in variable elimination can grow exponentially). For networks with >50 densely connected nodes, exact methods are hopeless. Approximate methods step in: MCMC (sample from the posterior by random walks), variational inference (fit a tractable approximation), or loopy belief propagation (run message passing despite loops — not guaranteed to converge, but often works).

This computational wall is a key reason neural networks replaced Bayes nets for perception tasks. A Bayes net for image classification would need one node per pixel, with hand-designed dependencies between neighboring pixels — thousands of nodes, each with CPTs you'd somehow need to specify. A neural network learns these dependencies automatically via gradient descent, scaling to millions of parameters without anyone specifying a single conditional probability. The tradeoff: neural networks are black boxes, while Bayes nets give you interpretable, auditable reasoning.

🏗 Design Challenge You're the Architect: Bedside Diagnostic Bayesian Network ✓ ATTEMPTED
You're building a Bayesian network for a bedside diagnostic device in a rural clinic. The system must reason about 10 diseases from 30 observable symptoms (binary: present/absent). Doctors enter symptoms on a tablet and get ranked differential diagnoses. The device has limited compute (Raspberry Pi 4, 4GB RAM) and must return results in under 2 seconds.
Diseases
10 (binary: present/absent)
Symptoms
30 (binary observations)
Compute
Raspberry Pi 4, 4GB RAM
Latency
< 2 seconds for full posterior
Missing data
Typically 5-10 symptoms unobserved
Comorbidity
Patient may have 0-3 diseases simultaneously
1. Graph structure: Should diseases be root nodes with symptoms as children (naive Bayes style), or should you allow disease-disease edges (comorbidity) and symptom-symptom edges (symptom cascades)?
2. CPT representation: With 10 potential disease-parents per symptom, a full CPT would have 210 = 1024 entries per symptom. How do you make this tractable?
3. Inference algorithm: Given your graph structure choice, is exact inference feasible? If not, what approximate method fits the latency budget?
4. Missing symptoms: 5-10 symptoms will be unobserved. How does your inference handle marginalizing over missing data efficiently?

Real-world solution (QMR-DT, Internist-1/QMR):

Structure: Two-layer bipartite graph. Diseases as roots, symptoms as leaves. NO disease-disease edges (comorbidity handled by the model assuming multiple diseases can be present simultaneously). NO symptom-symptom edges (keeps treewidth low).

CPT compression via Noisy-OR: Instead of 2k entries per symptom, use the noisy-OR assumption: each disease independently has a probability of causing the symptom, and the symptom appears if ANY cause fires. P(S=F | D1,...,Dk) = Πi: Di=T (1 - qi). This reduces 1024 parameters to just 10 (one qi per disease). Total network parameters: 10 priors + 30×10 = 310 instead of 30×1024 = 30,720.

Inference: With noisy-OR and the bipartite structure, exact inference via variable elimination is feasible because the effective treewidth is bounded by the number of active diseases (typically 1-3). For the worst case (all 10 diseases), use likelihood weighting (a simple Monte Carlo method) with 1000 samples — takes ~50ms on a Pi 4.

Missing data: Unobserved symptoms are simply not set as evidence — variable elimination naturally marginalizes over them. This is one of Bayes nets' biggest advantages: missing data requires no special handling.

🔗 Pattern Recognition
From Correlation to Causation: Bayes Nets + do-calculus
Bayes Net (This Lesson)
P(Y | X=x) — "What is Y if we OBSERVE X=x?" Condition on evidence. Information flows through all active paths including colliders.
Causal Model (do-calculus)
P(Y | do(X=x)) — "What is Y if we INTERVENE to set X=x?" Cut all incoming edges to X, then compute. Blocks backdoor paths.

A Bayesian network encodes correlational structure: "these variables tend to co-occur." A causal model adds one crucial distinction: the difference between observing and intervening. Observing X=x means conditioning — all paths remain. Intervening (do(X=x)) means forcing X to a value, which severs X from its parents (you delete all incoming edges to X). Same graph, profoundly different semantics. Pearl's do-calculus provides rules for when P(Y|do(X)) can be computed from observational data alone — and it relies entirely on the d-separation machinery you just learned.

If you observe that people who carry lighters have higher lung cancer rates, the Bayes net says P(Cancer | Lighter=yes) > P(Cancer). But does GIVING someone a lighter cause cancer? What edges would you cut?

"Probability theory is nothing but common sense reduced to calculation."
— Pierre-Simon Laplace

You now understand the language of probabilistic reasoning with graphs. From medical diagnosis to robot perception, Bayesian networks turn uncertain evidence into principled decisions.