microBayesNet — Reasoning Under Uncertainty with Graphs

Chapter 1: DAGs — Directed Acyclic Graphs

A Bayesian network is a DAG: a Directed Acyclic Graph. Each node is a random variable. Each directed edge (arrow) means "this variable directly influences that one." The "acyclic" part means no loops — you can't follow arrows and end up where you started.

The graph encodes the factorization of the joint probability. For variables X₁, ..., X_n with parents Pa(X_i):

P(X₁, ..., X_n) = Π_i P(X_i | Pa(X_i))

This factorization is why Bayes nets are efficient. Instead of one giant table with 2ⁿ entries, we have n small tables, each conditioned on only a few parents.

Savings in numbers: For the 4-node Cloudy/Rain/Sprinkler/Wet network, the full joint has 2⁴ = 16 entries. The factored form needs: P(Cloudy) = 2, P(Rain|Cloudy) = 4, P(Sprinkler|Cloudy) = 4, P(Wet|Rain,Sprinkler) = 8. Total: 18. Not much savings here. But scale to 20 binary variables, each with at most 3 parents: factored = ~20 × 2³ = 160 parameters. Full joint = 2²⁰ = 1,048,576. That's a 6,500x compression. The graph structure is the compression.

DAG Terminology

Hover over nodes to see their parents, children, and Markov blanket highlighted.

Markov blanket: A node is conditionally independent of ALL other nodes given its Markov blanket (parents + children + children's other parents). This is the minimal set you need to predict a node.

🔨 Derivation Joint Distribution Factorization from DAG Structure ▶ ✓ ATTEMPTED

We stated that for a Bayesian network with variables X₁, ..., X_n arranged in topological order:

P(X₁, ..., X_n) = Π_i=1ⁿ P(X_i | Pa(X_i))

Your task: Derive this factorization using only the chain rule of probability and the conditional independence assumptions encoded by the DAG. Show why removing edges (claiming independence) lets us drop variables from each conditional.

The chain rule (no assumptions) says: P(X₁, ..., X_n) = P(X₁) · P(X₂|X₁) · P(X₃|X₁,X₂) · ... · P(X_n|X₁,...,X_n-1). This is always true — it's not an approximation. The question is: can we simplify each term?

In a topological ordering, each X_i only depends on its parents, not all predecessors. If there's no edge from X_j to X_i, the DAG asserts X_i ⊥ X_j | Pa(X_i). So P(X_i | X₁,...,X_i-1) simplifies to P(X_i | Pa(X_i)) because all non-parents are conditionally irrelevant.

In topological order, all parents of X_i appear before X_i. So Pa(X_i) ⊆ {X₁, ..., X_i-1}. This means the conditional independence claim "given all predecessors, only parents matter" is well-defined — the parents are already in the conditioning set.

Full derivation:

Step 1. Chain rule (always valid):
P(X₁,...,X_n) = Π_i P(X_i | X₁,...,X_i-1)

Step 2. Order variables topologically. For each X_i, the set {X₁,...,X_i-1} contains all ancestors and some non-ancestors.

Step 3. The DAG encodes the local Markov property: each node is conditionally independent of its non-descendants given its parents. Since in topological order all predecessors are either parents or non-descendants:
P(X_i | X₁,...,X_i-1) = P(X_i | Pa(X_i))

Step 4. Substitute into the chain rule:
P(X₁,...,X_n) = Π_i P(X_i | Pa(X_i)) ■

The key insight: The factorization isn't an approximation — it's exact IF the independence assumptions are true. Every missing edge is an independence claim. More missing edges = smaller CPTs = more compression. But claim a false independence and your model is wrong.

Check: What does the "acyclic" constraint mean?

Edges must alternate direction All nodes must have parents No directed cycles — you can't follow arrows in a loop

Chapter 2: Conditional Probability Tables

Each node stores a CPT (Conditional Probability Table). For a root node (no parents), it's just a prior: P(Rain) = 0.2. For a node with parents, it specifies the probability for each combination of parent states: P(Wet Grass | Rain=T, Sprinkler=T) = 0.99.

The CPT is where domain knowledge lives. A doctor building a medical Bayes net fills in CPTs like P(Cough | Flu=yes, Smoker=yes) = 0.85 based on clinical experience or data.

Interactive CPT Explorer

Click nodes to see their CPT. Adjust probabilities with the sliders below.

Click a node to view its CPT

Rain	Sprinkler	P(Wet=T)
F	F	0.01
F	T	0.90
T	F	0.80
T	T	0.99

Size matters: A node with k binary parents needs a CPT with 2^k rows. This is why Bayes nets work best when each node has few parents. The graph structure keeps CPTs small.

CPTs as arrays, concretely: For our 3-node Rain/Sprinkler/Wet network, the CPTs are just small arrays. P(Rain) is [2] = [0.2, 0.8]. P(Sprinkler|Rain) is [2,2] = [[0.4, 0.6], [0.01, 0.99]]. P(Wet|Sprinkler,Rain) is [2,2,2] = 8 entries. Total parameters: 2 + 4 + 8 = 14. Without the network, the full joint over 3 binary variables requires 2³ = 8 entries. Here the savings are modest, but with 10 variables (each with ≤2 parents) you store ~60 parameters instead of 2¹⁰ = 1,024. With 50 variables, it's thousands vs. 2⁵⁰ ≈ 10¹⁵.

python
import numpy as np

# CPTs are just arrays indexed by parent states
P_rain = np.array([0.2, 0.8])           # [T, F]
P_sprinkler_given_rain = np.array([       # [rain_state, sprinkler_state]
    [0.4, 0.6],    # Rain=T: 40% sprinkler on
    [0.01, 0.99]   # Rain=F: 1% sprinkler on
])
P_wet_given_sr = np.array([              # [sprinkler, rain, wet_state]
    [[0.99, 0.01], [0.80, 0.20]],  # S=T
    [[0.90, 0.10], [0.01, 0.99]]   # S=F
])

# Total params: 2 + 4 + 8 = 14
# Full joint would be 2^3 = 8 (modest savings)
# With 20 sparse variables: ~100 vs 2^20 = 1,048,576

Check: How many rows does a CPT have for a binary node with 3 binary parents?

3 6 8 (2³)

Chapter 3: Conditional Independence

The whole point of a Bayes net is to encode conditional independencies. Two variables X and Y are conditionally independent given Z (written X ⊥ Y | Z) if knowing Z makes X irrelevant for predicting Y.

Example: Does knowing the season help you predict if the grass is wet? Yes. But if you already know whether it rained AND whether the sprinkler was on, then season adds nothing. Grass is conditionally independent of Season given {Rain, Sprinkler}.

Why it matters: Conditional independence is what makes inference tractable. Without it, we'd need to consider all variables simultaneously. With it, we can reason locally and propagate information through the graph.

Independence Tester

Select two variables and a conditioning set. The display shows whether they're conditionally independent in this network.

X ⊥ Y | Z ⇔ P(X | Y, Z) = P(X | Z)

In practice, this means conditional independence lets you drop variables from computations. If you know Rain and Sprinkler, you don't need to ask about Season to predict Wet Grass. Every dropped variable halves (for binary) the size of the table you need to compute. With 5 irrelevant variables dropped, that's a 2⁵ = 32x speedup.

Check: X ⊥ Y | Z means...

Given Z, knowing Y tells you nothing new about X X and Y never occur together Z causes both X and Y

Chapter 4: D-Separation

How do you read off conditional independencies from the graph structure? The answer is d-separation. It's a purely graphical criterion: check if information can flow between two nodes along a path, given what you've observed.

There are three fundamental structures. Each behaves differently when the middle node is observed vs. unobserved:

Chain: A → B → C

B blocks the path when observed. (Knowing the mediator screens off the cause from the effect.)

↓

Fork: A ← B → C

B blocks when observed. (Knowing the common cause makes the effects independent.)

↓

Collider: A → B ← C

B blocks when NOT observed. Observing B OPENS the path! (Explaining away.)

D-Separation Visualizer

Click the middle node to toggle observing it. Watch how information flow (green = flows, red = blocked) changes for each structure.

The collider is the tricky one. Unobserved, it blocks flow (the causes are independent). But once you observe the collider or any of its descendants, it opens the path. This is called "explaining away" — if you know the grass is wet and it didn't rain, the sprinkler probably was on.

D-separation as an algorithm. To check if X ⊥ Y | Z, examine every undirected path between X and Y. For each path, check every intermediate node B:
• Chain (A→B→C) or Fork (A←B→C): B blocks the path if B ∈ Z.
• Collider (A→B←C): B blocks the path if B ∉ Z and no descendant of B is in Z.
If every path is blocked, X and Y are d-separated (independent) given Z. If even one path is open, they are dependent.

Check: In A → B ← C (collider), when does observing B affect A-C independence?

Observing B opens the path — A and C become dependent Observing B blocks the path It has no effect

🔨 Derivation Prove Why Observing a Collider Opens the Path ▶ ✓ ATTEMPTED

Consider the collider structure: A → C ← B, where A and B are independent causes of C. Let A = "earthquake", B = "burglar", C = "alarm sounds".

P(A) = 0.01, P(B) = 0.05, P(C=T|A,B) = 0.99, P(C=T|A,¬B) = 0.95, P(C=T|¬A,B) = 0.90, P(C=T|¬A,¬B) = 0.01.

Your task: Compute P(A|C=T) and P(A|C=T, B=T). Show numerically that learning B=true decreases the probability of A — demonstrating explaining away.

P(C=T|A=T) = P(C=T|A=T,B=T)·P(B=T) + P(C=T|A=T,B=F)·P(B=F) = 0.99·0.05 + 0.95·0.95 = 0.0495 + 0.9025 = 0.952.

P(C=T) = P(C=T|A=T)·P(A=T) + P(C=T|A=F)·P(A=F). You already have P(C=T|A=T)=0.952. For P(C=T|A=F): 0.90·0.05 + 0.01·0.95 = 0.045 + 0.0095 = 0.0545. So P(C=T) = 0.952·0.01 + 0.0545·0.99 = 0.00952 + 0.05395 = 0.06347.

Part 1: P(A=T | C=T)

P(A=T|C=T) = P(C=T|A=T)·P(A=T) / P(C=T) = 0.952 · 0.01 / 0.06347 = 0.150

The alarm went off, so earthquake probability jumps from 1% to 15%. Makes sense — something caused the alarm.

Part 2: P(A=T | C=T, B=T)

The key insight: P(A|C=T) = 0.150, but P(A|C=T,B=T) = 0.011. Learning that a burglar triggered the alarm "explains away" the earthquake — we no longer need the earthquake hypothesis. This is why colliders open paths: once you observe the effect, its causes become coupled through competition for explanatory credit.

⚔ Adversarial: The Explaining Away Trap

A patient has symptom S. Two diseases D1 and D2 can independently cause S. You observe S=true. You then run a test and confirm D1=true. What happens to P(D2 | S=true, D1=true) compared to P(D2 | S=true)?

P(D2) decreases — D1 already explains the symptom, so D2 is less needed P(D2) increases — diseases tend to co-occur P(D2) stays the same — D1 and D2 are independent

💥 Break-It Lab What Dies When You Add/Remove Edges? ▶ ✓ ATTEMPTED

A 4-node network: Season → Rain, Season → Sprinkler, Rain → Wet, Sprinkler → Wet. Toggle edges to see how independence structure breaks.

Add edge: Rain → Sprinkler OFF

Failure mode: This destroys the conditional independence Rain ⊥ Sprinkler | Season. Now even knowing the season, rain directly affects sprinkler — which is causally wrong. The network over-counts rain's influence on wet grass (once directly through Rain→Wet, and again through Rain→Sprinkler→Wet).

Remove edge: Season → Rain OFF

Failure mode: Rain is now a root node (no parents). Its probability can't depend on season. In winter, the model predicts the same rain probability as summer — missing a key dependency. The CPT P(Rain) becomes a single fixed number instead of a seasonal conditional.

Observe Wet (opens collider path) OFF

Failure mode: Wet is a collider for Rain and Sprinkler. Observing it creates a spurious dependency between Rain and Sprinkler — even though they have no causal connection. This is explaining away in action: if the grass is wet and it didn't rain, the sprinkler must have been on.

Chapter 5: Exact Inference — Variable Elimination

Given evidence (some variables observed), we want to compute the posterior probability of a query variable. The brute-force way: sum over all combinations of unobserved variables. But that's exponential. Variable elimination does it smarter by summing out variables one at a time, exploiting the factored structure.

1. Choose elimination order

Pick a variable to sum out

↓

2. Multiply relevant factors

Combine all CPTs involving this variable

↓

3. Sum out the variable

Marginalize: Σ_X f(X, ...)

↓

4. Repeat until only query remains

Normalize to get P(query | evidence)

Variable Elimination Step-Through

Query: P(Rain | Wet=true). Click "Next Step" to eliminate variables one at a time. Watch the factors combine and shrink.

Step 0: Initial factors

The elimination order matters! A bad order can create huge intermediate factors. Finding the optimal order is NP-hard, but good heuristics (min-degree, min-fill) work well in practice.

Let's trace through with actual numbers. Query: P(Rain=T | Wet=T). Our factors are P(R), P(S|R), P(W|R,S). We set W=T and eliminate S first, then normalize.

python
# P(Rain=T | Wet=T) — worked example

# Step 1: Set evidence W=T in P(W|R,S)
#   f(R=T,S=T) = P(W=T|R=T,S=T) = 0.99
#   f(R=T,S=F) = P(W=T|R=T,S=F) = 0.80
#   f(R=F,S=T) = P(W=T|R=F,S=T) = 0.90
#   f(R=F,S=F) = P(W=T|R=F,S=F) = 0.01

# Step 2: Eliminate S — sum over S in P(S|R)*f(R,S)
#   g(R=T) = P(S=T|R=T)*f(T,T) + P(S=F|R=T)*f(T,F)
#          = 0.40*0.99 + 0.60*0.80 = 0.396 + 0.480 = 0.876
#   g(R=F) = P(S=T|R=F)*f(F,T) + P(S=F|R=F)*f(F,F)
#          = 0.01*0.90 + 0.99*0.01 = 0.009 + 0.0099 = 0.0189

# Step 3: Multiply by P(R) and normalize
#   h(R=T) = P(R=T)*g(R=T) = 0.2 * 0.876  = 0.1752
#   h(R=F) = P(R=F)*g(R=F) = 0.8 * 0.0189 = 0.01512
#   P(R=T|W=T) = 0.1752 / (0.1752+0.01512) = 0.920

The payoff: Instead of summing over all 2³ = 8 entries of the joint, we summed over 2 values of S, then 2 values of R. The intermediate factor g(R) had only 2 entries. For larger networks, this difference is the gap between seconds and years of computation.

Check: What does variable elimination do differently from brute-force summation?

Sums out variables one at a time, keeping intermediate factors small Uses gradient descent Samples random values

🔨 Derivation Variable Elimination Complexity — Why Order Matters ▶ ✓ ATTEMPTED

Consider a chain A → B → C → D → E with all binary variables. You want P(E | A=true).

Your task: (1) Count the operations needed to eliminate variables in order B, C, D. (2) Now try the "bad" order D, C, B. Show why, for a chain, both orders give the same cost — but explain intuitively why for a different graph structure (like a tree with branching factor 3), elimination order dramatically matters.

When you eliminate B, you multiply P(B|A) [size 2×2] with P(C|B) [size 2×2] and sum over B. The result is a factor f(A,C) of size 2×2 = 4 entries. Each entry requires 2 multiplications and 1 addition (summing over B's 2 states). Cost: 4 entries × 3 ops = 12 operations.

In a chain, each variable connects to at most 2 others. When you eliminate a variable, you combine its two adjacent factors into one factor over the neighbors. The resulting factor always has scope 2 (the two neighbors). So intermediate factors never grow larger than 2×2 = 4 entries. The key: maximum factor width = treewidth + 1 = 2 for a chain.

If you eliminate a leaf first, it just sends a message to its parent (factor size = 2). Good. But if you eliminate the ROOT first, you must combine it with all k children simultaneously, creating a factor of size 2^k. For k=10 children, that's 1024 entries in one intermediate factor. The lesson: eliminate leaves before roots. This is the min-degree heuristic.

Chain A→B→C→D→E, query P(E|A=T):

Good order (B, C, D):
1. Eliminate B: combine P(B|A=T)[2] × P(C|B)[2×2], sum over B → f(C)[2]. Cost: 4 mult + 2 add = 6 ops.
2. Eliminate C: combine f(C)[2] × P(D|C)[2×2], sum over C → g(D)[2]. Cost: 4 mult + 2 add = 6 ops.
3. Eliminate D: combine g(D)[2] × P(E|D)[2×2], sum over D → h(E)[2]. Cost: 4 mult + 2 add = 6 ops.
Total: 18 operations. Max intermediate factor size: 4.

Reverse order (D, C, B):
1. Eliminate D: combine P(D|C)[2×2] × P(E|D)[2×2], sum over D → f(C,E)[2×2]. Cost: 8 mult + 4 add = 12.
2. Eliminate C: combine P(C|B)[2×2] × f(C,E)[2×2], sum over C → g(B,E)[2×2]. Cost: 8 mult + 4 add = 12.
3. Eliminate B: combine P(B|A=T)[2] × g(B,E)[2×2], sum over B → h(E)[2]. Cost: 4 mult + 2 add = 6.
Total: 30 operations. Max intermediate factor size: 4.

The key insight: For chains, both orders produce factors of size ≤ 4 — the difference is modest. But for a tree with branching factor k=10: eliminating the root first creates a 2¹⁰=1024-entry factor, while eliminating leaves first never exceeds 4 entries. The max intermediate factor size = 2^{(treewidth+1)}, and elimination order determines the effective treewidth. Finding the optimal order is NP-hard in general.

💻 Build It Implement Variable Elimination from Scratch ▶ ✓ ATTEMPTED

Implement variable elimination for binary Bayesian networks. Given a network (dict of CPTs), evidence (observed variables), query variable, and elimination order — compute the posterior distribution of the query variable.

signature def variable_elimination(network, evidence, query, elim_order): """ Args: network: dict mapping node_name -> { 'parents': [parent_names], 'cpt': nested list indexed by [parent1_state][parent2_state]...[self_state] } evidence: dict mapping node_name -> 0 or 1 query: str, name of query variable elim_order: list of variable names to eliminate (excludes query and evidence) Returns: [P(query=0), P(query=1)] -- normalized posterior """

Test case

Rain/Sprinkler/Wet network. Query: P(Rain | Wet=T), elim_order=['Sprinkler'].
Expected: [P(Rain=T|Wet=T), P(Rain=F|Wet=T)] ≈ [0.708, 0.292] (with P(Rain)=0.2, P(Sprinkler)=0.4).

A factor is (scope, table) where scope is a list of variable names and table is a flat dict mapping tuples of assignments to values. E.g., factor P(W|R,S) with evidence W=T becomes scope=['R','S'], table={(0,0):0.99, (0,1):0.80, (1,0):0.90, (1,1):0.01}. To multiply two factors: take the union of their scopes, iterate over all joint assignments, multiply matching entries.

python
from itertools import product

def variable_elimination(network, evidence, query, elim_order):
    # Step 1: Build initial factors from CPTs
    factors = []
    for node, info in network.items():
        scope = info['parents'] + [node]
        cpt = info['cpt']
        table = {}
        for assignment in product([0,1], repeat=len(scope)):
            # Index into nested CPT
            val = cpt
            for idx in assignment:
                val = val[idx]
            table[assignment] = val
        factors.append((scope, table))

    # Step 2: Restrict evidence
    for i, (scope, table) in enumerate(factors):
        new_scope = [v for v in scope if v not in evidence]
        if new_scope == scope:
            continue
        new_table = {}
        for assignment in product([0,1], repeat=len(new_scope)):
            full = list(assignment)
            for j, v in enumerate(scope):
                if v in evidence:
                    full.insert(j, evidence[v])
            new_table[assignment] = table[tuple(full)]
        factors[i] = (new_scope, new_table)

    # Step 3: Eliminate variables one by one
    for var in elim_order:
        # Collect factors mentioning var
        relevant = [(s,t) for s,t in factors if var in s]
        remaining = [(s,t) for s,t in factors if var not in s]

        # Multiply all relevant factors
        combined_scope = list(dict.fromkeys(
            v for s,_ in relevant for v in s))
        combined = {}
        for assignment in product([0,1], repeat=len(combined_scope)):
            amap = dict(zip(combined_scope, assignment))
            val = 1.0
            for s, t in relevant:
                key = tuple(amap[v] for v in s)
                val *= t[key]
            combined[assignment] = val

        # Sum out var
        var_idx = combined_scope.index(var)
        new_scope = [v for v in combined_scope if v != var]
        new_table = {}
        for assignment in product([0,1], repeat=len(new_scope)):
            total = 0.0
            for val in [0, 1]:
                full = list(assignment)
                full.insert(var_idx, val)
                total += combined[tuple(full)]
            new_table[assignment] = total
        remaining.append((new_scope, new_table))
        factors = remaining

    # Step 4: Multiply remaining, normalize
    result = [0.0, 0.0]
    for qval in [0, 1]:
        p = 1.0
        for scope, table in factors:
            key = tuple(qval if v == query else evidence[v]
                        for v in scope)
            p *= table.get(key, table.get((qval,), 1.0))
        result[qval] = p
    total = sum(result)
    return [r/total for r in result]

Bonus challenge: Modify to return the optimal elimination order using the min-degree heuristic (always eliminate the variable that creates the smallest intermediate factor).

Chapter 6: Belief Propagation

For tree-structured networks (no undirected loops), there's an elegant alternative: belief propagation (the sum-product algorithm). Each node sends messages to its neighbors summarizing what it knows. After two passes (leaves to root, root to leaves), every node has the exact marginal posterior.

A message from node i to node j says: "Based on everything I've seen on my side of the tree, here's what I believe about your state." It's local computation with global results.

m_i→j(x_j) = Σ_{x_i} φ(x_i, x_j) · Π_k≠j m_k→i(x_i)

Message Passing Animation

Watch messages flow through a tree. Click "Send Messages" to see each round of propagation. After two passes, all beliefs are exact.

Round 0: No messages sent

Loopy BP: For networks with loops, we can still run message passing — it's not guaranteed to converge or be exact, but in practice it often works well. This is the basis of many practical inference systems.

In implementation, each message is just an array of floats — one entry per state of the receiving node. For binary variables, every message is a [2] vector. A node computes its belief by multiplying all incoming messages element-wise and normalizing. For our 7-node tree, that's 12 messages (one per edge direction), each of size 2. Total computation: 24 multiplications + 7 normalizations. Compare that to the full joint: 2⁷ = 128 entries.

Check: When is belief propagation exact?

Always On tree-structured graphs (no undirected loops) Never — it's always approximate

🔗 Pattern Recognition

HMMs Are Just Chain-Structured Bayes Nets

This Lesson (Bayes Net)

Belief propagation on a tree: messages flow from leaves to root and back. Each message m_i→j(x_j) = Σ_{x_i} φ(x_i,x_j) Π_k m_k→i(x_i)

HMM (Forward-Backward)

Forward variable: α_t(j) = Σ_i α_t-1(i) · a_ij · b_j(o_t). Backward: β_t(i) = Σ_j a_ij · b_j(o_t+1) · β_t+1(j) → HMM Lesson

The HMM forward-backward algorithm IS belief propagation on a chain. The "forward" pass sends messages left-to-right (leaf → root). The "backward" pass sends right-to-left (root → leaves). The "message" at each step is exactly the α or β vector. The transition matrix A plays the role of φ, and the emission probability b plays the role of the local evidence. If you understand BP, you already understand forward-backward — it's the same algorithm on a specific graph topology.

Can you see the analogy further? What would "loopy BP" mean for an HMM? (Hint: think about HMMs with skip connections or factorial HMMs.)

Chapter 7: Interactive DAG Builder

Build your own Bayesian network! Click to add nodes, drag between nodes to add edges. Set CPTs and run inference queries. This is the full SLAM experience — but for probability.

DAG Builder

Click canvas to add nodes. Switch to edge mode to connect them.

Tips: Keep it simple — 3-5 nodes is enough to see interesting behavior. Try building the classic "Asia" network (visit to Asia, tuberculosis, smoking, lung cancer, bronchitis, X-ray, dyspnea).

Chapter 8: Learning Structure

So far we've assumed the graph structure is given. But what if you have data and need to discover the dependencies? This is structure learning — arguably the hardest problem in Bayesian networks.

There are two main approaches: score-based (search over possible graphs, score each one) and constraint-based (test conditional independencies in the data, build the graph from the results).

Score-Based (e.g., BIC, BDeu)

Search over DAGs, score each by how well it fits the data while penalizing complexity. Like model selection.

vs.

Constraint-Based (e.g., PC algorithm)

Run independence tests on data. If X ⊥ Y | Z, remove the edge. Determines skeleton, then orient edges.

Structure Learning Demo

Generate data from a hidden network. Click "Learn Structure" to watch the PC algorithm discover the DAG by testing conditional independencies.

Click Learn Step to begin

Caution: Data alone can't distinguish between some graph structures (Markov equivalence classes). A → B and A ← B encode the same independencies! You need interventional data or prior knowledge to resolve these ambiguities.

Check: Why can't we always uniquely learn the DAG from observational data?

Different DAGs can encode the same conditional independencies (Markov equivalence) Computers aren't powerful enough We need at least 1 million data points

Chapter 9: Connections & Beyond

Bayesian networks are everywhere. An HMM (Hidden Markov Model) is a special case — a chain-structured Bayes net where hidden states emit observations. A Kalman filter is a continuous Gaussian Bayes net with linear dynamics.

Model	Structure	Variables	Key Algorithm
Bayes Net	General DAG	Discrete/continuous	Variable elimination
HMM	Chain	Discrete hidden	Forward-backward
Kalman Filter	Chain	Continuous Gaussian	Predict-update
MRF/CRF	Undirected	Any	Belief propagation
Causal Model	DAG + interventions	Any	do-calculus

Related microLessons: Bayes Filter is the recursive inference algorithm for chain-structured Bayes nets. Bayes Estimation covers the continuous-variable case. HMMs are a special case studied in detail.

Implementation: A Bayes net is just a DAG + CPTs. In code, that's a dictionary of nodes, each storing its parents and its conditional probability table:

python
# A Bayes net = DAG + CPTs. That's it.
network = {
    'Rain':      { 'parents': [],
                   'cpt': [0.2, 0.8] },
    'Sprinkler': { 'parents': ['Rain'],
                   'cpt': [[0.4, 0.6], [0.01, 0.99]] },
    'Wet':       { 'parents': ['Sprinkler', 'Rain'],
                   'cpt': [[[0.99,0.01],[0.8,0.2]],
                           [[0.9,0.1],[0.01,0.99]]] }
}
# Inference = message passing on this data structure
# To query P(Rain|Wet=T): variable elimination over this dict

What breaks at scale: Exact inference in general Bayes nets is NP-hard (for multiply-connected / loopy networks, the intermediate factors in variable elimination can grow exponentially). For networks with >50 densely connected nodes, exact methods are hopeless. Approximate methods step in: MCMC (sample from the posterior by random walks), variational inference (fit a tractable approximation), or loopy belief propagation (run message passing despite loops — not guaranteed to converge, but often works).

This computational wall is a key reason neural networks replaced Bayes nets for perception tasks. A Bayes net for image classification would need one node per pixel, with hand-designed dependencies between neighboring pixels — thousands of nodes, each with CPTs you'd somehow need to specify. A neural network learns these dependencies automatically via gradient descent, scaling to millions of parameters without anyone specifying a single conditional probability. The tradeoff: neural networks are black boxes, while Bayes nets give you interpretable, auditable reasoning.

🏗 Design Challenge You're the Architect: Bedside Diagnostic Bayesian Network ▶ ✓ ATTEMPTED

You're building a Bayesian network for a bedside diagnostic device in a rural clinic. The system must reason about 10 diseases from 30 observable symptoms (binary: present/absent). Doctors enter symptoms on a tablet and get ranked differential diagnoses. The device has limited compute (Raspberry Pi 4, 4GB RAM) and must return results in under 2 seconds.

Diseases

10 (binary: present/absent)

Symptoms

30 (binary observations)

Compute

Raspberry Pi 4, 4GB RAM

Latency

< 2 seconds for full posterior

Missing data

Typically 5-10 symptoms unobserved

Comorbidity

Patient may have 0-3 diseases simultaneously

1. Graph structure: Should diseases be root nodes with symptoms as children (naive Bayes style), or should you allow disease-disease edges (comorbidity) and symptom-symptom edges (symptom cascades)?

2. CPT representation: With 10 potential disease-parents per symptom, a full CPT would have 2¹⁰ = 1024 entries per symptom. How do you make this tractable?

3. Inference algorithm: Given your graph structure choice, is exact inference feasible? If not, what approximate method fits the latency budget?

4. Missing symptoms: 5-10 symptoms will be unobserved. How does your inference handle marginalizing over missing data efficiently?

Real-world solution (QMR-DT, Internist-1/QMR):

Structure: Two-layer bipartite graph. Diseases as roots, symptoms as leaves. NO disease-disease edges (comorbidity handled by the model assuming multiple diseases can be present simultaneously). NO symptom-symptom edges (keeps treewidth low).

CPT compression via Noisy-OR: Instead of 2^k entries per symptom, use the noisy-OR assumption: each disease independently has a probability of causing the symptom, and the symptom appears if ANY cause fires. P(S=F | D₁,...,D_k) = Π_{i: D_i=T} (1 - q_i). This reduces 1024 parameters to just 10 (one q_i per disease). Total network parameters: 10 priors + 30×10 = 310 instead of 30×1024 = 30,720.

Inference: With noisy-OR and the bipartite structure, exact inference via variable elimination is feasible because the effective treewidth is bounded by the number of active diseases (typically 1-3). For the worst case (all 10 diseases), use likelihood weighting (a simple Monte Carlo method) with 1000 samples — takes ~50ms on a Pi 4.

Missing data: Unobserved symptoms are simply not set as evidence — variable elimination naturally marginalizes over them. This is one of Bayes nets' biggest advantages: missing data requires no special handling.

🔗 Pattern Recognition

From Correlation to Causation: Bayes Nets + do-calculus

Bayes Net (This Lesson)

P(Y | X=x) — "What is Y if we OBSERVE X=x?" Condition on evidence. Information flows through all active paths including colliders.

Causal Model (do-calculus)

P(Y | do(X=x)) — "What is Y if we INTERVENE to set X=x?" Cut all incoming edges to X, then compute. Blocks backdoor paths.

A Bayesian network encodes correlational structure: "these variables tend to co-occur." A causal model adds one crucial distinction: the difference between observing and intervening. Observing X=x means conditioning — all paths remain. Intervening (do(X=x)) means forcing X to a value, which severs X from its parents (you delete all incoming edges to X). Same graph, profoundly different semantics. Pearl's do-calculus provides rules for when P(Y|do(X)) can be computed from observational data alone — and it relies entirely on the d-separation machinery you just learned.

If you observe that people who carry lighters have higher lung cancer rates, the Bayes net says P(Cancer | Lighter=yes) > P(Cancer). But does GIVING someone a lighter cause cancer? What edges would you cut?

"Probability theory is nothing but common sense reduced to calculation."

— Pierre-Simon Laplace

You now understand the language of probabilistic reasoning with graphs. From medical diagnosis to robot perception, Bayesian networks turn uncertain evidence into principled decisions.

Understand Bayesian
Networks

Chapter 0: Variables with Dependencies

Chapter 1: DAGs — Directed Acyclic Graphs

Chapter 2: Conditional Probability Tables

Chapter 3: Conditional Independence

Chapter 4: D-Separation

Chapter 5: Exact Inference — Variable Elimination

Chapter 6: Belief Propagation

Chapter 7: Interactive DAG Builder

Chapter 8: Learning Structure

Chapter 9: Connections & Beyond

Understand BayesianNetworks

Chapter 0: Variables with Dependencies

Chapter 1: DAGs — Directed Acyclic Graphs

Chapter 2: Conditional Probability Tables

Chapter 3: Conditional Independence

Chapter 4: D-Separation

Chapter 5: Exact Inference — Variable Elimination

Chapter 6: Belief Propagation

Chapter 7: Interactive DAG Builder

Chapter 8: Learning Structure

Chapter 9: Connections & Beyond

Understand Bayesian
Networks