The Complete Beginner's Path

Understand Bayesian Networks

The framework that lets machines reason about cause and effect under uncertainty — from medical diagnosis to spam filtering to gene regulatory networks.

Prerequisites: Basic probability + Intuition for graphs. That's it.
10 Chapters · 7+ Simulations · 0 Assumed Knowledge

Chapter 0: Variables with Dependencies

It's raining outside. The grass is wet. Are these facts independent? Obviously not — rain causes wet grass. But the sprinkler could also cause wet grass. And maybe the season affects both rain and sprinkler usage. Events in the real world form a tangled web of dependencies.

A Bayesian network represents this web compactly. Instead of specifying a probability for every possible combination of events (which grows exponentially), we only specify local dependencies — how each variable relates to its direct causes.

The core idea: Don't specify the full joint distribution. Instead, factor it into small, local conditional probability tables. The graph structure tells you which variables depend on which. Everything else follows from the chain rule of probability.
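To make the savings concrete, here is a minimal sketch counting free parameters, assuming the structure suggested by the example (Season influences Rain and Sprinkler; Rain and Sprinkler influence Wet; all variables binary):

```python
# Parameter count: full joint table vs factored Bayes net.
# Structure assumed from the example: Season -> Rain, Season -> Sprinkler,
# Rain -> Wet, Sprinkler -> Wet (all variables binary).
parents = {
    "Season": [],
    "Rain": ["Season"],
    "Sprinkler": ["Season"],
    "Wet": ["Rain", "Sprinkler"],
}

n = len(parents)
full_joint = 2 ** n - 1  # one free number per joint outcome, minus normalization
factored = sum(2 ** len(pa) for pa in parents.values())  # one P(X=T | row) per CPT row

print(full_joint)  # 15
print(factored)    # 1 + 2 + 2 + 4 = 9
```

Four variables already show the gap (15 vs 9 numbers), and it widens exponentially as variables are added.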
The Rain-Sprinkler-Grass Network

Click nodes to toggle their state (true/false). Watch how evidence propagates through the network.

Check: Why don't we just list all combinations of variable states?

Chapter 1: DAGs — Directed Acyclic Graphs

A Bayesian network is a DAG: a Directed Acyclic Graph. Each node is a random variable. Each directed edge (arrow) means "this variable directly influences that one." The "acyclic" part means no loops — you can't follow arrows and end up where you started.

The graph encodes the factorization of the joint probability. For variables X₁, …, Xₙ, where Pa(Xᵢ) denotes the parents of Xᵢ:

P(X₁, …, Xₙ) = ∏ᵢ P(Xᵢ | Pa(Xᵢ))

This factorization is why Bayes nets are efficient. Instead of one giant table with 2ⁿ entries, we have n small tables, each conditioned on only a few parents.
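The factorization can be sketched directly in code. P(Rain)=0.2 and the Wet CPT come from the chapters that follow; the Sprinkler prior of 0.3 is an assumed value for illustration:

```python
from itertools import product

# The chain-rule factorization in code: P(R, S, W) = P(R) P(S) P(W | R, S).
# P(Rain)=0.2 and the Wet CPT are from the text; P(Sprinkler)=0.3 is assumed.
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: 0.3, False: 0.7}
p_wet = {(False, False): 0.01, (False, True): 0.90,
         (True, False): 0.80, (True, True): 0.99}  # P(Wet=T | Rain, Sprinkler)

def joint(r, s, w):
    pw = p_wet[(r, s)] if w else 1 - p_wet[(r, s)]
    return p_rain[r] * p_sprinkler[s] * pw

# Three small tables define all eight joint probabilities, and they sum to 1.
total = sum(joint(r, s, w) for r, s, w in product([True, False], repeat=3))
print(round(total, 10))  # 1.0
```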

DAG Terminology

Hover over nodes to see their parents, children, and Markov blanket highlighted.

Markov blanket: A node is conditionally independent of ALL other nodes given its Markov blanket (parents + children + children's other parents). This is the minimal set you need to predict a node.
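The Markov blanket can be read off a parent-list representation of the DAG. A minimal sketch, using the rain-sprinkler network with an assumed Season node:

```python
# Compute a node's Markov blanket (parents + children + children's other
# parents) from a parent-list representation of the DAG.
parents = {
    "Season": [],
    "Rain": ["Season"],
    "Sprinkler": ["Season"],
    "Wet": ["Rain", "Sprinkler"],
}

def markov_blanket(node):
    children = [c for c, pa in parents.items() if node in pa]
    spouses = {p for c in children for p in parents[c] if p != node}
    return set(parents[node]) | set(children) | spouses

print(sorted(markov_blanket("Rain")))  # ['Season', 'Sprinkler', 'Wet']
```

Here Sprinkler enters Rain's blanket only because both are parents of Wet: a "spouse", not a neighbor.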
Check: What does the "acyclic" constraint mean?

Chapter 2: Conditional Probability Tables

Each node stores a CPT (Conditional Probability Table). For a root node (no parents), it's just a prior: P(Rain) = 0.2. For a node with parents, it specifies the probability for each combination of parent states: P(Wet Grass | Rain=T, Sprinkler=T) = 0.99.

The CPT is where domain knowledge lives. A doctor building a medical Bayes net fills in CPTs like P(Cough | Flu=yes, Smoker=yes) = 0.85 based on clinical experience or data.

Interactive CPT Explorer

Click nodes to see their CPT. Adjust probabilities with the sliders below.

Click a node to view its CPT
Rain  Sprinkler  P(Wet=T)
F     F          0.01
F     T          0.90
T     F          0.80
T     T          0.99
Size matters: A node with k binary parents needs a CPT with 2ᵏ rows. This is why Bayes nets work best when each node has few parents. The graph structure keeps CPTs small.
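A quick sketch of this growth, storing a CPT as a dictionary keyed by parent-state tuples (values from the Wet Grass table above):

```python
# A binary node's CPT stored as {parent-state tuple: P(node=T | parents)}.
# These rows match the Wet Grass table above.
wet_cpt = {
    (False, False): 0.01, (False, True): 0.90,
    (True, False): 0.80,  (True, True): 0.99,
}

def num_rows(k):
    """A binary node with k binary parents needs 2**k CPT rows."""
    return 2 ** k

assert len(wet_cpt) == num_rows(2)
print([num_rows(k) for k in range(4)])  # [1, 2, 4, 8]
```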
Check: How many rows does a CPT have for a binary node with 3 binary parents?

Chapter 3: Conditional Independence

The whole point of a Bayes net is to encode conditional independencies. Two variables X and Y are conditionally independent given Z (written X ⊥ Y | Z) if knowing Z makes X irrelevant for predicting Y.

Example: Does knowing the season help you predict if the grass is wet? Yes. But if you already know whether it rained AND whether the sprinkler was on, then season adds nothing. Grass is conditionally independent of Season given {Rain, Sprinkler}.

Why it matters: Conditional independence is what makes inference tractable. Without it, we'd need to consider all variables simultaneously. With it, we can reason locally and propagate information through the graph.
Independence Tester

Select two variables and a conditioning set. The display shows whether they're conditionally independent in this network.

X ⊥ Y | Z  ⇔  P(X | Y, Z) = P(X | Z)
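This definition can be checked by enumeration. A minimal sketch on a chain A → B → C with assumed CPT numbers, verifying that P(C | A, B) = P(C | B):

```python
from itertools import product

# Chain A -> B -> C with assumed CPTs: P(A=T)=0.3, P(B=T|A)=0.9/0.2,
# P(C=T|B)=0.7/0.1. The graph says A ⊥ C | B; verify it numerically.
def joint(a, b, c):
    pa = 0.3 if a else 0.7
    pb = (0.9 if a else 0.2) if b else (0.1 if a else 0.8)
    pc = (0.7 if b else 0.1) if c else (0.3 if b else 0.9)
    return pa * pb * pc

def cond(num_pred, den_pred):
    """P(numerator event | denominator event) by summing joint entries."""
    states = list(product([True, False], repeat=3))
    num = sum(joint(*s) for s in states if num_pred(*s))
    den = sum(joint(*s) for s in states if den_pred(*s))
    return num / den

# P(C=T | A=T, B=T) vs P(C=T | B=T): equal, so knowing A adds nothing given B.
lhs = cond(lambda a, b, c: a and b and c, lambda a, b, c: a and b)
rhs = cond(lambda a, b, c: b and c, lambda a, b, c: b)
print(round(lhs, 6), round(rhs, 6))  # 0.7 0.7
```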
Check: X ⊥ Y | Z means...

Chapter 4: D-Separation

How do you read off conditional independencies from the graph structure? The answer is d-separation. It's a purely graphical criterion: check if information can flow between two nodes along a path, given what you've observed.

There are three fundamental structures. Each behaves differently when the middle node is observed vs. unobserved:

Chain: A → B → C
B blocks the path when observed. (Knowing the mediator screens off the cause from the effect.)
Fork: A ← B → C
B blocks when observed. (Knowing the common cause makes the effects independent.)
Collider: A → B ← C
B blocks when NOT observed. Observing B OPENS the path! (Explaining away.)
D-Separation Visualizer

Click the middle node to toggle observing it. Watch how information flow (green = flows, red = blocked) changes for each structure.

The collider is the tricky one. Unobserved, it blocks flow (the causes are independent). But once you observe the collider or any of its descendants, it opens the path. This is called "explaining away" — if you know the grass is wet and it didn't rain, the sprinkler probably was on.
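Explaining away can be verified numerically. The Wet CPT and P(Rain)=0.2 come from Chapter 2; P(Sprinkler)=0.3 is an assumed prior:

```python
from itertools import product

# Explaining away, numerically. Wet CPT and P(Rain)=0.2 from Chapter 2;
# P(Sprinkler)=0.3 is an assumed prior.
P_WET = {(False, False): 0.01, (False, True): 0.90,
         (True, False): 0.80, (True, True): 0.99}

def joint(r, s, w):
    pw = P_WET[(r, s)] if w else 1 - P_WET[(r, s)]
    return (0.2 if r else 0.8) * (0.3 if s else 0.7) * pw

def p_sprinkler_given(pred):
    """P(Sprinkler=T | states satisfying pred) by enumeration."""
    states = list(product([True, False], repeat=3))
    den = sum(joint(r, s, w) for r, s, w in states if pred(r, s, w))
    num = sum(joint(r, s, w) for r, s, w in states if s and pred(r, s, w))
    return num / den

p_s_given_w = p_sprinkler_given(lambda r, s, w: w)         # wet grass alone
p_s_given_wr = p_sprinkler_given(lambda r, s, w: w and r)  # wet grass AND rain
print(round(p_s_given_w, 3), round(p_s_given_wr, 3))  # 0.701 0.347
```

Seeing wet grass makes the sprinkler likely (0.70); also learning it rained drops that belief to 0.35, because rain already explains the evidence.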
Check: In A → B ← C (collider), when does observing B affect A-C independence?

Chapter 5: Exact Inference — Variable Elimination

Given evidence (some variables observed), we want to compute the posterior probability of a query variable. The brute-force way: sum over all combinations of unobserved variables. But that's exponential. Variable elimination does it smarter by summing out variables one at a time, exploiting the factored structure.

1. Choose elimination order
Pick a variable to sum out
2. Multiply relevant factors
Combine all CPTs involving this variable
3. Sum out the variable
Marginalize: Σ_X f(X, …)
4. Repeat until only query remains
Normalize to get P(query | evidence)
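The four steps above can be sketched as a tiny variable elimination routine over binary variables. P(Rain)=0.2 and the Wet CPT come from Chapter 2; P(Sprinkler)=0.3 is assumed:

```python
from itertools import product

# Minimal variable elimination over binary variables. A factor is a
# (variables, table) pair; the table maps value tuples to numbers.
def restrict(f, var, val):
    """Fix an observed variable, dropping it from the factor."""
    vars_, tbl = f
    if var not in vars_:
        return f
    i = vars_.index(var)
    return (vars_[:i] + vars_[i+1:],
            {k[:i] + k[i+1:]: p for k, p in tbl.items() if k[i] == val})

def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    fv, ft = f
    gv, gt = g
    vars_ = fv + tuple(v for v in gv if v not in fv)
    tbl = {}
    for vals in product([True, False], repeat=len(vars_)):
        asg = dict(zip(vars_, vals))
        tbl[vals] = ft[tuple(asg[v] for v in fv)] * gt[tuple(asg[v] for v in gv)]
    return (vars_, tbl)

def sum_out(f, var):
    """Marginalize a variable out of a factor."""
    vars_, tbl = f
    i = vars_.index(var)
    out = {}
    for k, p in tbl.items():
        key = k[:i] + k[i+1:]
        out[key] = out.get(key, 0.0) + p
    return (vars_[:i] + vars_[i+1:], out)

# P(Rain)=0.2 and Wet CPT from Chapter 2; P(Sprinkler)=0.3 assumed.
factors = [
    (("Rain",), {(True,): 0.2, (False,): 0.8}),
    (("Sprinkler",), {(True,): 0.3, (False,): 0.7}),
    (("Rain", "Sprinkler", "Wet"), {
        (r, s, w): (p if w else 1 - p)
        for (r, s), p in {(False, False): 0.01, (False, True): 0.90,
                          (True, False): 0.80, (True, True): 0.99}.items()
        for w in [True, False]}),
]

# Query P(Rain | Wet=True): restrict on the evidence, multiply the factors
# that mention Sprinkler, sum Sprinkler out, then normalize.
factors = [restrict(f, "Wet", True) for f in factors]
f = multiply(multiply(factors[0], factors[1]), factors[2])
f = sum_out(f, "Sprinkler")
_, tbl = f
z = sum(tbl.values())
posterior = {k[0]: p / z for k, p in tbl.items()}
print(round(posterior[True], 4))  # ~0.4361
```

On a network this small the savings are invisible, but the same three operations (restrict, multiply, sum out) scale to large networks where brute-force enumeration cannot.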
Variable Elimination Step-Through

Query: P(Rain | Wet=true). Click "Next Step" to eliminate variables one at a time. Watch the factors combine and shrink.

Step 0: Initial factors
The elimination order matters! A bad order can create huge intermediate factors. Finding the optimal order is NP-hard, but good heuristics (min-degree, min-fill) work well in practice.
Check: What does variable elimination do differently from brute-force summation?

Chapter 6: Belief Propagation

For tree-structured networks (no undirected loops), there's an elegant alternative: belief propagation (the sum-product algorithm). Each node sends messages to its neighbors summarizing what it knows. After two passes (leaves to root, root to leaves), every node has the exact marginal posterior.

A message from node i to node j says: "Based on everything I've seen on my side of the tree, here's what I believe about your state." It's local computation with global results.

m_{i→j}(x_j) = Σ_{x_i} φ(x_i, x_j) · ∏_{k≠j} m_{k→i}(x_i)
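The message update can be checked on a tiny chain A – B – C with assumed pairwise potentials, comparing the message-passing marginal at B against brute-force enumeration:

```python
from itertools import product

# Sum-product on a tiny chain A - B - C with assumed pairwise potentials.
# Leaves A and C send messages to B; the normalized product of incoming
# messages is B's exact marginal.
phi_ab = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7}  # phi(a, b)
phi_bc = {(b, c): 0.8 if b == c else 0.2 for b in [0, 1] for c in [0, 1]}

# m_{A->B}(b) = sum_a phi(a, b); A is a leaf, so its incoming product is 1.
m_a_to_b = {b: sum(phi_ab[(a, b)] for a in [0, 1]) for b in [0, 1]}
m_c_to_b = {b: sum(phi_bc[(b, c)] for c in [0, 1]) for b in [0, 1]}

belief = {b: m_a_to_b[b] * m_c_to_b[b] for b in [0, 1]}
z = sum(belief.values())
bp_marginal = {b: belief[b] / z for b in [0, 1]}

# Brute-force marginal over the full joint, for comparison.
zj = sum(phi_ab[(a, b)] * phi_bc[(b, c)]
         for a, b, c in product([0, 1], repeat=3))
bf_marginal = {b: sum(phi_ab[(a, b)] * phi_bc[(b, c)]
                      for a, c in product([0, 1], repeat=2)) / zj
               for b in [0, 1]}

print(round(bp_marginal[0], 6), round(bf_marginal[0], 6))  # 0.6 0.6
```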
Message Passing Animation

Watch messages flow through a tree. Click "Send Messages" to see each round of propagation. After two passes, all beliefs are exact.

Round 0: No messages sent
Loopy BP: For networks with loops, we can still run message passing — it's not guaranteed to converge or be exact, but in practice it often works well. This is the basis of many practical inference systems.
Check: When is belief propagation exact?

Chapter 7: Interactive DAG Builder

Build your own Bayesian network! Click to add nodes, drag between nodes to add edges. Set CPTs and run inference queries. This is the full SLAM experience — but for probability.

DAG Builder
Click canvas to add nodes. Switch to edge mode to connect them.
Tips: Keep it simple — 3-5 nodes is enough to see interesting behavior. Try building the classic "Asia" network (visit to Asia, tuberculosis, smoking, lung cancer, bronchitis, X-ray, dyspnea).

Chapter 8: Learning Structure

So far we've assumed the graph structure is given. But what if you have data and need to discover the dependencies? This is structure learning — arguably the hardest problem in Bayesian networks.

There are two main approaches: score-based (search over possible graphs, score each one) and constraint-based (test conditional independencies in the data, build the graph from the results).

Score-Based (e.g., BIC, BDeu)
Search over DAGs, score each by how well it fits the data while penalizing complexity. Like model selection.
vs.
Constraint-Based (e.g., PC algorithm)
Run independence tests on the data. If X ⊥ Y | Z for some conditioning set Z, remove the edge between X and Y. This determines the skeleton; edges are then oriented.
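A minimal sketch of the skeleton phase, using exact independence checks on a known joint in place of the statistical tests a real PC implementation runs on data (P(Sprinkler)=0.3 is an assumed prior):

```python
from itertools import combinations, product

# PC-style skeleton discovery on the rain-sprinkler-wet network, with exact
# independence checks on the known joint standing in for data-driven tests.
VARS = ["Rain", "Sprinkler", "Wet"]
P_WET = {(False, False): 0.01, (False, True): 0.90,
         (True, False): 0.80, (True, True): 0.99}

def joint(r, s, w):
    pw = P_WET[(r, s)] if w else 1 - P_WET[(r, s)]
    return (0.2 if r else 0.8) * (0.3 if s else 0.7) * pw

def prob(assign):
    """Marginal probability of a partial assignment {var: bool}."""
    return sum(joint(r, s, w)
               for r, s, w in product([True, False], repeat=3)
               if all(dict(zip(VARS, (r, s, w)))[v] == val
                      for v, val in assign.items()))

def independent(x, y, z_vars):
    """X ⊥ Y | Z iff P(x, y | z) = P(x | z) P(y | z) for all states."""
    for states in product([True, False], repeat=2 + len(z_vars)):
        xv, yv, *zv = states
        z = dict(zip(z_vars, zv))
        pz = prob(z)
        if pz == 0:
            continue
        lhs = prob({x: xv, y: yv, **z}) / pz
        rhs = (prob({x: xv, **z}) / pz) * (prob({y: yv, **z}) / pz)
        if abs(lhs - rhs) > 1e-9:
            return False
    return True

# Start from the complete graph; drop an edge when some conditioning set
# renders its endpoints independent.
edges = set(combinations(VARS, 2))
for x, y in list(edges):
    rest = [v for v in VARS if v not in (x, y)]
    for size in range(len(rest) + 1):
        if any(independent(x, y, list(zs)) for zs in combinations(rest, size)):
            edges.discard((x, y))
            break

print(sorted(edges))  # [('Rain', 'Wet'), ('Sprinkler', 'Wet')]
```

Rain and Sprinkler are marginally independent, so that edge is dropped; the two edges into Wet survive every test, recovering the skeleton of the true collider structure.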
Structure Learning Demo

Generate data from a hidden network. Click "Learn Structure" to watch the PC algorithm discover the DAG by testing conditional independencies.

Click Learn Step to begin
Caution: Data alone can't distinguish between some graph structures (Markov equivalence classes). A → B and A ← B encode the same independencies! You need interventional data or prior knowledge to resolve these ambiguities.
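A quick demonstration of Markov equivalence: parameterize A → B with assumed numbers, then reverse the edge via Bayes' rule and confirm the two DAGs encode the identical joint:

```python
# Markov equivalence in miniature: A -> B with assumed numbers P(A=T)=0.2,
# P(B=T|A=T)=0.9, P(B=T|A=F)=0.3, vs B -> A re-parameterized via Bayes' rule.
p_a = 0.2
p_b_given_a = {True: 0.9, False: 0.3}

def joint_forward(a, b):
    """P(A, B) = P(A) P(B | A)."""
    pb = p_b_given_a[a] if b else 1 - p_b_given_a[a]
    return (p_a if a else 1 - p_a) * pb

# Reverse the edge: derive P(B) and P(A | B) from the same joint.
p_b = joint_forward(True, True) + joint_forward(False, True)
p_a_given_b = {True: joint_forward(True, True) / p_b,
               False: joint_forward(True, False) / (1 - p_b)}

def joint_reverse(a, b):
    """P(A, B) = P(B) P(A | B)."""
    pb = p_b if b else 1 - p_b
    pa = p_a_given_b[b] if a else 1 - p_a_given_b[b]
    return pb * pa

# Both factorizations give the same joint: data alone cannot tell them apart.
for a in [True, False]:
    for b in [True, False]:
        assert abs(joint_forward(a, b) - joint_reverse(a, b)) < 1e-12
print(round(p_b, 2))  # 0.42
```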
Check: Why can't we always uniquely learn the DAG from observational data?

Chapter 9: Connections & Beyond

Bayesian networks are everywhere. An HMM (Hidden Markov Model) is a special case — a chain-structured Bayes net where hidden states emit observations. A Kalman filter is a continuous Gaussian Bayes net with linear dynamics.

Model          Structure             Variables             Key Algorithm
Bayes Net      General DAG           Discrete/continuous   Variable elimination
HMM            Chain                 Discrete hidden       Forward-backward
Kalman Filter  Chain                 Continuous Gaussian   Predict-update
MRF/CRF        Undirected            Any                   Belief propagation
Causal Model   DAG + interventions   Any                   do-calculus
Related microLessons: Bayes Filter is the recursive inference algorithm for chain-structured Bayes nets. Bayes Estimation covers the continuous-variable case. HMMs are a special case studied in detail.
Modern extensions: Variational inference (approximate methods for large nets), causal inference (Pearl's do-calculus for interventions), deep probabilistic models (VAEs as implicit Bayes nets), and structure learning with neural networks.
"Probability theory is nothing but common sense reduced to calculation."
— Pierre-Simon Laplace

You now understand the language of probabilistic reasoning with graphs. From medical diagnosis to robot perception, Bayesian networks turn uncertain evidence into principled decisions.