Bayesian networks, Markov random fields, d-separation, factor graphs, and belief propagation — the language for reasoning about complex probabilistic models.
A joint distribution over D variables has exponentially many parameters. If each variable takes K values, a general joint distribution needs KD − 1 numbers. For 30 binary variables, that's over a billion parameters.
In practice, most variables interact only with a few others. Your umbrella decision depends on whether it rains, not on the stock market. Graphical models exploit this sparsity by representing the joint distribution as a product of simpler factors, one for each local interaction.
Two main types of graphical models: directed graphs (Bayesian networks) with arrows encoding causal or generative relationships, and undirected graphs (Markov random fields) with edges encoding symmetric correlations.
A Bayesian network (directed acyclic graph, DAG) represents a joint distribution as a product of conditional distributions, one per node:
where pa(xi) denotes the parents of node xi in the graph. Each node stores only its conditional distribution given its parents — a dramatic reduction in parameters.
The key constraint: the graph must be acyclic (a DAG). This ensures we can always compute the joint distribution by processing nodes in topological order. Cycles would create circular dependencies.
Every probabilistic model we've seen so far can be drawn as a Bayesian network: polynomial regression (Ch 3), mixture models (Ch 9), hidden Markov models (Ch 13).
How many parameters does a Bayesian network save? Consider D binary variables in a fully connected model: 2D − 1 parameters. In a Bayesian network where each node has at most K parents: D · 2K parameters — linear in D.
| Variables (D) | Full joint | BN (K=2 parents max) | Savings |
|---|---|---|---|
| 10 | 1,023 | 40 | 25× |
| 20 | 1,048,575 | 80 | 13,107× |
| 30 | 1,073,741,823 | 120 | 8,947,848× |
Conversely, discriminative models like logistic regression model p(C|x) directly. In the graphical model, the arrows point x1 → C, x2 → C, etc. Discriminative models often have fewer parameters and can be more accurate, but they can't generate new data.
Two important special cases of Bayesian networks have closed-form inference:
Discrete variables: Each node takes one of K values. The conditional distribution p(xi|pa(xi)) is stored as a conditional probability table (CPT). For a node with M parents each taking K values, the CPT has KM(K−1) free parameters. The "explaining away" phenomenon (see d-separation) arises naturally.
Linear-Gaussian models: Each node has a Gaussian distribution whose mean is a linear function of its parents:
Factor analysis (Ch 12), Kalman filters (Ch 13), and probabilistic PCA are all linear-Gaussian models expressed as Bayesian networks.
The most important property revealed by graphical models is conditional independence. If A ⊥ B | C, then knowing C makes A and B independent: p(A, B | C) = p(A|C) p(B|C).
Three canonical structures in directed graphs determine the flow of information:
| Structure | Pattern | C observed? | A ⊥ B? |
|---|---|---|---|
| Chain | A → C → B | No | Dependent |
| Yes | Independent | ||
| Fork | A ← C → B | No | Dependent |
| Yes | Independent | ||
| Collider | A → C ← B | No | Independent |
| Yes | Dependent! |
D-separation (directed separation) is the general algorithm for reading conditional independencies from a directed graph. It determines whether sets of nodes A and B are conditionally independent given set C.
Consider all paths between a node in A and a node in B. A path is blocked by C if:
1. The path contains a chain (... → z → ...) or fork (... ← z → ...) where z ∈ C (observed middle node blocks flow), OR
2. The path contains a collider (... → z ← ...) where z ∉ C and no descendant of z is in C (unobserved collider blocks flow).
The Markov blanket of a node x consists of its parents, children, and co-parents (other parents of its children). Given its Markov blanket, x is conditionally independent of all other nodes in the graph. This is the minimal set of nodes that "shields" x from the rest of the network.
Markov random fields (MRFs, undirected graphical models) use undirected edges to represent symmetric relationships. The conditional independence structure is simpler: xA ⊥ xB | xC if C separates A from B in the graph (just remove C and check if A and B are still connected).
The joint distribution factorizes over cliques (maximal fully connected subgraphs):
where ψc are non-negative potential functions (not probabilities!) and Z = ∑x ∏c ψc(xc) is the partition function (normalization constant).
Application: Image denoising. An MRF for images treats each pixel as a node, with edges between neighboring pixels. The potential ψ(xi, xj) encourages neighboring pixels to have similar values (smoothness prior). The observed noisy image provides a data term. MAP inference finds the denoised image.
Factor graphs unify directed and undirected models. They make the factorization structure explicit by introducing factor nodes alongside variable nodes.
A factor graph is a bipartite graph with two types of nodes:
Variable nodes (circles): Represent random variables xi.
Factor nodes (squares): Represent factors fs(xs) in the joint distribution.
Edges connect each factor to the variables it depends on. The joint distribution is:
Factor graphs on trees (no loops) allow exact inference via message passing. On graphs with loops, we can use loopy belief propagation (approximate) or convert to a junction tree (exact but potentially exponential).
The sum-product algorithm (belief propagation) computes exact marginal distributions on tree-structured factor graphs by passing messages between nodes.
Two types of messages:
The first message: a factor sends to a variable by summing over all other variables in the factor, weighted by incoming messages from those variables. The second: a variable sends to a factor by multiplying all incoming messages from other factors.
Computational cost: O(NKM) where N is the number of variables, K is the number of states per variable, and M is the maximum factor size. On a chain, this is O(NK2) — the forward-backward algorithm from Ch 13.
The sum-product algorithm computes marginals. But sometimes we want the most probable configuration x* = arg maxx p(x). The max-sum algorithm does this.
The idea: replace sums with maxes and products with sums (by working in log space):
The max replaces the sum; addition replaces multiplication (in log space). The backtrack step recovers which configuration achieved the maximum.
For graphs with loops, neither exact algorithm applies directly. Options include:
1. Loopy belief propagation: Run sum-product/max-sum on the loopy graph anyway. Not guaranteed to converge, but often works well in practice.
2. Junction tree: Convert to a tree by clustering nodes. Exact, but the cluster sizes (and hence cost) depend on the graph's treewidth.
Explore the three canonical connection types and how observation (conditioning) affects information flow.
Click on nodes to observe/unobserve them. Observed nodes turn green. The status shows whether A and B are d-separated (conditionally independent) given observed nodes.
Chapter 8 established the language for reasoning about complex probabilistic models:
| Concept | Purpose |
|---|---|
| Bayesian networks (DAGs) | Factorize joint distributions using conditional probabilities |
| D-separation | Read conditional independencies from directed graphs |
| Markov random fields | Model symmetric relationships via potential functions |
| Factor graphs | Unify directed/undirected models; support message passing |
| Sum-product | Compute exact marginals on trees |
| Max-sum | Find MAP configuration on trees |
What comes next: Chapter 9 introduces mixture models and EM — using latent variables (naturally represented in graphical models) to create flexible distributions, and the EM algorithm to learn them.