Bishop PRML, Chapter 8

Graphical Models

Bayesian networks, Markov random fields, d-separation, factor graphs, and belief propagation — the language for reasoning about complex probabilistic models.

Prerequisites: Chapters 1–2 (probability theory, distributions).
12
Chapters
2
Simulations
12
Quizzes

Chapter 0: Why Graphs?

A joint distribution over D variables has exponentially many parameters. If each variable takes K values, a general joint distribution needs KD − 1 numbers. For 30 binary variables, that's over a billion parameters.

In practice, most variables interact only with a few others. Your umbrella decision depends on whether it rains, not on the stock market. Graphical models exploit this sparsity by representing the joint distribution as a product of simpler factors, one for each local interaction.

Three roles of graphical models:
1. Visualization: The graph reveals the structure of a probabilistic model at a glance.
2. Factorization: The graph tells us how the joint distribution decomposes into local factors — reducing exponential complexity to manageable size.
3. Inference: The graph structure determines which inference algorithms can be applied efficiently.

Two main types of graphical models: directed graphs (Bayesian networks) with arrows encoding causal or generative relationships, and undirected graphs (Markov random fields) with edges encoding symmetric correlations.

Check: Why are graphical models needed?

Chapter 1: Bayesian Networks

A Bayesian network (directed acyclic graph, DAG) represents a joint distribution as a product of conditional distributions, one per node:

p(x1, ..., xD) = ∏i=1D p(xi | pa(xi))

where pa(xi) denotes the parents of node xi in the graph. Each node stores only its conditional distribution given its parents — a dramatic reduction in parameters.

Example: Consider three variables: Battery (B), Fuel (F), and Engine starts (E). The engine depends on both battery and fuel: p(B,F,E) = p(B) p(F) p(E|B,F). The graph has arrows B → E and F → E. We need 1 + 1 + 4 = 6 parameters instead of 23 − 1 = 7 for the full joint. The savings grow exponentially with the number of variables.

The key constraint: the graph must be acyclic (a DAG). This ensures we can always compute the joint distribution by processing nodes in topological order. Cycles would create circular dependencies.

Every probabilistic model we've seen so far can be drawn as a Bayesian network: polynomial regression (Ch 3), mixture models (Ch 9), hidden Markov models (Ch 13).

Check: In a Bayesian network, the joint distribution factorizes as:

Chapter 2: Factorization & Parameter Savings

How many parameters does a Bayesian network save? Consider D binary variables in a fully connected model: 2D − 1 parameters. In a Bayesian network where each node has at most K parents: D · 2K parameters — linear in D.

Variables (D)Full jointBN (K=2 parents max)Savings
101,0234025×
201,048,5758013,107×
301,073,741,8231208,947,848×
Generative models as Bayesian networks: A generative model for class-conditional densities p(x|Ck) paired with class priors p(Ck) has a natural BN: C → x1, C → x2, ..., C → xD. If we add the naive Bayes assumption (features conditionally independent given the class), each feature depends only on C. The graph reveals the assumption: no edges between features.

Conversely, discriminative models like logistic regression model p(C|x) directly. In the graphical model, the arrows point x1 → C, x2 → C, etc. Discriminative models often have fewer parameters and can be more accurate, but they can't generate new data.

Check: What assumption does naive Bayes make, as revealed by its graph?

Chapter 3: Discrete & Gaussian Models

Two important special cases of Bayesian networks have closed-form inference:

Discrete variables: Each node takes one of K values. The conditional distribution p(xi|pa(xi)) is stored as a conditional probability table (CPT). For a node with M parents each taking K values, the CPT has KM(K−1) free parameters. The "explaining away" phenomenon (see d-separation) arises naturally.

Linear-Gaussian models: Each node has a Gaussian distribution whose mean is a linear function of its parents:

p(xi | pa(xi)) = N(xi | ∑j ∈ pa(i) wij xj + bi, vi)
Linear-Gaussian = Gaussian: Any Bayesian network with linear-Gaussian conditionals defines a joint Gaussian distribution over all variables. The mean vector and covariance matrix can be computed recursively from the graph structure. This gives a graphical interpretation of the multivariate Gaussian — and shows why Gaussians are so convenient: the graphical model and the closed-form distribution align perfectly.

Factor analysis (Ch 12), Kalman filters (Ch 13), and probabilistic PCA are all linear-Gaussian models expressed as Bayesian networks.

Check: What kind of joint distribution does a linear-Gaussian Bayesian network define?

Chapter 4: Conditional Independence

The most important property revealed by graphical models is conditional independence. If A ⊥ B | C, then knowing C makes A and B independent: p(A, B | C) = p(A|C) p(B|C).

Three canonical structures in directed graphs determine the flow of information:

StructurePatternC observed?A ⊥ B?
ChainA → C → BNoDependent
YesIndependent
ForkA ← C → BNoDependent
YesIndependent
ColliderA → C ← BNoIndependent
YesDependent!
Explaining away (the collider): This is the surprising case. If two causes (A, B) share a common effect (C), they're independent a priori. But once you observe the effect, they become dependent — knowing one cause "explains away" the other. Example: a car won't start (C). If we know the battery is dead (A), fuel problems (B) become less likely. Observing the effect creates a dependency between the causes.
Check: In a collider structure A → C ← B, what happens when C is observed?

Chapter 5: D-Separation

D-separation (directed separation) is the general algorithm for reading conditional independencies from a directed graph. It determines whether sets of nodes A and B are conditionally independent given set C.

Consider all paths between a node in A and a node in B. A path is blocked by C if:

1. The path contains a chain (... → z → ...) or fork (... ← z → ...) where z ∈ C (observed middle node blocks flow), OR

2. The path contains a collider (... → z ← ...) where z ∉ C and no descendant of z is in C (unobserved collider blocks flow).

D-separation rule: A and B are conditionally independent given C (written A ⊥G B | C) if and only if every path between A and B is blocked by C. If even one path is unblocked, they are (potentially) dependent.

The Markov blanket of a node x consists of its parents, children, and co-parents (other parents of its children). Given its Markov blanket, x is conditionally independent of all other nodes in the graph. This is the minimal set of nodes that "shields" x from the rest of the network.

Check: What is the Markov blanket of a node?

Chapter 6: Markov Random Fields

Markov random fields (MRFs, undirected graphical models) use undirected edges to represent symmetric relationships. The conditional independence structure is simpler: xA ⊥ xB | xC if C separates A from B in the graph (just remove C and check if A and B are still connected).

The joint distribution factorizes over cliques (maximal fully connected subgraphs):

p(x) = (1/Z) ∏c ψc(xc)

where ψc are non-negative potential functions (not probabilities!) and Z = ∑xc ψc(xc) is the partition function (normalization constant).

Directed vs undirected: Bayesian networks factorize into conditional probabilities (always normalized). MRFs factorize into potential functions (need separate normalization via Z). Computing Z is often intractable — a sum over all configurations. This is both the weakness and the strength of MRFs: the unnormalized potentials are more flexible, but the partition function is harder to compute.

Application: Image denoising. An MRF for images treats each pixel as a node, with edges between neighboring pixels. The potential ψ(xi, xj) encourages neighboring pixels to have similar values (smoothness prior). The observed noisy image provides a data term. MAP inference finds the denoised image.

Check: What makes computing with MRFs harder than Bayesian networks?

Chapter 7: Factor Graphs

Factor graphs unify directed and undirected models. They make the factorization structure explicit by introducing factor nodes alongside variable nodes.

A factor graph is a bipartite graph with two types of nodes:

Variable nodes (circles): Represent random variables xi.

Factor nodes (squares): Represent factors fs(xs) in the joint distribution.

Edges connect each factor to the variables it depends on. The joint distribution is:

p(x) = ∏s fs(xs)
Why factor graphs? Both Bayesian networks and MRFs can be converted to factor graphs, but factor graphs are more explicit about the factorization. Two different graphical models that yield the same clique decomposition (and hence the same undirected graph) may have different factor graphs that distinguish the underlying factorizations. This matters for inference: the factor graph structure directly determines which message-passing algorithms work.

Factor graphs on trees (no loops) allow exact inference via message passing. On graphs with loops, we can use loopy belief propagation (approximate) or convert to a junction tree (exact but potentially exponential).

Check: What are the two types of nodes in a factor graph?

Chapter 8: The Sum-Product Algorithm

The sum-product algorithm (belief propagation) computes exact marginal distributions on tree-structured factor graphs by passing messages between nodes.

Two types of messages:

μf→x(x) = ∑xs\x fs(xs) ∏xi ∈ ne(f)\x μxi→f(xi)
μx→f(x) = ∏fj ∈ ne(x)\f μfj→x(x)

The first message: a factor sends to a variable by summing over all other variables in the factor, weighted by incoming messages from those variables. The second: a variable sends to a factor by multiplying all incoming messages from other factors.

Local computation, global result: Each message is computed from local information only (the factor and its neighbors). Yet after a single pass from leaves to root and back, every node has its exact marginal distribution: p(x) ∝ ∏f ∈ ne(x) μf→x(x). This is the miracle of message passing — global inference from local computation. It works because the tree structure ensures each message is independent.

Computational cost: O(NKM) where N is the number of variables, K is the number of states per variable, and M is the maximum factor size. On a chain, this is O(NK2) — the forward-backward algorithm from Ch 13.

Check: What does the sum-product algorithm compute?

Chapter 9: The Max-Sum Algorithm

The sum-product algorithm computes marginals. But sometimes we want the most probable configuration x* = arg maxx p(x). The max-sum algorithm does this.

The idea: replace sums with maxes and products with sums (by working in log space):

μf→x(x) = maxxs\x [ln fs(xs) + ∑xi ∈ ne(f)\x μxi→f(xi)]

The max replaces the sum; addition replaces multiplication (in log space). The backtrack step recovers which configuration achieved the maximum.

Max-sum is the Viterbi algorithm: Applied to a hidden Markov model (a chain-structured factor graph), the max-sum algorithm reduces exactly to the Viterbi algorithm from Ch 13. Applied to a tree-structured CRF, it gives the MAP labeling. The graphical model framework unifies algorithms that were developed independently in different communities.

For graphs with loops, neither exact algorithm applies directly. Options include:

1. Loopy belief propagation: Run sum-product/max-sum on the loopy graph anyway. Not guaranteed to converge, but often works well in practice.

2. Junction tree: Convert to a tree by clustering nodes. Exact, but the cluster sizes (and hence cost) depend on the graph's treewidth.

Check: What is the relationship between max-sum and the Viterbi algorithm?

Chapter 10: Graph Explorer

Explore the three canonical connection types and how observation (conditioning) affects information flow.

D-Separation Explorer

Click on nodes to observe/unobserve them. Observed nodes turn green. The status shows whether A and B are d-separated (conditionally independent) given observed nodes.

Click a structure
Check: In a fork A ← C → B, are A and B independent when C is NOT observed?

Chapter 11: Summary

Chapter 8 established the language for reasoning about complex probabilistic models:

ConceptPurpose
Bayesian networks (DAGs)Factorize joint distributions using conditional probabilities
D-separationRead conditional independencies from directed graphs
Markov random fieldsModel symmetric relationships via potential functions
Factor graphsUnify directed/undirected models; support message passing
Sum-productCompute exact marginals on trees
Max-sumFind MAP configuration on trees
The graphical model as a universal language: Nearly every model in this book — linear regression, mixture models, HMMs, Kalman filters, Boltzmann machines — can be expressed as a graphical model. The graph reveals the model's assumptions (conditional independencies), determines which inference algorithms apply, and guides the design of new models. Chapters 9–14 build heavily on this framework.

What comes next: Chapter 9 introduces mixture models and EM — using latent variables (naturally represented in graphical models) to create flexible distributions, and the EM algorithm to learn them.

"A graph comprises nodes (also called vertices) connected
by links (also known as edges or arcs). In a probabilistic
graphical model, each node represents a random variable."
— Christopher Bishop, PRML §8
Check: What unifying framework does Chapter 8 provide for all probabilistic models?