Kochenderfer, Wheeler & Wray, Chapter 5

Structure Learning

You have data but no graph. This chapter builds Bayesian network structure from scratch — scoring how well every possible graph explains the data, then searching a super-exponential space to find the best one.

Prerequisites: Chapter 4 (Parameter Learning) — Bayesian scores, Dirichlet priors, counting from data.
10
Chapters
6+
Simulations
10
Quizzes

Chapter 0: The Structure Problem

Chapter 4 assumed you already knew the Bayesian network structure — which variables are parents of which. But in real applications, you rarely have that luxury. You have a spreadsheet of observations and the question: what graph produced this data?

This is structure learning: finding a directed acyclic graph (DAG) G that best explains the observed dataset D. It seems straightforward until you count the search space. With 10 variables there are 4.2 × 1018 possible DAGs. With 20 variables: 2.3 × 1072. No computer can enumerate them. We need smarter strategies.

The core challenge has two parts: (1) how do we score a graph to decide if it is good? (2) how do we search the space of graphs efficiently? This chapter answers both.

Structure learning is NP-hard. Finding the optimal Bayesian network structure from data is provably NP-hard. No polynomial-time algorithm exists unless P=NP. In practice, we use heuristic search algorithms that find good — but not necessarily optimal — structures.
The Combinatorial Explosion

Use the slider to see how the count of possible DAGs (Robinson's formula) grows with the number of nodes. Even 6 nodes have over 3.7 million possible structures.

Number of nodes 5

How Many DAGs Exist? Robinson's Formula

The number of DAGs with n nodes grows super-exponentially. The exact count a(n) satisfies the Robinson recurrence: a(n) = ∑k=1n (-1)k+1 C(n,k) 2k(n-k) a(n-k). The first few values:

Nodes (n)Number of DAGsStorage at 8 bytes each
118 bytes
2324 bytes
325200 bytes
529,281234 KB
7~1.1 billion9 GB
10~4.2 × 1018~30 million TB
20>1072not even close to possible

The jump from 7 nodes (barely fits on disk) to 10 nodes (more than all storage on Earth) underscores why heuristic search is the only option for real-world networks. The key insight is that n=10 is still a small Bayesian network — real medical or genetic networks have hundreds of variables.

Why is exhaustive enumeration of all Bayesian network structures infeasible?

Chapter 1: The Bayesian Score

To compare structures, we need a scoring function. The Bayesian score evaluates how well structure G fits data D by computing the log marginal likelihood — the probability of the data after integrating out (marginalizing) all possible parameter settings:

log P(D | G) = ∑i=1nj=1qi [ log Γij0) − log Γij0 + mij0) + ∑k=1ri log Γijk + mijk) − log Γijk) ]

Here mijk is the count of observations where variable i takes its k-th value with parents in their j-th configuration. The αijk are Dirichlet pseudocounts from the prior — which act as a built-in complexity penalty.

Because this integral marginalizes over all parameters, it scores the structure rather than any one parameter setting. Structures with many parameters need lots of data to justify their complexity — this is automatic Occam's razor.

The score decomposes node-by-node. Because the joint distribution over a BN factorizes as ∏ P(xi | pa(xi)), the Bayesian score also decomposes: log P(D|G) = ∑i score(i, pa(i), D). This local decomposition is crucial for efficient search — we can update the score of just one node when we add, remove, or reverse one edge.

Worked Example: Scoring a 3-Variable Graph

Let's compute the Bayesian score for a tiny dataset. We have 3 binary variables A, B, C and 8 data points. We compare two graphs: the empty graph (no edges) vs. A → B.

Data (each row is one observation of [A, B, C]):

#ABC
1000
2001
3010
4101
5110
6111
7110
8001

Use α = 1 (one virtual sample). For binary variables with qi parent configurations, αij0 = 1/qi, αijk = 1/(2qi).

Scoring node B with no parents (empty graph): B's value counts: B=0 appears 4 times, B=1 appears 4 times. With αij0=1, αijk=0.5:

score(B, ∅) = [Γ(1) − Γ(9)] + [Γ(4.5) − Γ(0.5)] + [Γ(4.5) − Γ(0.5)]

Computing: log Γ(1)=0, log Γ(9)=log(8!)≈10.60, log Γ(4.5)≈2.15, log Γ(0.5)≈0.572. Result: score(B, ∅) ≈ 0 − 10.60 + 2.15 − 0.572 + 2.15 − 0.572 = −7.45.

Scoring node B with parent A: Now q=2 (A=0 or A=1). For A=0: B=0 three times, B=1 once. For A=1: B=0 once, B=1 three times. With αij0=0.5, αijk=0.25:

score(B, {A}) ≈ 2×[Γ(0.5)−Γ(4.5)] + [Γ(3.25)+Γ(1.25)−2Γ(0.25)] + [Γ(1.25)+Γ(3.25)−2Γ(0.25)]

Numerically, this comes out to approximately −5.61. Since −5.61 > −7.45, the edge A→B is justified by this data. The score improved by 1.84 log-units — the evidence for A causing B is real.

Marginal likelihood ratios are Bayes factors. The difference in Bayesian scores is the log of the Bayes factor for comparing the two structures. A difference of 2 log-units corresponds to a Bayes factor of e2≈7.4 — "substantial" evidence by Jeffreys' scale.

The textbook uses the BDe prior (Bayesian Dirichlet equivalent) where the pseudocounts are chosen so that the score is equivalent class consistent: Markov-equivalent structures receive the same score. The simplest choice is αijk = α/(qi · ri) where α is the total number of virtual samples. Common defaults: α = 1 (one virtual sample spread uniformly over all configurations).

python
from math import lgamma  # log-gamma for numerical stability
from collections import defaultdict

def bayesian_score_node(data, child, parents, alpha=1.0, r_vals=2):
    """Local Bayesian score for one node given its parents (binary variables)."""
    pa_configs = defaultdict(lambda: defaultdict(int))
    for row in data:
        pa = tuple(row[p] for p in parents)
        pa_configs[pa][row[child]] += 1
    q = 2 ** len(parents)           # number of parent configurations
    alpha_ij0 = alpha / q            # pseudocounts per parent config
    alpha_ijk = alpha_ij0 / r_vals   # pseudocounts per (parent, child) cell
    score = 0.0
    for pa_cfg, child_counts in pa_configs.items():
        m_ij0 = sum(child_counts.values())     # total count for this parent config
        score += lgamma(alpha_ij0) - lgamma(alpha_ij0 + m_ij0)
        for k in range(r_vals):
            m_ijk = child_counts.get(k, 0)
            score += lgamma(alpha_ijk + m_ijk) - lgamma(alpha_ijk)
    return score

def bayesian_score(data, graph_edges, nodes):
    """Total Bayesian score = sum of local scores."""
    parents = {n: [] for n in nodes}
    for (pa, ch) in graph_edges:
        parents[ch].append(pa)
    return sum(bayesian_score_node(data, n, parents[n]) for n in nodes)
Score vs Data: Simple vs Complex Structures

Three 3-variable structures compete. With few data points, the simple model wins; with many, the true model wins. The complex model is never optimal here.

Data samples (log scale) 100
Why does the Bayesian score penalize complex structures when data is scarce?

Chapter 2: The BIC Score

The Bayesian score requires choosing Dirichlet pseudocounts. An alternative is the Bayesian Information Criterion (BIC), which approximates the log marginal likelihood using the maximum likelihood and a closed-form complexity penalty:

BIC(G, D) = ℓ(θ*, D, G) − dim(G)/2 · log m

Here ℓ(θ*, D, G) is the log-likelihood at the MLE parameters, dim(G) is the number of free parameters in the model, and m is the number of data points. The penalty term dim(G)/2 · log m grows with both model complexity and dataset size.

For binary variables in a graph G, the dimension is dim(G) = ∑i 2|pa(i)|. Each additional parent of node i doubles its CPT size. This exponential growth in parameters is the main reason we prefer sparse graphs.

BIC vs Bayesian score in practice. Both implement Occam's razor, but differently. BIC has no free prior parameters and converges to the correct structure as data grows (model selection consistency). The Bayesian score requires specifying pseudocounts but is exact rather than approximate. For most practical problems, they give similar results.

BIC vs AIC vs MDL: Three Flavors of Penalized Likelihood

The BIC is not the only penalized likelihood criterion. Three criteria dominate the structure learning literature:

CriterionFormulaPenalty Grows WithConsistency
AIC (Akaike Info. Criterion)ℓ(θ*) − dim(G)Only model complexity, not mNo — selects overly complex models asymptotically
BIC (Bayesian Info. Criterion)ℓ(θ*) − dim(G)/2 · log mBoth model complexity and mYes — recovers true structure as m → ∞
MDL (Minimum Description Length)Code length of data + modelBoth, but with universal priorYes — same asymptotic behavior as BIC

The key difference between AIC and BIC: AIC penalizes each additional parameter by exactly 1 (constant), while BIC penalizes by log(m)/2 (grows with data). As m increases, BIC penalizes more aggressively, which causes it to prefer simpler models. This is why BIC is consistent (correct asymptotically) while AIC is not.

In practice: with small data (m < 100), AIC is often preferred because it allows more complex models that can fit the data well. With large data, BIC is preferred because it correctly identifies the structure that generated the data. The textbook uses BIC as the default because structure learning problems typically have large datasets in real applications.

One important subtlety: BIC and AIC are derived under the assumption that data points are i.i.d. (independent and identically distributed). For Bayesian networks, this means assuming that the rows of the data matrix are independent observations from the same joint distribution. If there is temporal autocorrelation (time-series data) or hidden confounders, BIC may select incorrectly even with large data. The textbook assumes i.i.d. data throughout Chapters 1–5; Chapter 13 addresses time-dependent data separately.

A numerical illustration: for a candidate edge that adds 1 free parameter and improves the log-likelihood by 3.5 nats:

AIC delta = +3.5 − 1 = +2.5 → add the edge
BIC delta (m=100) = +3.5 − log(100)/2 = +3.5 − 2.3 = +1.2 → still add
BIC delta (m=2000) = +3.5 − log(2000)/2 = +3.5 − 3.8 = −0.3 → reject!

The same edge that AIC accepts for any dataset size is rejected by BIC when we have 2000+ data points — because with enough data, we have statistical power to determine the edge isn't really there.

python
import numpy as np
from collections import Counter

def bic_score(data, child, parents, r_child=2):
    """BIC score for child given parents (binary variables assumed)."""
    m = len(data)
    dim = r_child * (2 ** len(parents))  # free params per CPT row
    penalty = (dim / 2) * np.log(m)

    # Compute log-likelihood at MLE
    ll = 0.0
    for row in data:
        pa_cfg = tuple(row[p] for p in parents)
        x_val  = row[child]
        # Count occurrences of this (pa_cfg, x_val) pair
        n_pa = sum(1 for d in data if tuple(d[p] for p in parents) == pa_cfg)
        n_px = sum(1 for d in data if
                   tuple(d[p] for p in parents) == pa_cfg and d[child] == x_val)
        if n_pa > 0:
            ll += np.log(n_px / n_pa + 1e-10)
    return ll - penalty
BIC Penalty Grows with Data

Drag the slider to see how the BIC penalty (dim/2 × log m) grows. More data punishes extra parameters more harshly. The teal line shows when the penalty exceeds the gain for a hypothetical edge.

Data size m 100

A quick derivation of BIC: the marginal likelihood log P(D|G) is approximated by the Laplace approximation around the MLE. The Hessian of the log-likelihood contributes a term −(dim/2)log m + constant. The dominant terms are the MLE log-likelihood and this Hessian term, giving BIC = ℓ(θ*) − dim/2 · log m. The constant terms cancel when comparing models, which is why the textbook's formula omits them.

The BIC formula also illuminates why structure learning is harder with fewer data. With m=20 data points and a candidate edge adding 1 parameter to a binary CPT: BIC delta = log-likelihood gain − log(20)/2 ≈ LL gain − 1.5. A spurious edge that increases LL by 2 (due to noise, not signal) would be incorrectly accepted. With m=500: BIC delta = LL gain − log(500)/2 ≈ LL gain − 3.1 — the same spurious edge is now correctly rejected. This explains why structure learning requires large datasets (typically 5–10× the number of parameters in the network) to be reliable.

A node has 3 binary parents. How many free parameters does its CPT have?

Chapter 3: K2 Search

The K2 algorithm is one of the earliest practical structure learning methods. It requires a user-specified variable ordering and starts from an empty graph. For each variable (processed in order), it greedily adds the parent that most improves the score, stopping when no further addition helps.

1. Order
Fix a variable ordering: X1, X2, ..., Xn
2. Loop over variables
For each Xi: start with no parents. Repeatedly try adding a parent from {X1, ..., Xi-1}. Keep the best one if it improves the score.
3. Stop
Move to Xi+1 when no more parent additions improve the score

K2 is fast — it runs in polynomial time — but depends critically on the variable ordering. If the true parent comes later in the ordering than its child, K2 can never find that edge. Different orderings can produce very different graphs.

The ordering matters enormously. K2 can only add parents from variables earlier in the ordering. If A is the true cause of B but A appears after B in the ordering, K2 will never discover the A → B edge. Domain knowledge is crucial here. Alternatively, try multiple orderings and take the best result.

The Cooper & Herskovits (1992) Paper

K2 was introduced in "A Bayesian Method for the Induction of Probabilistic Networks from Data" (Cooper & Herskovits, Machine Learning, 1992). The paper was one of the first to treat Bayesian network structure learning as a Bayesian inference problem. The key insight: instead of searching for parameters given a fixed structure, we search for structures by computing their marginal likelihood.

The paper introduced the "BD score" (Bayesian Dirichlet), which is the same Bayesian score formula from Chapter 1 with a specific prior: the "parameter independence assumption" (each node's CPT is independent a priori) plus "parameter modularity" (the prior only depends on the local family, not the global graph). These two assumptions together yield the decomposable form: log P(D|G) = ∑i score(i, pa(i), D).

The original paper validated K2 on three benchmark datasets: the ALARM network (37 nodes, medical diagnosis), the expert-designed MUNIN network, and a small genetics application. On ALARM with 10,000 samples and the correct ordering, K2 recovered the true network exactly. This validation established Bayesian network structure learning as a practical tool and inspired decades of subsequent work.

Historical impact. The Cooper & Herskovits paper is one of the most cited papers in the machine learning literature, with over 5,000 citations. It launched the field of probabilistic graphical model learning. The BDe score variant (Heckerman et al., 1995), which ensures equivalence class consistency, became the standard default in structure learning packages and is what the textbook refers to as "the Bayesian score."

The K2 algorithm specifically uses this prior. The key formula for node i with parent configuration j and child value k, with mijk observed counts and alpha pseudocounts spread uniformly: this integrates the Dirichlet-multinomial conjugate pair analytically, giving a closed-form expression in terms of log-gamma functions (see code above). The beauty is that no numerical integration is needed — the marginal likelihood has an exact formula.

The Python library pgmpy provides a complete K2 implementation. The key function is pgmpy.estimators.K2Score combined with pgmpy.estimators.HillClimbSearch. The R package bnlearn is the reference implementation, with rsmax2() for K2 and hc() for hill climbing. Both libraries implement the incremental cache update optimization and produce equivalent results to the textbook's formulation.

The Cooper & Herskovits K2 implementation used a hardcoded max-parent limit of u=2 in the original experiments. Modern libraries default to u=10 or no limit, relying on the score function's built-in regularization. If computational budget is a concern, setting u=3 or u=4 captures most true edges while keeping CPT sizes manageable: a node with 4 binary parents has 24×2 = 32 parameters — already a substantial CPT that requires many samples to estimate reliably.

The original K2 paper (Cooper & Herskovits, 1992) also limits the number of parents per node to a user-specified maximum u. This controls CPT size and prevents overfitting. K2 runs in O(n · u2) score evaluations — fast even for 50+ variables if u is small.

K2 Design ChoiceEffect on Result
Variable orderingDetermines which edges are possible; a bad ordering misses true edges entirely
Max parents uLimits CPT size; prevents exponential parameter blowup
Score functionBayesian (with α) or BIC both work; Bayesian score is more principled
Stopping criterionNo improvement OR u parents reached — whichever comes first
K2 Search Step-by-Step

Watch K2 greedily add edges to maximize the score. "Step" adds one best parent. "Shuffle" tries a new random ordering. Teal edges are correct; orange are spurious.

Order: A,B,C,D

The Cooper & Herskovits K2 algorithm also introduced a practical heuristic for choosing the variable ordering: sort variables by the number of their likely parents (fewest parents first). This heuristic works because K2 can only add parents from earlier in the ordering — if a node is later in the ordering, more of its potential parents are available. In practice, topological sorting of a rough domain-expert graph is the standard approach for initializing the ordering.

What is the key limitation of K2 search?

Chapter 4: Local Graph Search

Local search (hill climbing) needs no variable ordering. Starting from an initial graph (often empty), it repeatedly considers all neighboring graphs — those reachable by one of three basic operations — and moves to the highest-scoring one:

Add Edge
Insert a new directed edge Xi → Xj (only if it does not create a cycle)
Remove Edge
Delete an existing directed edge Xi → Xj
Reverse Edge
Flip direction: Xi → Xj becomes Xj → Xi (only if this stays acyclic)

The search continues until no neighbor scores higher than the current graph. Since the Bayesian score decomposes node-by-node, evaluating each operation only requires recomputing scores for the two affected nodes — making each step cheap.

The decomposability payoff. Because score(G) = ∑i score(i, pa(i)), evaluating the effect of adding edge A → B requires only recomputing score(B, pa(B) ∪ {A}) — not rescoring the whole graph. This keeps each hill-climbing step O(n) rather than O(m) where m is the data size.
python
def local_search(nodes, score_fn, max_iter=1000):
    """Hill-climbing structure search. score_fn(i, parents) returns local score."""
    edges = set()  # start from empty graph
    for _ in range(max_iter):
        best_delta, best_op = 0, None
        for i in nodes:
            for j in nodes:
                if i == j: continue
                # Try add i->j
                if (i, j) not in edges and not creates_cycle(edges, i, j):
                    delta = score_fn(j, parents(j, edges) | {i}) - score_fn(j, parents(j, edges))
                    if delta > best_delta: best_delta, best_op = delta, ('add', i, j)
                # Try remove i->j
                if (i, j) in edges:
                    delta = score_fn(j, parents(j, edges) - {i}) - score_fn(j, parents(j, edges))
                    if delta > best_delta: best_delta, best_op = delta, ('remove', i, j)
        if best_op is None: break  # local optimum
        apply_op(edges, best_op)
    return edges

Complexity and Practical Optimizations

Naive local search evaluates all n2 possible edges at each step — O(n2) score evaluations per iteration, each costing O(m) to compute over m data points. For n=50 variables and m=10,000 data points, one step requires 2,500 score evaluations each scanning 10,000 rows: 25 million row reads per step. With 1,000 steps, that is 25 billion row reads. Prohibitively slow.

Two optimizations make local search tractable:

  1. Caching sufficient statistics: The Bayesian score for node i depends only on the counts mijk. These counts can be precomputed in one O(m × n) pass. After adding an edge A → B, only the count table for B needs updating: O(m) not O(m × n).
  2. Priority queue of candidates: Instead of evaluating all O(n2) edge candidates at each step, maintain a priority queue sorted by score delta. Only re-evaluate candidates involving the modified nodes after each operation. Reduces per-step cost to O(n) amortized.
OptimizationTime SavingImplementation Cost
Precomputed count tablesTurns O(m) per candidate into O(riqi) lookupLow — one-pass scan at start
Incremental cache updateOnly update counts for affected node after each opMedium — careful bookkeeping
Priority queue of candidatesReduces to O(n) re-evaluations per stepMedium — standard heap operations
Early stopping on reversalOnly try reversals of existing edges, not all pairsLow — check edge set before evaluating

These optimizations are implemented in standard libraries. The pgmpy Python package implements the full optimized local search. The bnlearn R package is the reference implementation used in most structure learning benchmarks. Both can handle networks with 100+ nodes in minutes.

Which of the three graph operations in local search must be checked for cycle creation?

Chapter 5: Markov Equivalence

Here is a surprising fact: two very different-looking Bayesian networks can be statistically identical. The networks A → B and A ← B make the same conditional independence statement (A and B are dependent, no independences). Any dataset generated from one could equally well have been generated from the other with suitable parameters. We cannot tell edge directions apart from data alone in this case.

Two DAGs are Markov equivalent if and only if they encode the exact same set of conditional independence relationships. The textbook gives the elegant characterization (Verma & Pearl, 1990): two DAGs are Markov equivalent if and only if they have (1) the same skeleton (same edges ignoring direction), and (2) the same v-structures (immoralities).

What is a v-structure (immorality)? A pattern X → Z ← Y where X and Y are not adjacent is called a v-structure. It is the ONLY pattern where edge direction changes the encoded independence structure: in X → Z ← Y, X and Y are marginally independent but become dependent given Z. Reversing either edge would break this property.

The Bayesian score with BDe priors assigns identical scores to all members of the same Markov equivalence class. This means structure learning can recover the correct equivalence class but not always the exact DAG. V-structures and domain knowledge are needed to determine edge directions within a class.

Causal reasoning requires more than conditional independence. Suppose we learn the CPDAG A — B → C from data. We know B causes C (direction is certain), but we don't know if A causes B or B causes A. If we intervene on A (a randomized experiment), we can resolve this ambiguity: if changing A changes B, then A → B. If not, then A ← B. The CPDAG represents the limit of what observational data can tell us; the resolved DAG requires intervention. This is Pearl's fundamental argument for why causal inference is distinct from statistical inference.

A worked example: consider A → B → C, A ← B → C, and A → B ← C. The first two have the same skeleton (A—B, B—C) and no v-structures: they are Markov equivalent (both encode A ⊥ C | B). The third (A → B ← C) has a v-structure at B: A and C are marginally independent but become dependent given B. It forms its own equivalence class. Data can distinguish the third from the first two, but cannot distinguish the first from the second.

This has profound implications for causal discovery. Causality requires knowing edge directions. Observational data can recover the CPDAG (equivalence class) but not causal directions within it. Resolving directions requires: (1) domain knowledge (time ordering, physical constraints), (2) interventional data (randomized experiments), or (3) additional assumptions (non-Gaussian noise models, as in LiNGAM).

Markov Equivalence Explorer

Graph 1 (left) is fixed as A→B→C. Flip edges in Graph 2 (right). Orange edges form v-structures. When the v-structures match, the graphs are Markov equivalent.

Checking equivalence...
When do two DAGs encode different conditional independence statements?

Chapter 6: CPDAG Search

A completed partially directed acyclic graph (CPDAG), also called an essential graph, is a compact representation of an entire Markov equivalence class. It contains both directed and undirected edges: directed edges appear where all members of the class agree on the direction; undirected edges appear where the direction varies.

The textbook describes a key insight: instead of searching the space of DAGs and encountering many equivalent graphs with identical scores, we can search directly in the space of CPDAGs. This space is roughly 3.7 times smaller than the DAG space on average, reducing the number of equally-scoring plateaus that confuse hill climbing.

Greedy Equivalence Search (GES), developed by Chickering (2002) and Meek (1995), is the premier score-based structure learning algorithm that operates in CPDAG space. GES has three phases:

  1. Forward phase: Start from the empty CPDAG. Repeatedly add the edge (in CPDAG space) that most improves the Bayesian score, until no improvement is possible.
  2. Backward phase: Repeatedly remove edges that improve the score, until no improvement.
  3. Turning phase: Reverse certain edge orientations to further improve the score.

GES is asymptotically correct: under faithfulness, it recovers the true CPDAG as the data size goes to infinity. No other score-based algorithm has this guarantee in polynomial time.

The operations in CPDAG space are more complex than the simple add/remove/reverse operations in DAG space. The textbook describes operators based on Chickering (2002): each operator maintains the CPDAG representation invariant while moving between equivalence classes. The most important are:

OperatorEffectCondition for Validity
Insert(X, Y, T)Add edge X — Y (or X → Y if directed) to create/modify a v-structureX, Y not adjacent; T is a subset of neighbors of Y
Delete(X, Y, H)Remove edge X — Y; H handles v-structure maintenanceX, Y adjacent; H is a subset of neighbors
Reverse(X, Y)Turn X → Y into X ← Y within one equivalence classEdge is currently directed; reversing stays in same class

DAG space

  • Each equivalence class represented multiple times
  • Hill climbing creates score plateaus (equivalent DAGs)
  • Must track and avoid revisiting equivalent structures

CPDAG space

  • Each equivalence class represented exactly once
  • No score plateaus from equivalences
  • Requires specialized edge operations for CPDAGs
Search equivalence classes, not DAGs. The Bayesian score (with BDe priors) is equivalent class consistent: it assigns the same score to all DAGs in the same class. So any score differences between neighboring structures reflect true structural differences, not direction ambiguity. Searching in CPDAG space removes the artificial degeneracy.
CPDAG: Essential Graph Example

A CPDAG (left) with one directed and one undirected edge. Click "Show Members" to enumerate all DAGs in this equivalence class. Directed edges are shared; undirected edges vary.

Click to enumerate members
What does an undirected edge in a CPDAG mean?

Chapter 7: Escaping Local Optima

Hill climbing is greedy: it always moves to the best neighbor and stops at the first local optimum. But local optima are not global optima. Studies show that structure learning landscapes are rugged — many local optima exist even for relatively small networks.

Escaping Local Optima: A Practical Comparison

Four main strategies appear in the literature and textbook:

StrategyMechanismPros / Cons
Random restartsRun hill climbing from 10–50 random initial graphs; take the bestSimple, parallelizable; may hit same optima repeatedly
Simulated annealingAccept worse moves with probability eΔ/T; decrease T over timeCan escape all optima given time; sensitive to cooling schedule
Tabu searchForbid recently tried operations for k stepsForces new exploration; tabu list length is a hyperparameter
Genetic algorithmsMaintain a population of graphs; crossover + mutateDiverse exploration; crossing BN structures is non-trivial

In benchmarks (Tsamardinos et al., 2006), random restarts with 20 starts consistently outperform more sophisticated strategies on most real datasets. The lesson: local optima for structure learning are less severe than expected, because the BIC score landscape is smoother than general combinatorial optimization. Use restarts first; add annealing only if restarts fail.

Several strategies help escape:

StrategyKey IdeaTradeoff
Random restartsRun multiple hill-climbs from random starting graphs; keep best resultSimple; embarrassingly parallel; O(k × search cost)
Simulated annealingAccept worse moves with probability e−Δ/T where T decreases over timeMore exploration early; requires tuning cooling schedule
Tabu searchMaintain a list of recently visited graphs; forbid revisiting themKeeps momentum; needs memory for tabu list
Genetic algorithmsMaintain a population; combine (crossover) good graphs, mutate themPopulation-based; complex to implement correctly

Random restarts is the most widely used in practice. It is simple to implement, trivially parallelizable, and works well when the scoring landscape is not too rugged. The textbook recommends running at least 10–20 restarts for small problems.

Simulated annealing for structure search uses the same acceptance rule as the classical algorithm: at temperature T, a move with score change Δ is accepted with probability min(1, eΔ/T). The temperature decreases according to a cooling schedule, typically geometric: Tk+1 = c · Tk where c ∈ (0, 1) is the cooling rate. The initial temperature T0 is typically set so that the acceptance probability for a "typical" bad move is about 0.8 (early in the search, accept almost anything).

python
import math, random

def simulated_annealing_structure(nodes, score_fn, T0=10.0, cool=0.99, max_iter=5000):
    """Structure search with simulated annealing."""
    edges = set()
    current_score = score_fn(edges)
    best_edges, best_score = set(edges), current_score
    T = T0
    for step in range(max_iter):
        T *= cool
        # Propose a random operation
        a, b = random.sample(nodes, 2)
        new_edges = set(edges)
        if (a, b) in edges:
            new_edges.discard((a, b))     # remove
        elif (b, a) not in edges:
            new_edges.add((a, b))         # add
        else:
            new_edges.discard((b, a)); new_edges.add((a, b))  # reverse
        if has_cycle(new_edges): continue
        new_score = score_fn(new_edges)
        delta = new_score - current_score
        if delta > 0 or random.random() < math.exp(delta / max(T, 1e-10)):
            edges, current_score = new_edges, new_score
            if current_score > best_score:
                best_edges, best_score = set(edges), current_score
    return best_edges
Local optima are the norm, not the exception. Simulation studies show that even simple structure learning problems have many local optima. A single hill-climbing run from an empty graph often fails to find the globally best structure. Always use restarts.
Greedy vs Simulated Annealing

Both search a rugged landscape with a local and global optimum. Greedy gets stuck at the local peak. Annealing can escape it. Click to run.

Click to run
How does simulated annealing differ from greedy hill climbing?

Chapter 8: Showcase — Interactive Hill Climbing

This is the full structure learning experience. You control a 5-node network. There is a hidden "true" graph (shown as ghost edges). The algorithm tries to recover it using hill climbing with restarts. Watch the Bayesian score climb as edges are added, removed, or reversed.

Try clicking "Step" to manually execute one operation. "Auto Run" runs 50 steps. "Restart" starts from a new random graph. The score display shows the current BIC approximation. See how many restarts it takes to find the true structure.

Score = fitness of the current graph. The BIC score shown is relative to the empty graph. Teal edges are in the true structure; orange are spurious (hurt the score but were locally accepted). Ghost dashed edges show what the algorithm should find.
Hill Climbing: 5-Node Structure Recovery

Ghost dashed yellow edges = true structure. Teal = correctly found. Orange = spurious. Adjust temperature to control how often bad moves are accepted.

Temperature (acceptance of bad moves) 0.00
Score: 0 | Steps: 0
In this showcase, what happens when you increase the temperature slider?

Chapter 9: Connections & What's Next

ConceptWhat We LearnedWhere It Goes
Bayesian scoreMarginal likelihood that integrates out parameters; built-in Occam's razorStructure search objective
BIC scoreClosed-form approximation; no prior pseudocounts neededCommon default in practice
K2 searchGreedy given an ordering; fast but ordering-dependentFast initial estimate
Local searchHill climbing with add/remove/reverse; order-freeMost widely used approach
Markov equivalenceDifferent DAGs can encode same independences; only v-structures distinguish themCausal discovery (requires experiments)
CPDAG searchSearch equivalence classes to avoid score plateausPC algorithm, GES algorithm
Escaping optimaRestarts, annealing, tabu, genetic algorithmsGeneral heuristic optimization
From learning to acting. Chapter 6 extends Bayesian networks with action nodes and utility nodes, creating decision networks (influence diagrams). Everything we have learned about inference and structure carries forward. The new ingredient is the maximum expected utility principle.

The PC Algorithm (Peter-Clark)

The PC algorithm (Spirtes, Glymour & Scheines, 1993) takes a fundamentally different approach from score-based search. Instead of scoring graphs, it performs conditional independence tests on the data and constructs the CPDAG directly.

Phase 1: Skeleton
Start with a complete undirected graph. Remove edge X — Y if there exists a set Z such that X ⊥ Y | Z in the data. Test increasingly larger conditioning sets.
Phase 2: V-structures
For each triple X — Z — Y where X — Y is absent: if Z was not in the separating set of X and Y, orient as X → Z ← Y (collider).
Phase 3: Propagate
Apply Meek's orientation rules to orient remaining undirected edges without creating new v-structures or cycles. The result is the CPDAG.

Under three assumptions — (1) the data-generating distribution is Markov to some DAG, (2) faithfulness (no exact cancellations), and (3) we use consistent independence tests — PC correctly recovers the true CPDAG in the infinite-data limit. With finite data, false positives in independence tests can cause errors.

The GES Algorithm (Greedy Equivalence Search)

The GES algorithm (Chickering, 2002) is a score-based method that searches directly in CPDAG space rather than DAG space. It runs in two phases:

PhaseOperationEffect
ForwardInsert: repeatedly add one edge to the CPDAG that maximally improves the score, until no improvement existsOverfits (finds a dense structure)
BackwardDelete: repeatedly remove one edge from the CPDAG that improves the score, until no improvementPrunes spurious edges

GES is proven to consistently recover the true CPDAG in the large-sample limit for BIC scoring. Unlike PC, it doesn't need explicit independence tests. In practice, GES often outperforms PC on medium-dimensional datasets.

Score-based vs constraint-based: two paradigms. Score-based methods (hill climbing, K2, GES) optimize a single criterion and naturally handle sampling noise. Constraint-based methods (PC) make explicit independence decisions that can propagate errors. Hybrid methods (FGES, PCMB) combine the strengths of both. Most modern production structure learning uses score-based methods with greedy search and random restarts.
"Everything should be made as simple as possible, but not simpler."
— attributed to Albert Einstein

The Bayesian score embodies this exactly: the simplest model that explains the data wins.

Practical Structure Learning: A Decision Guide

With so many algorithms to choose from, how do you pick? The answer depends on your priorities:

PriorityBest AlgorithmReasoning
Speed with small dataK2 with expert orderingFastest; domain experts often know rough causal ordering
No prior knowledge availableLocal search (hill climbing) with 10+ restartsOrder-free; works on any dataset
Statistical rigor (consistency)GES (Chickering 2002) or PC algorithmConsistent estimators: recover true CPDAG asymptotically
Very large networks (100+ nodes)FGES (Fast GES) or constraint-based with sparse testsO(n2) or better; scales to thousands of variables
Causal discovery (not just correlation)PC algorithm + domain expert review of directed edgesPC recovers orientable edges; experts resolve the rest
Missing dataEM-based score + local searchImpute missing values inside the scoring loop

For most real applications, the textbook recommends: (1) run local search with BIC from 10–20 random restarts, taking the best result, (2) validate the learned structure against domain knowledge, (3) flip any edges that domain experts identify as directionally wrong. This practical approach outperforms any single algorithm run once.

Which algorithm can recover the correct CPDAG from data under infinite-sample, faithfulness, and causal Markov assumptions?
← Chapter 4: Parameter Learning Chapter 6: Simple Decisions →