Structure Learning — Kochenderfer et al., Chapter 5

Chapter 0: The Structure Problem

Chapter 4 assumed you already knew the Bayesian network structure — which variables are parents of which. But in real applications, you rarely have that luxury. You have a spreadsheet of observations and the question: what graph produced this data?

This is structure learning: finding a directed acyclic graph (DAG) G that best explains the observed dataset D. It seems straightforward until you count the search space. With 10 variables there are 4.2 × 10¹⁸ possible DAGs. With 20 variables: 2.3 × 10⁷². No computer can enumerate them. We need smarter strategies.

The core challenge has two parts: (1) how do we score a graph to decide if it is good? (2) how do we search the space of graphs efficiently? This chapter answers both.

Structure learning is NP-hard. Finding the optimal Bayesian network structure from data is provably NP-hard. No polynomial-time algorithm exists unless P=NP. In practice, we use heuristic search algorithms that find good — but not necessarily optimal — structures.

The Combinatorial Explosion

Use the slider to see how the count of possible DAGs (Robinson's formula) grows with the number of nodes. Even 6 nodes have over 3.7 million possible structures.

Number of nodes 5

How Many DAGs Exist? Robinson's Formula

The number of DAGs with n nodes grows super-exponentially. The exact count a(n) satisfies the Robinson recurrence: a(n) = ∑_k=1ⁿ (-1)^k+1 C(n,k) 2^k(n-k) a(n-k). The first few values:

Nodes (n)	Number of DAGs	Storage at 8 bytes each
1	1	8 bytes
2	3	24 bytes
3	25	200 bytes
5	29,281	234 KB
7	~1.1 billion	9 GB
10	~4.2 × 10¹⁸	~30 million TB
20	>10⁷²	not even close to possible

The jump from 7 nodes (barely fits on disk) to 10 nodes (more than all storage on Earth) underscores why heuristic search is the only option for real-world networks. The key insight is that n=10 is still a small Bayesian network — real medical or genetic networks have hundreds of variables.

Why is exhaustive enumeration of all Bayesian network structures infeasible?

The number of possible DAGs grows super-exponentially with the number of variables We cannot score individual structures DAGs are uncountably infinite

Chapter 1: The Bayesian Score

To compare structures, we need a scoring function. The Bayesian score evaluates how well structure G fits data D by computing the log marginal likelihood — the probability of the data after integrating out (marginalizing) all possible parameter settings:

log P(D | G) = ∑_i=1ⁿ ∑_j=1^q_i [ log Γ(α_ij0) − log Γ(α_ij0 + m_ij0) + ∑_k=1^r_i log Γ(α_ijk + m_ijk) − log Γ(α_ijk) ]

Here m_ijk is the count of observations where variable i takes its k-th value with parents in their j-th configuration. The α_ijk are Dirichlet pseudocounts from the prior — which act as a built-in complexity penalty.

Because this integral marginalizes over all parameters, it scores the structure rather than any one parameter setting. Structures with many parameters need lots of data to justify their complexity — this is automatic Occam's razor.

The score decomposes node-by-node. Because the joint distribution over a BN factorizes as ∏ P(x_i | pa(x_i)), the Bayesian score also decomposes: log P(D|G) = ∑_i score(i, pa(i), D). This local decomposition is crucial for efficient search — we can update the score of just one node when we add, remove, or reverse one edge.

Worked Example: Scoring a 3-Variable Graph

Let's compute the Bayesian score for a tiny dataset. We have 3 binary variables A, B, C and 8 data points. We compare two graphs: the empty graph (no edges) vs. A → B.

Data (each row is one observation of [A, B, C]):

#	A	B	C
1	0	0	0
2	0	0	1
3	0	1	0
4	1	0	1
5	1	1	0
6	1	1	1
7	1	1	0
8	0	0	1

Use α = 1 (one virtual sample). For binary variables with q_i parent configurations, α_ij0 = 1/q_i, α_ijk = 1/(2q_i).

Scoring node B with no parents (empty graph): B's value counts: B=0 appears 4 times, B=1 appears 4 times. With α_ij0=1, α_ijk=0.5:

score(B, ∅) = [Γ(1) − Γ(9)] + [Γ(4.5) − Γ(0.5)] + [Γ(4.5) − Γ(0.5)]

Computing: log Γ(1)=0, log Γ(9)=log(8!)≈10.60, log Γ(4.5)≈2.15, log Γ(0.5)≈0.572. Result: score(B, ∅) ≈ 0 − 10.60 + 2.15 − 0.572 + 2.15 − 0.572 = −7.45.

Scoring node B with parent A: Now q=2 (A=0 or A=1). For A=0: B=0 three times, B=1 once. For A=1: B=0 once, B=1 three times. With α_ij0=0.5, α_ijk=0.25:

score(B, {A}) ≈ 2×[Γ(0.5)−Γ(4.5)] + [Γ(3.25)+Γ(1.25)−2Γ(0.25)] + [Γ(1.25)+Γ(3.25)−2Γ(0.25)]

Numerically, this comes out to approximately −5.61. Since −5.61 > −7.45, the edge A→B is justified by this data. The score improved by 1.84 log-units — the evidence for A causing B is real.

Marginal likelihood ratios are Bayes factors. The difference in Bayesian scores is the log of the Bayes factor for comparing the two structures. A difference of 2 log-units corresponds to a Bayes factor of e²≈7.4 — "substantial" evidence by Jeffreys' scale.

The textbook uses the BDe prior (Bayesian Dirichlet equivalent) where the pseudocounts are chosen so that the score is equivalent class consistent: Markov-equivalent structures receive the same score. The simplest choice is α_ijk = α/(q_i · r_i) where α is the total number of virtual samples. Common defaults: α = 1 (one virtual sample spread uniformly over all configurations).

python
from math import lgamma  # log-gamma for numerical stability
from collections import defaultdict

def bayesian_score_node(data, child, parents, alpha=1.0, r_vals=2):
    """Local Bayesian score for one node given its parents (binary variables)."""
    pa_configs = defaultdict(lambda: defaultdict(int))
    for row in data:
        pa = tuple(row[p] for p in parents)
        pa_configs[pa][row[child]] += 1
    q = 2 ** len(parents)           # number of parent configurations
    alpha_ij0 = alpha / q            # pseudocounts per parent config
    alpha_ijk = alpha_ij0 / r_vals   # pseudocounts per (parent, child) cell
    score = 0.0
    for pa_cfg, child_counts in pa_configs.items():
        m_ij0 = sum(child_counts.values())     # total count for this parent config
        score += lgamma(alpha_ij0) - lgamma(alpha_ij0 + m_ij0)
        for k in range(r_vals):
            m_ijk = child_counts.get(k, 0)
            score += lgamma(alpha_ijk + m_ijk) - lgamma(alpha_ijk)
    return score

def bayesian_score(data, graph_edges, nodes):
    """Total Bayesian score = sum of local scores."""
    parents = {n: [] for n in nodes}
    for (pa, ch) in graph_edges:
        parents[ch].append(pa)
    return sum(bayesian_score_node(data, n, parents[n]) for n in nodes)

Score vs Data: Simple vs Complex Structures

Three 3-variable structures compete. With few data points, the simple model wins; with many, the true model wins. The complex model is never optimal here.

Data samples (log scale) 100

Why does the Bayesian score penalize complex structures when data is scarce?

Integrating over parameters dilutes the likelihood for models with too many parameters when data is sparse It includes an explicit penalty term proportional to edge count Complex structures always have lower likelihood regardless of data size

Chapter 2: The BIC Score

The Bayesian score requires choosing Dirichlet pseudocounts. An alternative is the Bayesian Information Criterion (BIC), which approximates the log marginal likelihood using the maximum likelihood and a closed-form complexity penalty:

BIC(G, D) = ℓ(θ^*, D, G) − ^dim(G)/₂ · log m

Here ℓ(θ^*, D, G) is the log-likelihood at the MLE parameters, dim(G) is the number of free parameters in the model, and m is the number of data points. The penalty term dim(G)/2 · log m grows with both model complexity and dataset size.

For binary variables in a graph G, the dimension is dim(G) = ∑_i 2^|pa(i)|. Each additional parent of node i doubles its CPT size. This exponential growth in parameters is the main reason we prefer sparse graphs.

BIC vs Bayesian score in practice. Both implement Occam's razor, but differently. BIC has no free prior parameters and converges to the correct structure as data grows (model selection consistency). The Bayesian score requires specifying pseudocounts but is exact rather than approximate. For most practical problems, they give similar results.

BIC vs AIC vs MDL: Three Flavors of Penalized Likelihood

The BIC is not the only penalized likelihood criterion. Three criteria dominate the structure learning literature:

Criterion	Formula	Penalty Grows With	Consistency
AIC (Akaike Info. Criterion)	ℓ(θ*) − dim(G)	Only model complexity, not m	No — selects overly complex models asymptotically
BIC (Bayesian Info. Criterion)	ℓ(θ*) − dim(G)/2 · log m	Both model complexity and m	Yes — recovers true structure as m → ∞
MDL (Minimum Description Length)	Code length of data + model	Both, but with universal prior	Yes — same asymptotic behavior as BIC

The key difference between AIC and BIC: AIC penalizes each additional parameter by exactly 1 (constant), while BIC penalizes by log(m)/2 (grows with data). As m increases, BIC penalizes more aggressively, which causes it to prefer simpler models. This is why BIC is consistent (correct asymptotically) while AIC is not.

In practice: with small data (m < 100), AIC is often preferred because it allows more complex models that can fit the data well. With large data, BIC is preferred because it correctly identifies the structure that generated the data. The textbook uses BIC as the default because structure learning problems typically have large datasets in real applications.

One important subtlety: BIC and AIC are derived under the assumption that data points are i.i.d. (independent and identically distributed). For Bayesian networks, this means assuming that the rows of the data matrix are independent observations from the same joint distribution. If there is temporal autocorrelation (time-series data) or hidden confounders, BIC may select incorrectly even with large data. The textbook assumes i.i.d. data throughout Chapters 1–5; Chapter 13 addresses time-dependent data separately.

A numerical illustration: for a candidate edge that adds 1 free parameter and improves the log-likelihood by 3.5 nats:

AIC delta = +3.5 − 1 = +2.5 → add the edge

BIC delta (m=100) = +3.5 − log(100)/2 = +3.5 − 2.3 = +1.2 → still add

BIC delta (m=2000) = +3.5 − log(2000)/2 = +3.5 − 3.8 = −0.3 → reject!

The same edge that AIC accepts for any dataset size is rejected by BIC when we have 2000+ data points — because with enough data, we have statistical power to determine the edge isn't really there.

python
import numpy as np
from collections import Counter

def bic_score(data, child, parents, r_child=2):
    """BIC score for child given parents (binary variables assumed)."""
    m = len(data)
    dim = r_child * (2 ** len(parents))  # free params per CPT row
    penalty = (dim / 2) * np.log(m)

    # Compute log-likelihood at MLE
    ll = 0.0
    for row in data:
        pa_cfg = tuple(row[p] for p in parents)
        x_val  = row[child]
        # Count occurrences of this (pa_cfg, x_val) pair
        n_pa = sum(1 for d in data if tuple(d[p] for p in parents) == pa_cfg)
        n_px = sum(1 for d in data if
                   tuple(d[p] for p in parents) == pa_cfg and d[child] == x_val)
        if n_pa > 0:
            ll += np.log(n_px / n_pa + 1e-10)
    return ll - penalty

BIC Penalty Grows with Data

Drag the slider to see how the BIC penalty (dim/2 × log m) grows. More data punishes extra parameters more harshly. The teal line shows when the penalty exceeds the gain for a hypothetical edge.

Data size m 100

A quick derivation of BIC: the marginal likelihood log P(D|G) is approximated by the Laplace approximation around the MLE. The Hessian of the log-likelihood contributes a term −(dim/2)log m + constant. The dominant terms are the MLE log-likelihood and this Hessian term, giving BIC = ℓ(θ*) − dim/2 · log m. The constant terms cancel when comparing models, which is why the textbook's formula omits them.

The BIC formula also illuminates why structure learning is harder with fewer data. With m=20 data points and a candidate edge adding 1 parameter to a binary CPT: BIC delta = log-likelihood gain − log(20)/2 ≈ LL gain − 1.5. A spurious edge that increases LL by 2 (due to noise, not signal) would be incorrectly accepted. With m=500: BIC delta = LL gain − log(500)/2 ≈ LL gain − 3.1 — the same spurious edge is now correctly rejected. This explains why structure learning requires large datasets (typically 5–10× the number of parameters in the network) to be reliable.

A node has 3 binary parents. How many free parameters does its CPT have?

8 — 2³ parent configurations, each needs 1 free parameter (binary child means one value determines the other) 3 16

Chapter 3: K2 Search

The K2 algorithm is one of the earliest practical structure learning methods. It requires a user-specified variable ordering and starts from an empty graph. For each variable (processed in order), it greedily adds the parent that most improves the score, stopping when no further addition helps.

1. Order

Fix a variable ordering: X₁, X₂, ..., X_n

↓

2. Loop over variables

For each X_i: start with no parents. Repeatedly try adding a parent from {X₁, ..., X_i-1}. Keep the best one if it improves the score.

↓

3. Stop

Move to X_i+1 when no more parent additions improve the score

K2 is fast — it runs in polynomial time — but depends critically on the variable ordering. If the true parent comes later in the ordering than its child, K2 can never find that edge. Different orderings can produce very different graphs.

The ordering matters enormously. K2 can only add parents from variables earlier in the ordering. If A is the true cause of B but A appears after B in the ordering, K2 will never discover the A → B edge. Domain knowledge is crucial here. Alternatively, try multiple orderings and take the best result.

The Cooper & Herskovits (1992) Paper

K2 was introduced in "A Bayesian Method for the Induction of Probabilistic Networks from Data" (Cooper & Herskovits, Machine Learning, 1992). The paper was one of the first to treat Bayesian network structure learning as a Bayesian inference problem. The key insight: instead of searching for parameters given a fixed structure, we search for structures by computing their marginal likelihood.

The paper introduced the "BD score" (Bayesian Dirichlet), which is the same Bayesian score formula from Chapter 1 with a specific prior: the "parameter independence assumption" (each node's CPT is independent a priori) plus "parameter modularity" (the prior only depends on the local family, not the global graph). These two assumptions together yield the decomposable form: log P(D|G) = ∑_i score(i, pa(i), D).

The original paper validated K2 on three benchmark datasets: the ALARM network (37 nodes, medical diagnosis), the expert-designed MUNIN network, and a small genetics application. On ALARM with 10,000 samples and the correct ordering, K2 recovered the true network exactly. This validation established Bayesian network structure learning as a practical tool and inspired decades of subsequent work.

Historical impact. The Cooper & Herskovits paper is one of the most cited papers in the machine learning literature, with over 5,000 citations. It launched the field of probabilistic graphical model learning. The BDe score variant (Heckerman et al., 1995), which ensures equivalence class consistency, became the standard default in structure learning packages and is what the textbook refers to as "the Bayesian score."

The K2 algorithm specifically uses this prior. The key formula for node i with parent configuration j and child value k, with m_ijk observed counts and alpha pseudocounts spread uniformly: this integrates the Dirichlet-multinomial conjugate pair analytically, giving a closed-form expression in terms of log-gamma functions (see code above). The beauty is that no numerical integration is needed — the marginal likelihood has an exact formula.

The Python library pgmpy provides a complete K2 implementation. The key function is pgmpy.estimators.K2Score combined with pgmpy.estimators.HillClimbSearch. The R package bnlearn is the reference implementation, with rsmax2() for K2 and hc() for hill climbing. Both libraries implement the incremental cache update optimization and produce equivalent results to the textbook's formulation.

The Cooper & Herskovits K2 implementation used a hardcoded max-parent limit of u=2 in the original experiments. Modern libraries default to u=10 or no limit, relying on the score function's built-in regularization. If computational budget is a concern, setting u=3 or u=4 captures most true edges while keeping CPT sizes manageable: a node with 4 binary parents has 2⁴×2 = 32 parameters — already a substantial CPT that requires many samples to estimate reliably.

The original K2 paper (Cooper & Herskovits, 1992) also limits the number of parents per node to a user-specified maximum u. This controls CPT size and prevents overfitting. K2 runs in O(n · u²) score evaluations — fast even for 50+ variables if u is small.

K2 Design Choice	Effect on Result
Variable ordering	Determines which edges are possible; a bad ordering misses true edges entirely
Max parents u	Limits CPT size; prevents exponential parameter blowup
Score function	Bayesian (with α) or BIC both work; Bayesian score is more principled
Stopping criterion	No improvement OR u parents reached — whichever comes first

K2 Search Step-by-Step

Watch K2 greedily add edges to maximize the score. "Step" adds one best parent. "Shuffle" tries a new random ordering. Teal edges are correct; orange are spurious.

Order: A,B,C,D

The Cooper & Herskovits K2 algorithm also introduced a practical heuristic for choosing the variable ordering: sort variables by the number of their likely parents (fewest parents first). This heuristic works because K2 can only add parents from earlier in the ordering — if a node is later in the ordering, more of its potential parents are available. In practice, topological sorting of a rough domain-expert graph is the standard approach for initializing the ordering.

What is the key limitation of K2 search?

It requires a variable ordering and cannot discover edges from variables appearing later in the ordering to those appearing earlier It always produces fully connected graphs It cannot handle more than 5 variables

Chapter 4: Local Graph Search

Local search (hill climbing) needs no variable ordering. Starting from an initial graph (often empty), it repeatedly considers all neighboring graphs — those reachable by one of three basic operations — and moves to the highest-scoring one:

Add Edge

Insert a new directed edge X_i → X_j (only if it does not create a cycle)

↓

Remove Edge

Delete an existing directed edge X_i → X_j

↓

Reverse Edge

Flip direction: X_i → X_j becomes X_j → X_i (only if this stays acyclic)

The search continues until no neighbor scores higher than the current graph. Since the Bayesian score decomposes node-by-node, evaluating each operation only requires recomputing scores for the two affected nodes — making each step cheap.

The decomposability payoff. Because score(G) = ∑_i score(i, pa(i)), evaluating the effect of adding edge A → B requires only recomputing score(B, pa(B) ∪ {A}) — not rescoring the whole graph. This keeps each hill-climbing step O(n) rather than O(m) where m is the data size.

python
def local_search(nodes, score_fn, max_iter=1000):
    """Hill-climbing structure search. score_fn(i, parents) returns local score."""
    edges = set()  # start from empty graph
    for _ in range(max_iter):
        best_delta, best_op = 0, None
        for i in nodes:
            for j in nodes:
                if i == j: continue
                # Try add i->j
                if (i, j) not in edges and not creates_cycle(edges, i, j):
                    delta = score_fn(j, parents(j, edges) | {i}) - score_fn(j, parents(j, edges))
                    if delta > best_delta: best_delta, best_op = delta, ('add', i, j)
                # Try remove i->j
                if (i, j) in edges:
                    delta = score_fn(j, parents(j, edges) - {i}) - score_fn(j, parents(j, edges))
                    if delta > best_delta: best_delta, best_op = delta, ('remove', i, j)
        if best_op is None: break  # local optimum
        apply_op(edges, best_op)
    return edges

Complexity and Practical Optimizations

Naive local search evaluates all n² possible edges at each step — O(n²) score evaluations per iteration, each costing O(m) to compute over m data points. For n=50 variables and m=10,000 data points, one step requires 2,500 score evaluations each scanning 10,000 rows: 25 million row reads per step. With 1,000 steps, that is 25 billion row reads. Prohibitively slow.

Two optimizations make local search tractable:

Caching sufficient statistics: The Bayesian score for node i depends only on the counts m_ijk. These counts can be precomputed in one O(m × n) pass. After adding an edge A → B, only the count table for B needs updating: O(m) not O(m × n).
Priority queue of candidates: Instead of evaluating all O(n²) edge candidates at each step, maintain a priority queue sorted by score delta. Only re-evaluate candidates involving the modified nodes after each operation. Reduces per-step cost to O(n) amortized.

Optimization	Time Saving	Implementation Cost
Precomputed count tables	Turns O(m) per candidate into O(r_iq_i) lookup	Low — one-pass scan at start
Incremental cache update	Only update counts for affected node after each op	Medium — careful bookkeeping
Priority queue of candidates	Reduces to O(n) re-evaluations per step	Medium — standard heap operations
Early stopping on reversal	Only try reversals of existing edges, not all pairs	Low — check edge set before evaluating

These optimizations are implemented in standard libraries. The pgmpy Python package implements the full optimized local search. The bnlearn R package is the reference implementation used in most structure learning benchmarks. Both can handle networks with 100+ nodes in minutes.

Which of the three graph operations in local search must be checked for cycle creation?

Add edge and reverse edge — both can create cycles; removing an edge never does Only reverse edge All three

Chapter 5: Markov Equivalence

Here is a surprising fact: two very different-looking Bayesian networks can be statistically identical. The networks A → B and A ← B make the same conditional independence statement (A and B are dependent, no independences). Any dataset generated from one could equally well have been generated from the other with suitable parameters. We cannot tell edge directions apart from data alone in this case.

Two DAGs are Markov equivalent if and only if they encode the exact same set of conditional independence relationships. The textbook gives the elegant characterization (Verma & Pearl, 1990): two DAGs are Markov equivalent if and only if they have (1) the same skeleton (same edges ignoring direction), and (2) the same v-structures (immoralities).

What is a v-structure (immorality)? A pattern X → Z ← Y where X and Y are not adjacent is called a v-structure. It is the ONLY pattern where edge direction changes the encoded independence structure: in X → Z ← Y, X and Y are marginally independent but become dependent given Z. Reversing either edge would break this property.

The Bayesian score with BDe priors assigns identical scores to all members of the same Markov equivalence class. This means structure learning can recover the correct equivalence class but not always the exact DAG. V-structures and domain knowledge are needed to determine edge directions within a class.

Causal reasoning requires more than conditional independence. Suppose we learn the CPDAG A — B → C from data. We know B causes C (direction is certain), but we don't know if A causes B or B causes A. If we intervene on A (a randomized experiment), we can resolve this ambiguity: if changing A changes B, then A → B. If not, then A ← B. The CPDAG represents the limit of what observational data can tell us; the resolved DAG requires intervention. This is Pearl's fundamental argument for why causal inference is distinct from statistical inference.

A worked example: consider A → B → C, A ← B → C, and A → B ← C. The first two have the same skeleton (A—B, B—C) and no v-structures: they are Markov equivalent (both encode A ⊥ C | B). The third (A → B ← C) has a v-structure at B: A and C are marginally independent but become dependent given B. It forms its own equivalence class. Data can distinguish the third from the first two, but cannot distinguish the first from the second.

This has profound implications for causal discovery. Causality requires knowing edge directions. Observational data can recover the CPDAG (equivalence class) but not causal directions within it. Resolving directions requires: (1) domain knowledge (time ordering, physical constraints), (2) interventional data (randomized experiments), or (3) additional assumptions (non-Gaussian noise models, as in LiNGAM).

Markov Equivalence Explorer

Graph 1 (left) is fixed as A→B→C. Flip edges in Graph 2 (right). Orange edges form v-structures. When the v-structures match, the graphs are Markov equivalent.

Checking equivalence...

When do two DAGs encode different conditional independence statements?

When they have different skeletons or different sets of v-structures Whenever any edge direction differs Only when they have different numbers of edges

Chapter 6: CPDAG Search

A completed partially directed acyclic graph (CPDAG), also called an essential graph, is a compact representation of an entire Markov equivalence class. It contains both directed and undirected edges: directed edges appear where all members of the class agree on the direction; undirected edges appear where the direction varies.

The textbook describes a key insight: instead of searching the space of DAGs and encountering many equivalent graphs with identical scores, we can search directly in the space of CPDAGs. This space is roughly 3.7 times smaller than the DAG space on average, reducing the number of equally-scoring plateaus that confuse hill climbing.

Greedy Equivalence Search (GES), developed by Chickering (2002) and Meek (1995), is the premier score-based structure learning algorithm that operates in CPDAG space. GES has three phases:

Forward phase: Start from the empty CPDAG. Repeatedly add the edge (in CPDAG space) that most improves the Bayesian score, until no improvement is possible.
Backward phase: Repeatedly remove edges that improve the score, until no improvement.
Turning phase: Reverse certain edge orientations to further improve the score.

GES is asymptotically correct: under faithfulness, it recovers the true CPDAG as the data size goes to infinity. No other score-based algorithm has this guarantee in polynomial time.

The operations in CPDAG space are more complex than the simple add/remove/reverse operations in DAG space. The textbook describes operators based on Chickering (2002): each operator maintains the CPDAG representation invariant while moving between equivalence classes. The most important are:

Operator	Effect	Condition for Validity
Insert(X, Y, T)	Add edge X — Y (or X → Y if directed) to create/modify a v-structure	X, Y not adjacent; T is a subset of neighbors of Y
Delete(X, Y, H)	Remove edge X — Y; H handles v-structure maintenance	X, Y adjacent; H is a subset of neighbors
Reverse(X, Y)	Turn X → Y into X ← Y within one equivalence class	Edge is currently directed; reversing stays in same class

DAG space

Each equivalence class represented multiple times
Hill climbing creates score plateaus (equivalent DAGs)
Must track and avoid revisiting equivalent structures

CPDAG space

Each equivalence class represented exactly once
No score plateaus from equivalences
Requires specialized edge operations for CPDAGs

Search equivalence classes, not DAGs. The Bayesian score (with BDe priors) is equivalent class consistent: it assigns the same score to all DAGs in the same class. So any score differences between neighboring structures reflect true structural differences, not direction ambiguity. Searching in CPDAG space removes the artificial degeneracy.

CPDAG: Essential Graph Example

A CPDAG (left) with one directed and one undirected edge. Click "Show Members" to enumerate all DAGs in this equivalence class. Directed edges are shared; undirected edges vary.

Click to enumerate members

What does an undirected edge in a CPDAG mean?

Some DAGs in the equivalence class have the edge going one way, others have it going the other way — data cannot determine the direction The edge is absent in all member DAGs The edge is bidirectional in all member DAGs

Chapter 7: Escaping Local Optima

Hill climbing is greedy: it always moves to the best neighbor and stops at the first local optimum. But local optima are not global optima. Studies show that structure learning landscapes are rugged — many local optima exist even for relatively small networks.

Escaping Local Optima: A Practical Comparison

Four main strategies appear in the literature and textbook:

Strategy	Mechanism	Pros / Cons
Random restarts	Run hill climbing from 10–50 random initial graphs; take the best	Simple, parallelizable; may hit same optima repeatedly
Simulated annealing	Accept worse moves with probability e^Δ/T; decrease T over time	Can escape all optima given time; sensitive to cooling schedule
Tabu search	Forbid recently tried operations for k steps	Forces new exploration; tabu list length is a hyperparameter
Genetic algorithms	Maintain a population of graphs; crossover + mutate	Diverse exploration; crossing BN structures is non-trivial

In benchmarks (Tsamardinos et al., 2006), random restarts with 20 starts consistently outperform more sophisticated strategies on most real datasets. The lesson: local optima for structure learning are less severe than expected, because the BIC score landscape is smoother than general combinatorial optimization. Use restarts first; add annealing only if restarts fail.

Several strategies help escape:

Strategy	Key Idea	Tradeoff
Random restarts	Run multiple hill-climbs from random starting graphs; keep best result	Simple; embarrassingly parallel; O(k × search cost)
Simulated annealing	Accept worse moves with probability e^−Δ/T where T decreases over time	More exploration early; requires tuning cooling schedule
Tabu search	Maintain a list of recently visited graphs; forbid revisiting them	Keeps momentum; needs memory for tabu list
Genetic algorithms	Maintain a population; combine (crossover) good graphs, mutate them	Population-based; complex to implement correctly

Random restarts is the most widely used in practice. It is simple to implement, trivially parallelizable, and works well when the scoring landscape is not too rugged. The textbook recommends running at least 10–20 restarts for small problems.

Simulated annealing for structure search uses the same acceptance rule as the classical algorithm: at temperature T, a move with score change Δ is accepted with probability min(1, e^Δ/T). The temperature decreases according to a cooling schedule, typically geometric: T_k+1 = c · T_k where c ∈ (0, 1) is the cooling rate. The initial temperature T₀ is typically set so that the acceptance probability for a "typical" bad move is about 0.8 (early in the search, accept almost anything).

python
import math, random

def simulated_annealing_structure(nodes, score_fn, T0=10.0, cool=0.99, max_iter=5000):
    """Structure search with simulated annealing."""
    edges = set()
    current_score = score_fn(edges)
    best_edges, best_score = set(edges), current_score
    T = T0
    for step in range(max_iter):
        T *= cool
        # Propose a random operation
        a, b = random.sample(nodes, 2)
        new_edges = set(edges)
        if (a, b) in edges:
            new_edges.discard((a, b))     # remove
        elif (b, a) not in edges:
            new_edges.add((a, b))         # add
        else:
            new_edges.discard((b, a)); new_edges.add((a, b))  # reverse
        if has_cycle(new_edges): continue
        new_score = score_fn(new_edges)
        delta = new_score - current_score
        if delta > 0 or random.random() < math.exp(delta / max(T, 1e-10)):
            edges, current_score = new_edges, new_score
            if current_score > best_score:
                best_edges, best_score = set(edges), current_score
    return best_edges

Local optima are the norm, not the exception. Simulation studies show that even simple structure learning problems have many local optima. A single hill-climbing run from an empty graph often fails to find the globally best structure. Always use restarts.

Greedy vs Simulated Annealing

Both search a rugged landscape with a local and global optimum. Greedy gets stuck at the local peak. Annealing can escape it. Click to run.

Click to run

How does simulated annealing differ from greedy hill climbing?

Annealing occasionally accepts moves to worse graphs, with this probability decreasing over time Annealing always moves to the worst neighbor Annealing randomly restarts from scratch every iteration

Chapter 8: Showcase — Interactive Hill Climbing

This is the full structure learning experience. You control a 5-node network. There is a hidden "true" graph (shown as ghost edges). The algorithm tries to recover it using hill climbing with restarts. Watch the Bayesian score climb as edges are added, removed, or reversed.

Try clicking "Step" to manually execute one operation. "Auto Run" runs 50 steps. "Restart" starts from a new random graph. The score display shows the current BIC approximation. See how many restarts it takes to find the true structure.

Score = fitness of the current graph. The BIC score shown is relative to the empty graph. Teal edges are in the true structure; orange are spurious (hurt the score but were locally accepted). Ghost dashed edges show what the algorithm should find.

Hill Climbing: 5-Node Structure Recovery

Ghost dashed yellow edges = true structure. Teal = correctly found. Orange = spurious. Adjust temperature to control how often bad moves are accepted.

Temperature (acceptance of bad moves) 0.00

Score: 0 | Steps: 0

In this showcase, what happens when you increase the temperature slider?

The algorithm occasionally accepts moves to lower-scoring graphs, which can help escape local optima but also introduces noise The algorithm converges faster The true structure is revealed immediately

Chapter 9: Connections & What's Next

Concept	What We Learned	Where It Goes
Bayesian score	Marginal likelihood that integrates out parameters; built-in Occam's razor	Structure search objective
BIC score	Closed-form approximation; no prior pseudocounts needed	Common default in practice
K2 search	Greedy given an ordering; fast but ordering-dependent	Fast initial estimate
Local search	Hill climbing with add/remove/reverse; order-free	Most widely used approach
Markov equivalence	Different DAGs can encode same independences; only v-structures distinguish them	Causal discovery (requires experiments)
CPDAG search	Search equivalence classes to avoid score plateaus	PC algorithm, GES algorithm
Escaping optima	Restarts, annealing, tabu, genetic algorithms	General heuristic optimization

From learning to acting. Chapter 6 extends Bayesian networks with action nodes and utility nodes, creating decision networks (influence diagrams). Everything we have learned about inference and structure carries forward. The new ingredient is the maximum expected utility principle.

The PC Algorithm (Peter-Clark)

The PC algorithm (Spirtes, Glymour & Scheines, 1993) takes a fundamentally different approach from score-based search. Instead of scoring graphs, it performs conditional independence tests on the data and constructs the CPDAG directly.

Phase 1: Skeleton

Start with a complete undirected graph. Remove edge X — Y if there exists a set Z such that X ⊥ Y | Z in the data. Test increasingly larger conditioning sets.

↓

Phase 2: V-structures

For each triple X — Z — Y where X — Y is absent: if Z was not in the separating set of X and Y, orient as X → Z ← Y (collider).

↓

Phase 3: Propagate

Apply Meek's orientation rules to orient remaining undirected edges without creating new v-structures or cycles. The result is the CPDAG.

Under three assumptions — (1) the data-generating distribution is Markov to some DAG, (2) faithfulness (no exact cancellations), and (3) we use consistent independence tests — PC correctly recovers the true CPDAG in the infinite-data limit. With finite data, false positives in independence tests can cause errors.

The GES Algorithm (Greedy Equivalence Search)

The GES algorithm (Chickering, 2002) is a score-based method that searches directly in CPDAG space rather than DAG space. It runs in two phases:

Phase	Operation	Effect
Forward	Insert: repeatedly add one edge to the CPDAG that maximally improves the score, until no improvement exists	Overfits (finds a dense structure)
Backward	Delete: repeatedly remove one edge from the CPDAG that improves the score, until no improvement	Prunes spurious edges

GES is proven to consistently recover the true CPDAG in the large-sample limit for BIC scoring. Unlike PC, it doesn't need explicit independence tests. In practice, GES often outperforms PC on medium-dimensional datasets.

Score-based vs constraint-based: two paradigms. Score-based methods (hill climbing, K2, GES) optimize a single criterion and naturally handle sampling noise. Constraint-based methods (PC) make explicit independence decisions that can propagate errors. Hybrid methods (FGES, PCMB) combine the strengths of both. Most modern production structure learning uses score-based methods with greedy search and random restarts.

"Everything should be made as simple as possible, but not simpler."

— attributed to Albert Einstein

The Bayesian score embodies this exactly: the simplest model that explains the data wins.

Practical Structure Learning: A Decision Guide

With so many algorithms to choose from, how do you pick? The answer depends on your priorities:

Priority	Best Algorithm	Reasoning
Speed with small data	K2 with expert ordering	Fastest; domain experts often know rough causal ordering
No prior knowledge available	Local search (hill climbing) with 10+ restarts	Order-free; works on any dataset
Statistical rigor (consistency)	GES (Chickering 2002) or PC algorithm	Consistent estimators: recover true CPDAG asymptotically
Very large networks (100+ nodes)	FGES (Fast GES) or constraint-based with sparse tests	O(n²) or better; scales to thousands of variables
Causal discovery (not just correlation)	PC algorithm + domain expert review of directed edges	PC recovers orientable edges; experts resolve the rest
Missing data	EM-based score + local search	Impute missing values inside the scoring loop

For most real applications, the textbook recommends: (1) run local search with BIC from 10–20 random restarts, taking the best result, (2) validate the learned structure against domain knowledge, (3) flip any edges that domain experts identify as directionally wrong. This practical approach outperforms any single algorithm run once.

Which algorithm can recover the correct CPDAG from data under infinite-sample, faithfulness, and causal Markov assumptions?

The PC algorithm — it identifies the CPDAG through constraint-based conditional independence tests K2 with the correct ordering always recovers the true graph No algorithm can recover structure from observational data alone

← Chapter 4: Parameter Learning Chapter 6: Simple Decisions →