You have data but no graph. This chapter builds Bayesian network structure from scratch — scoring how well every possible graph explains the data, then searching a super-exponential space to find the best one.
Chapter 4 assumed you already knew the Bayesian network structure — which variables are parents of which. But in real applications, you rarely have that luxury. You have a spreadsheet of observations and the question: what graph produced this data?
This is structure learning: finding a directed acyclic graph (DAG) G that best explains the observed dataset D. It seems straightforward until you count the search space. With 10 variables there are 4.2 × 1018 possible DAGs. With 20 variables: 2.3 × 1072. No computer can enumerate them. We need smarter strategies.
The core challenge has two parts: (1) how do we score a graph to decide if it is good? (2) how do we search the space of graphs efficiently? This chapter answers both.
Use the slider to see how the count of possible DAGs (Robinson's formula) grows with the number of nodes. Even 6 nodes have over 3.7 million possible structures.
The number of DAGs with n nodes grows super-exponentially. The exact count a(n) satisfies the Robinson recurrence: a(n) = ∑k=1n (-1)k+1 C(n,k) 2k(n-k) a(n-k). The first few values:
| Nodes (n) | Number of DAGs | Storage at 8 bytes each |
|---|---|---|
| 1 | 1 | 8 bytes |
| 2 | 3 | 24 bytes |
| 3 | 25 | 200 bytes |
| 5 | 29,281 | 234 KB |
| 7 | ~1.1 billion | 9 GB |
| 10 | ~4.2 × 1018 | ~30 million TB |
| 20 | >1072 | not even close to possible |
The jump from 7 nodes (barely fits on disk) to 10 nodes (more than all storage on Earth) underscores why heuristic search is the only option for real-world networks. The key insight is that n=10 is still a small Bayesian network — real medical or genetic networks have hundreds of variables.
To compare structures, we need a scoring function. The Bayesian score evaluates how well structure G fits data D by computing the log marginal likelihood — the probability of the data after integrating out (marginalizing) all possible parameter settings:
Here mijk is the count of observations where variable i takes its k-th value with parents in their j-th configuration. The αijk are Dirichlet pseudocounts from the prior — which act as a built-in complexity penalty.
Because this integral marginalizes over all parameters, it scores the structure rather than any one parameter setting. Structures with many parameters need lots of data to justify their complexity — this is automatic Occam's razor.
Let's compute the Bayesian score for a tiny dataset. We have 3 binary variables A, B, C and 8 data points. We compare two graphs: the empty graph (no edges) vs. A → B.
Data (each row is one observation of [A, B, C]):
| # | A | B | C |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 |
| 4 | 1 | 0 | 1 |
| 5 | 1 | 1 | 0 |
| 6 | 1 | 1 | 1 |
| 7 | 1 | 1 | 0 |
| 8 | 0 | 0 | 1 |
Use α = 1 (one virtual sample). For binary variables with qi parent configurations, αij0 = 1/qi, αijk = 1/(2qi).
Scoring node B with no parents (empty graph): B's value counts: B=0 appears 4 times, B=1 appears 4 times. With αij0=1, αijk=0.5:
Computing: log Γ(1)=0, log Γ(9)=log(8!)≈10.60, log Γ(4.5)≈2.15, log Γ(0.5)≈0.572. Result: score(B, ∅) ≈ 0 − 10.60 + 2.15 − 0.572 + 2.15 − 0.572 = −7.45.
Scoring node B with parent A: Now q=2 (A=0 or A=1). For A=0: B=0 three times, B=1 once. For A=1: B=0 once, B=1 three times. With αij0=0.5, αijk=0.25:
Numerically, this comes out to approximately −5.61. Since −5.61 > −7.45, the edge A→B is justified by this data. The score improved by 1.84 log-units — the evidence for A causing B is real.
The textbook uses the BDe prior (Bayesian Dirichlet equivalent) where the pseudocounts are chosen so that the score is equivalent class consistent: Markov-equivalent structures receive the same score. The simplest choice is αijk = α/(qi · ri) where α is the total number of virtual samples. Common defaults: α = 1 (one virtual sample spread uniformly over all configurations).
python from math import lgamma # log-gamma for numerical stability from collections import defaultdict def bayesian_score_node(data, child, parents, alpha=1.0, r_vals=2): """Local Bayesian score for one node given its parents (binary variables).""" pa_configs = defaultdict(lambda: defaultdict(int)) for row in data: pa = tuple(row[p] for p in parents) pa_configs[pa][row[child]] += 1 q = 2 ** len(parents) # number of parent configurations alpha_ij0 = alpha / q # pseudocounts per parent config alpha_ijk = alpha_ij0 / r_vals # pseudocounts per (parent, child) cell score = 0.0 for pa_cfg, child_counts in pa_configs.items(): m_ij0 = sum(child_counts.values()) # total count for this parent config score += lgamma(alpha_ij0) - lgamma(alpha_ij0 + m_ij0) for k in range(r_vals): m_ijk = child_counts.get(k, 0) score += lgamma(alpha_ijk + m_ijk) - lgamma(alpha_ijk) return score def bayesian_score(data, graph_edges, nodes): """Total Bayesian score = sum of local scores.""" parents = {n: [] for n in nodes} for (pa, ch) in graph_edges: parents[ch].append(pa) return sum(bayesian_score_node(data, n, parents[n]) for n in nodes)
Three 3-variable structures compete. With few data points, the simple model wins; with many, the true model wins. The complex model is never optimal here.
The Bayesian score requires choosing Dirichlet pseudocounts. An alternative is the Bayesian Information Criterion (BIC), which approximates the log marginal likelihood using the maximum likelihood and a closed-form complexity penalty:
Here ℓ(θ*, D, G) is the log-likelihood at the MLE parameters, dim(G) is the number of free parameters in the model, and m is the number of data points. The penalty term dim(G)/2 · log m grows with both model complexity and dataset size.
For binary variables in a graph G, the dimension is dim(G) = ∑i 2|pa(i)|. Each additional parent of node i doubles its CPT size. This exponential growth in parameters is the main reason we prefer sparse graphs.
The BIC is not the only penalized likelihood criterion. Three criteria dominate the structure learning literature:
| Criterion | Formula | Penalty Grows With | Consistency |
|---|---|---|---|
| AIC (Akaike Info. Criterion) | ℓ(θ*) − dim(G) | Only model complexity, not m | No — selects overly complex models asymptotically |
| BIC (Bayesian Info. Criterion) | ℓ(θ*) − dim(G)/2 · log m | Both model complexity and m | Yes — recovers true structure as m → ∞ |
| MDL (Minimum Description Length) | Code length of data + model | Both, but with universal prior | Yes — same asymptotic behavior as BIC |
The key difference between AIC and BIC: AIC penalizes each additional parameter by exactly 1 (constant), while BIC penalizes by log(m)/2 (grows with data). As m increases, BIC penalizes more aggressively, which causes it to prefer simpler models. This is why BIC is consistent (correct asymptotically) while AIC is not.
In practice: with small data (m < 100), AIC is often preferred because it allows more complex models that can fit the data well. With large data, BIC is preferred because it correctly identifies the structure that generated the data. The textbook uses BIC as the default because structure learning problems typically have large datasets in real applications.
One important subtlety: BIC and AIC are derived under the assumption that data points are i.i.d. (independent and identically distributed). For Bayesian networks, this means assuming that the rows of the data matrix are independent observations from the same joint distribution. If there is temporal autocorrelation (time-series data) or hidden confounders, BIC may select incorrectly even with large data. The textbook assumes i.i.d. data throughout Chapters 1–5; Chapter 13 addresses time-dependent data separately.
A numerical illustration: for a candidate edge that adds 1 free parameter and improves the log-likelihood by 3.5 nats:
The same edge that AIC accepts for any dataset size is rejected by BIC when we have 2000+ data points — because with enough data, we have statistical power to determine the edge isn't really there.
python import numpy as np from collections import Counter def bic_score(data, child, parents, r_child=2): """BIC score for child given parents (binary variables assumed).""" m = len(data) dim = r_child * (2 ** len(parents)) # free params per CPT row penalty = (dim / 2) * np.log(m) # Compute log-likelihood at MLE ll = 0.0 for row in data: pa_cfg = tuple(row[p] for p in parents) x_val = row[child] # Count occurrences of this (pa_cfg, x_val) pair n_pa = sum(1 for d in data if tuple(d[p] for p in parents) == pa_cfg) n_px = sum(1 for d in data if tuple(d[p] for p in parents) == pa_cfg and d[child] == x_val) if n_pa > 0: ll += np.log(n_px / n_pa + 1e-10) return ll - penalty
Drag the slider to see how the BIC penalty (dim/2 × log m) grows. More data punishes extra parameters more harshly. The teal line shows when the penalty exceeds the gain for a hypothetical edge.
A quick derivation of BIC: the marginal likelihood log P(D|G) is approximated by the Laplace approximation around the MLE. The Hessian of the log-likelihood contributes a term −(dim/2)log m + constant. The dominant terms are the MLE log-likelihood and this Hessian term, giving BIC = ℓ(θ*) − dim/2 · log m. The constant terms cancel when comparing models, which is why the textbook's formula omits them.
The BIC formula also illuminates why structure learning is harder with fewer data. With m=20 data points and a candidate edge adding 1 parameter to a binary CPT: BIC delta = log-likelihood gain − log(20)/2 ≈ LL gain − 1.5. A spurious edge that increases LL by 2 (due to noise, not signal) would be incorrectly accepted. With m=500: BIC delta = LL gain − log(500)/2 ≈ LL gain − 3.1 — the same spurious edge is now correctly rejected. This explains why structure learning requires large datasets (typically 5–10× the number of parameters in the network) to be reliable.
The K2 algorithm is one of the earliest practical structure learning methods. It requires a user-specified variable ordering and starts from an empty graph. For each variable (processed in order), it greedily adds the parent that most improves the score, stopping when no further addition helps.
K2 is fast — it runs in polynomial time — but depends critically on the variable ordering. If the true parent comes later in the ordering than its child, K2 can never find that edge. Different orderings can produce very different graphs.
K2 was introduced in "A Bayesian Method for the Induction of Probabilistic Networks from Data" (Cooper & Herskovits, Machine Learning, 1992). The paper was one of the first to treat Bayesian network structure learning as a Bayesian inference problem. The key insight: instead of searching for parameters given a fixed structure, we search for structures by computing their marginal likelihood.
The paper introduced the "BD score" (Bayesian Dirichlet), which is the same Bayesian score formula from Chapter 1 with a specific prior: the "parameter independence assumption" (each node's CPT is independent a priori) plus "parameter modularity" (the prior only depends on the local family, not the global graph). These two assumptions together yield the decomposable form: log P(D|G) = ∑i score(i, pa(i), D).
The original paper validated K2 on three benchmark datasets: the ALARM network (37 nodes, medical diagnosis), the expert-designed MUNIN network, and a small genetics application. On ALARM with 10,000 samples and the correct ordering, K2 recovered the true network exactly. This validation established Bayesian network structure learning as a practical tool and inspired decades of subsequent work.
The K2 algorithm specifically uses this prior. The key formula for node i with parent configuration j and child value k, with mijk observed counts and alpha pseudocounts spread uniformly: this integrates the Dirichlet-multinomial conjugate pair analytically, giving a closed-form expression in terms of log-gamma functions (see code above). The beauty is that no numerical integration is needed — the marginal likelihood has an exact formula.
The Python library pgmpy provides a complete K2 implementation. The key function is pgmpy.estimators.K2Score combined with pgmpy.estimators.HillClimbSearch. The R package bnlearn is the reference implementation, with rsmax2() for K2 and hc() for hill climbing. Both libraries implement the incremental cache update optimization and produce equivalent results to the textbook's formulation.
The Cooper & Herskovits K2 implementation used a hardcoded max-parent limit of u=2 in the original experiments. Modern libraries default to u=10 or no limit, relying on the score function's built-in regularization. If computational budget is a concern, setting u=3 or u=4 captures most true edges while keeping CPT sizes manageable: a node with 4 binary parents has 24×2 = 32 parameters — already a substantial CPT that requires many samples to estimate reliably.
The original K2 paper (Cooper & Herskovits, 1992) also limits the number of parents per node to a user-specified maximum u. This controls CPT size and prevents overfitting. K2 runs in O(n · u2) score evaluations — fast even for 50+ variables if u is small.
| K2 Design Choice | Effect on Result |
|---|---|
| Variable ordering | Determines which edges are possible; a bad ordering misses true edges entirely |
| Max parents u | Limits CPT size; prevents exponential parameter blowup |
| Score function | Bayesian (with α) or BIC both work; Bayesian score is more principled |
| Stopping criterion | No improvement OR u parents reached — whichever comes first |
Watch K2 greedily add edges to maximize the score. "Step" adds one best parent. "Shuffle" tries a new random ordering. Teal edges are correct; orange are spurious.
The Cooper & Herskovits K2 algorithm also introduced a practical heuristic for choosing the variable ordering: sort variables by the number of their likely parents (fewest parents first). This heuristic works because K2 can only add parents from earlier in the ordering — if a node is later in the ordering, more of its potential parents are available. In practice, topological sorting of a rough domain-expert graph is the standard approach for initializing the ordering.
Local search (hill climbing) needs no variable ordering. Starting from an initial graph (often empty), it repeatedly considers all neighboring graphs — those reachable by one of three basic operations — and moves to the highest-scoring one:
The search continues until no neighbor scores higher than the current graph. Since the Bayesian score decomposes node-by-node, evaluating each operation only requires recomputing scores for the two affected nodes — making each step cheap.
python def local_search(nodes, score_fn, max_iter=1000): """Hill-climbing structure search. score_fn(i, parents) returns local score.""" edges = set() # start from empty graph for _ in range(max_iter): best_delta, best_op = 0, None for i in nodes: for j in nodes: if i == j: continue # Try add i->j if (i, j) not in edges and not creates_cycle(edges, i, j): delta = score_fn(j, parents(j, edges) | {i}) - score_fn(j, parents(j, edges)) if delta > best_delta: best_delta, best_op = delta, ('add', i, j) # Try remove i->j if (i, j) in edges: delta = score_fn(j, parents(j, edges) - {i}) - score_fn(j, parents(j, edges)) if delta > best_delta: best_delta, best_op = delta, ('remove', i, j) if best_op is None: break # local optimum apply_op(edges, best_op) return edges
Naive local search evaluates all n2 possible edges at each step — O(n2) score evaluations per iteration, each costing O(m) to compute over m data points. For n=50 variables and m=10,000 data points, one step requires 2,500 score evaluations each scanning 10,000 rows: 25 million row reads per step. With 1,000 steps, that is 25 billion row reads. Prohibitively slow.
Two optimizations make local search tractable:
| Optimization | Time Saving | Implementation Cost |
|---|---|---|
| Precomputed count tables | Turns O(m) per candidate into O(riqi) lookup | Low — one-pass scan at start |
| Incremental cache update | Only update counts for affected node after each op | Medium — careful bookkeeping |
| Priority queue of candidates | Reduces to O(n) re-evaluations per step | Medium — standard heap operations |
| Early stopping on reversal | Only try reversals of existing edges, not all pairs | Low — check edge set before evaluating |
These optimizations are implemented in standard libraries. The pgmpy Python package implements the full optimized local search. The bnlearn R package is the reference implementation used in most structure learning benchmarks. Both can handle networks with 100+ nodes in minutes.
Here is a surprising fact: two very different-looking Bayesian networks can be statistically identical. The networks A → B and A ← B make the same conditional independence statement (A and B are dependent, no independences). Any dataset generated from one could equally well have been generated from the other with suitable parameters. We cannot tell edge directions apart from data alone in this case.
Two DAGs are Markov equivalent if and only if they encode the exact same set of conditional independence relationships. The textbook gives the elegant characterization (Verma & Pearl, 1990): two DAGs are Markov equivalent if and only if they have (1) the same skeleton (same edges ignoring direction), and (2) the same v-structures (immoralities).
The Bayesian score with BDe priors assigns identical scores to all members of the same Markov equivalence class. This means structure learning can recover the correct equivalence class but not always the exact DAG. V-structures and domain knowledge are needed to determine edge directions within a class.
Causal reasoning requires more than conditional independence. Suppose we learn the CPDAG A — B → C from data. We know B causes C (direction is certain), but we don't know if A causes B or B causes A. If we intervene on A (a randomized experiment), we can resolve this ambiguity: if changing A changes B, then A → B. If not, then A ← B. The CPDAG represents the limit of what observational data can tell us; the resolved DAG requires intervention. This is Pearl's fundamental argument for why causal inference is distinct from statistical inference.
A worked example: consider A → B → C, A ← B → C, and A → B ← C. The first two have the same skeleton (A—B, B—C) and no v-structures: they are Markov equivalent (both encode A ⊥ C | B). The third (A → B ← C) has a v-structure at B: A and C are marginally independent but become dependent given B. It forms its own equivalence class. Data can distinguish the third from the first two, but cannot distinguish the first from the second.
This has profound implications for causal discovery. Causality requires knowing edge directions. Observational data can recover the CPDAG (equivalence class) but not causal directions within it. Resolving directions requires: (1) domain knowledge (time ordering, physical constraints), (2) interventional data (randomized experiments), or (3) additional assumptions (non-Gaussian noise models, as in LiNGAM).
Graph 1 (left) is fixed as A→B→C. Flip edges in Graph 2 (right). Orange edges form v-structures. When the v-structures match, the graphs are Markov equivalent.
A completed partially directed acyclic graph (CPDAG), also called an essential graph, is a compact representation of an entire Markov equivalence class. It contains both directed and undirected edges: directed edges appear where all members of the class agree on the direction; undirected edges appear where the direction varies.
The textbook describes a key insight: instead of searching the space of DAGs and encountering many equivalent graphs with identical scores, we can search directly in the space of CPDAGs. This space is roughly 3.7 times smaller than the DAG space on average, reducing the number of equally-scoring plateaus that confuse hill climbing.
Greedy Equivalence Search (GES), developed by Chickering (2002) and Meek (1995), is the premier score-based structure learning algorithm that operates in CPDAG space. GES has three phases:
GES is asymptotically correct: under faithfulness, it recovers the true CPDAG as the data size goes to infinity. No other score-based algorithm has this guarantee in polynomial time.
The operations in CPDAG space are more complex than the simple add/remove/reverse operations in DAG space. The textbook describes operators based on Chickering (2002): each operator maintains the CPDAG representation invariant while moving between equivalence classes. The most important are:
| Operator | Effect | Condition for Validity |
|---|---|---|
| Insert(X, Y, T) | Add edge X — Y (or X → Y if directed) to create/modify a v-structure | X, Y not adjacent; T is a subset of neighbors of Y |
| Delete(X, Y, H) | Remove edge X — Y; H handles v-structure maintenance | X, Y adjacent; H is a subset of neighbors |
| Reverse(X, Y) | Turn X → Y into X ← Y within one equivalence class | Edge is currently directed; reversing stays in same class |
DAG space
CPDAG space
A CPDAG (left) with one directed and one undirected edge. Click "Show Members" to enumerate all DAGs in this equivalence class. Directed edges are shared; undirected edges vary.
Hill climbing is greedy: it always moves to the best neighbor and stops at the first local optimum. But local optima are not global optima. Studies show that structure learning landscapes are rugged — many local optima exist even for relatively small networks.
Four main strategies appear in the literature and textbook:
| Strategy | Mechanism | Pros / Cons |
|---|---|---|
| Random restarts | Run hill climbing from 10–50 random initial graphs; take the best | Simple, parallelizable; may hit same optima repeatedly |
| Simulated annealing | Accept worse moves with probability eΔ/T; decrease T over time | Can escape all optima given time; sensitive to cooling schedule |
| Tabu search | Forbid recently tried operations for k steps | Forces new exploration; tabu list length is a hyperparameter |
| Genetic algorithms | Maintain a population of graphs; crossover + mutate | Diverse exploration; crossing BN structures is non-trivial |
In benchmarks (Tsamardinos et al., 2006), random restarts with 20 starts consistently outperform more sophisticated strategies on most real datasets. The lesson: local optima for structure learning are less severe than expected, because the BIC score landscape is smoother than general combinatorial optimization. Use restarts first; add annealing only if restarts fail.
Several strategies help escape:
| Strategy | Key Idea | Tradeoff |
|---|---|---|
| Random restarts | Run multiple hill-climbs from random starting graphs; keep best result | Simple; embarrassingly parallel; O(k × search cost) |
| Simulated annealing | Accept worse moves with probability e−Δ/T where T decreases over time | More exploration early; requires tuning cooling schedule |
| Tabu search | Maintain a list of recently visited graphs; forbid revisiting them | Keeps momentum; needs memory for tabu list |
| Genetic algorithms | Maintain a population; combine (crossover) good graphs, mutate them | Population-based; complex to implement correctly |
Random restarts is the most widely used in practice. It is simple to implement, trivially parallelizable, and works well when the scoring landscape is not too rugged. The textbook recommends running at least 10–20 restarts for small problems.
Simulated annealing for structure search uses the same acceptance rule as the classical algorithm: at temperature T, a move with score change Δ is accepted with probability min(1, eΔ/T). The temperature decreases according to a cooling schedule, typically geometric: Tk+1 = c · Tk where c ∈ (0, 1) is the cooling rate. The initial temperature T0 is typically set so that the acceptance probability for a "typical" bad move is about 0.8 (early in the search, accept almost anything).
python import math, random def simulated_annealing_structure(nodes, score_fn, T0=10.0, cool=0.99, max_iter=5000): """Structure search with simulated annealing.""" edges = set() current_score = score_fn(edges) best_edges, best_score = set(edges), current_score T = T0 for step in range(max_iter): T *= cool # Propose a random operation a, b = random.sample(nodes, 2) new_edges = set(edges) if (a, b) in edges: new_edges.discard((a, b)) # remove elif (b, a) not in edges: new_edges.add((a, b)) # add else: new_edges.discard((b, a)); new_edges.add((a, b)) # reverse if has_cycle(new_edges): continue new_score = score_fn(new_edges) delta = new_score - current_score if delta > 0 or random.random() < math.exp(delta / max(T, 1e-10)): edges, current_score = new_edges, new_score if current_score > best_score: best_edges, best_score = set(edges), current_score return best_edges
Both search a rugged landscape with a local and global optimum. Greedy gets stuck at the local peak. Annealing can escape it. Click to run.
This is the full structure learning experience. You control a 5-node network. There is a hidden "true" graph (shown as ghost edges). The algorithm tries to recover it using hill climbing with restarts. Watch the Bayesian score climb as edges are added, removed, or reversed.
Try clicking "Step" to manually execute one operation. "Auto Run" runs 50 steps. "Restart" starts from a new random graph. The score display shows the current BIC approximation. See how many restarts it takes to find the true structure.
Ghost dashed yellow edges = true structure. Teal = correctly found. Orange = spurious. Adjust temperature to control how often bad moves are accepted.
| Concept | What We Learned | Where It Goes |
|---|---|---|
| Bayesian score | Marginal likelihood that integrates out parameters; built-in Occam's razor | Structure search objective |
| BIC score | Closed-form approximation; no prior pseudocounts needed | Common default in practice |
| K2 search | Greedy given an ordering; fast but ordering-dependent | Fast initial estimate |
| Local search | Hill climbing with add/remove/reverse; order-free | Most widely used approach |
| Markov equivalence | Different DAGs can encode same independences; only v-structures distinguish them | Causal discovery (requires experiments) |
| CPDAG search | Search equivalence classes to avoid score plateaus | PC algorithm, GES algorithm |
| Escaping optima | Restarts, annealing, tabu, genetic algorithms | General heuristic optimization |
The PC algorithm (Spirtes, Glymour & Scheines, 1993) takes a fundamentally different approach from score-based search. Instead of scoring graphs, it performs conditional independence tests on the data and constructs the CPDAG directly.
Under three assumptions — (1) the data-generating distribution is Markov to some DAG, (2) faithfulness (no exact cancellations), and (3) we use consistent independence tests — PC correctly recovers the true CPDAG in the infinite-data limit. With finite data, false positives in independence tests can cause errors.
The GES algorithm (Chickering, 2002) is a score-based method that searches directly in CPDAG space rather than DAG space. It runs in two phases:
| Phase | Operation | Effect |
|---|---|---|
| Forward | Insert: repeatedly add one edge to the CPDAG that maximally improves the score, until no improvement exists | Overfits (finds a dense structure) |
| Backward | Delete: repeatedly remove one edge from the CPDAG that improves the score, until no improvement | Prunes spurious edges |
GES is proven to consistently recover the true CPDAG in the large-sample limit for BIC scoring. Unlike PC, it doesn't need explicit independence tests. In practice, GES often outperforms PC on medium-dimensional datasets.
The Bayesian score embodies this exactly: the simplest model that explains the data wins.
With so many algorithms to choose from, how do you pick? The answer depends on your priorities:
| Priority | Best Algorithm | Reasoning |
|---|---|---|
| Speed with small data | K2 with expert ordering | Fastest; domain experts often know rough causal ordering |
| No prior knowledge available | Local search (hill climbing) with 10+ restarts | Order-free; works on any dataset |
| Statistical rigor (consistency) | GES (Chickering 2002) or PC algorithm | Consistent estimators: recover true CPDAG asymptotically |
| Very large networks (100+ nodes) | FGES (Fast GES) or constraint-based with sparse tests | O(n2) or better; scales to thousands of variables |
| Causal discovery (not just correlation) | PC algorithm + domain expert review of directed edges | PC recovers orientable edges; experts resolve the rest |
| Missing data | EM-based score + local search | Impute missing values inside the scoring loop |
For most real applications, the textbook recommends: (1) run local search with BIC from 10–20 random restarts, taking the best result, (2) validate the learned structure against domain knowledge, (3) flip any edges that domain experts identify as directionally wrong. This practical approach outperforms any single algorithm run once.