GraphRNN — Veanors

Chapter 0: The Problem

Imagine you've studied a thousand protein-protein interaction networks from biology, or a thousand road networks from different cities, or a thousand molecules from a drug database. You want a model that generates new graphs that look like they came from the same distribution — plausible proteins, realistic road layouts, valid molecules.

This is the graph generation problem: given examples of graphs drawn from some unknown distribution, learn that distribution so you can sample new graphs from it.

Why it's hard: Unlike images (grids of pixels) or text (sequences of tokens), graphs have variable size and no canonical ordering. A graph with 10 nodes can be represented as 10! different adjacency matrices depending on how you label the nodes. A generative model must somehow be invariant to this relabeling.

The Landscape Before GraphRNN

Random graph models (Erdős-Rényi, Barabási-Albert): Simple rules like "each edge exists with probability p" or "new nodes prefer high-degree neighbors." These produce graphs with specific properties but can't capture complex real-world structure learned from data.

Variational Autoencoders (VGAE): Encode a graph into a fixed-size latent vector, then decode. But decoding an N×N adjacency matrix all at once is O(N²) parameters — and generating graphs with varying N is awkward. These also struggle with validity constraints.

Generative Adversarial Networks (MolGAN, GraphGAN): Output entire adjacency matrices at once. Training instability plus mode collapse. For large graphs, generating all N² edges simultaneously requires enormous networks.

You et al.'s key insight: treat graph generation as a sequential decision process. Add one node at a time. For each new node, decide which existing nodes to connect it to. This turns graph generation into sequence generation — something RNNs are excellent at.

The Graph Generation Challenge

These 4 graphs all have 6 nodes and 7 edges, but they look completely different. A good generative model must capture which pattern of connectivity is realistic — not just node count.

What is the fundamental challenge that makes graph generation harder than text generation?

Graphs contain too many nodes to process efficiently Graph edges are harder to represent than text tokens Graphs have no canonical node ordering, so the same graph has N! valid matrix representations

Chapter 1: Autoregressive Decomposition

GraphRNN's first contribution is showing that graph generation can be framed as a standard sequence modeling problem — if you choose the right ordering.

Turning a Graph into a Sequence

Fix a node ordering: label the nodes 1, 2, 3, ..., N in some order. Now build the graph incrementally: first add node 1, then add node 2 (and decide whether it connects to node 1), then add node 3 (and decide whether it connects to nodes 1 and 2), and so on.

At step i, you add node i and observe a binary vector S_i of length i−1: S_i[j] = 1 if there is an edge from node i to node j. After all N steps, you've fully specified the graph. The adjacency sequence is (S₁, S₂, ..., S_N).

p(G) = p(S₁, S₂, ..., S_N) = ∏_i=1^N p(S_i | S₁, ..., S_i-1)

This is just the chain rule of probability. Any valid probability distribution over graphs can be decomposed this way. The generative model just needs to learn the conditional distribution p(S_i | S₁, ..., S_i-1) — the probability of the next node's connections given everything added so far.

S_i is itself a sequence: For node i, you need to decide whether it connects to node 1, then node 2, ..., then node i-1. That's i-1 binary decisions. As the graph grows, each new node's adjacency row gets longer. This nested structure — one sequence per row — is what leads to the two-level RNN design.

The Two Sequences

Outer sequence

(S₁, S₂, ..., S_N) — one element per node added, N total steps

↓ each S_i is itself a sequence

Inner sequence

S_i = (s_i,1, s_i,2, ..., s_i,i-1) — one binary bit per potential edge, i−1 total steps

↓ treat each with an RNN

Two-level RNN

Outer RNN tracks "graph so far", Inner RNN generates adjacency row for new node

Why Not a Single Flat Sequence?

You could flatten everything into one long binary sequence: the upper triangle of the adjacency matrix, read row by row. This gives N(N−1)/2 binary decisions. A single RNN over this flat sequence is GraphRNN-S (the simpler baseline in the paper).

The problem: the RNN must maintain a hidden state that captures "which node row am I in?" and "which column within the row?" simultaneously. The two-level architecture makes these responsibilities explicit — the outer RNN handles rows, the inner RNN handles columns within a row.

When GraphRNN generates node 5 (the 5th node added), how many binary decisions does it make for that node's adjacency row?

5 decisions (one for each node including itself) 4 decisions (one for each previously added node: 1, 2, 3, 4) N decisions where N is the final graph size

Chapter 2: Two-Level RNN (Showcase)

The architecture has two RNNs: a graph-level RNN (h-RNN) that tracks the state of the graph as a whole, and an edge-level RNN (e-RNN) that generates the adjacency row for each new node.

Graph-Level RNN (h-RNN)

The h-RNN maintains a hidden state h_i that summarizes all nodes and edges added so far. After generating node i's full adjacency row S_i, the h-RNN updates: h_i+1 = f_trans(h_i, S_i). The final hidden state of the e-RNN (for step i) becomes the input to the h-RNN for the next transition.

Edge-Level RNN (e-RNN)

To generate node i's adjacency row, the e-RNN is initialized with h_i and then runs for i−1 steps. At each step j, it outputs a probability θ_i,j = p(edge from i to j | graph so far, edges i→1, ..., i→(j-1)). You sample s_i,j ∈ {0,1} from Bernoulli(θ_i,j) and feed it as input to the next step.

h^(e)_i,0 = h_i (initialized from h-RNN) θ_i,j = σ(MLP(h^(e)_i,j)) s_i,j ~ Bernoulli(θ_i,j) h^(e)_i,j+1 = e-RNN(h^(e)_i,j, s_i,j)

The information highway: h_i is both the initialization for e-RNN AND the summary that h-RNN reads after e-RNN finishes. All information flows through these hidden states. The e-RNN's final hidden state becomes the new "S_i summary" fed into h-RNN, making h_i+1 aware of which edges were actually added.

Two-Level RNN — Step-by-Step Generation

Watch GraphRNN build a graph node-by-node. The h-RNN state (top bar) updates after each node. The e-RNN (bottom) generates each adjacency row. Click Next Node to add a node.

Press "Next Node" to start generating a graph.

Stopping Criterion

How does the model know when to stop adding nodes? The paper adds a special end-of-sequence (EOS) token. After generating each adjacency row S_i, the h-RNN also outputs a probability p_stop. If sampled as stop, generation ends. During training, this is supervised by the actual graph size.

What does the h-RNN (graph-level RNN) receive as input when transitioning from node i to node i+1?

The adjacency matrix of the entire graph so far Only the degree of node i The final hidden state of the e-RNN after generating node i's adjacency row

Chapter 3: BFS Ordering

A graph with N nodes has N! possible orderings. The autoregressive decomposition requires fixing one ordering — but which? This choice has massive practical consequences.

The Sequence Length Problem

With random ordering, node i's adjacency row has length i−1. For a graph with 1,000 nodes, the last row has 999 entries. Total sequence length: N(N−1)/2 ≈ 500,000 steps. This is far too long for an RNN to handle well — gradients vanish, memory explodes.

But here's the key observation: in most real graphs, nodes have low degree. In a social network with 1,000 nodes, each person is friends with maybe 20 others — not 500. If node i's neighbors all appear before it in the ordering, then most of the i−1 decisions in S_i are trivially 0. We only need to track the entries up to the last 1.

BFS gives locality: In a BFS ordering, a node's neighbors tend to appear nearby in the sequence. This means the last edge in S_i is never far from position i. The "look-back window" needed is small — bounded by the maximum BFS bandwidth M. Instead of sequences of length N(N−1)/2, we get sequences of total length roughly N × M.

The BFS Trick

Run a Breadth-First Search (BFS) from a random start node. This produces a node ordering where nodes tend to be adjacent to nearby nodes in the ordering. Formally, let π be the BFS ordering. For node π(i), the adjacency row only needs to extend back M positions — where M is the BFS "bandwidth" (the maximum number of nodes at any BFS frontier).

For many real-world graphs, M is O(√N) or smaller. This reduces sequence length from O(N²) to O(N√N) — a huge saving that makes training feasible.

BFS vs Random Ordering — Sequence Length

For the same graph, BFS ordering (orange) produces far shorter effective sequences than random ordering (teal). The window M is the key compression factor.

Graph size N 40

Avg degree k 4

Why does BFS ordering reduce the effective sequence length for GraphRNN?

BFS produces graphs with fewer edges, so less to generate BFS merges nodes with identical neighborhoods BFS places connected nodes nearby in the ordering, so each adjacency row only needs a short look-back window M

Chapter 4: Training

Training GraphRNN is teacher forcing: at each step, feed the ground-truth edge (0 or 1) as input — not the model's own sample. This makes training stable. At inference, you switch to sampling from the model's output distribution.

The Loss Function

Each step of the e-RNN outputs a probability θ_i,j ∈ (0,1). The ground-truth is the actual edge s_i,j ∈ {0,1}. The loss is binary cross-entropy summed over all steps:

L = −∑_i∑_j<i [ s_i,j log θ_i,j + (1 − s_i,j) log(1 − θ_i,j) ]

This is equivalent to maximizing the log-likelihood of the training graphs under the autoregressive model. Because most entries in S_i are 0 (sparse graphs), the model quickly learns to predict 0 most of the time — the hard part is getting the 1s right.

Class imbalance: In a sparse graph with N=100 nodes and average degree 4, there are ~200 edges out of ~4,950 possible. The model sees ~97.5% zeros. The BCE loss naturally handles this — predicting all-zeros gives a finite (bad) loss because log(1−1) = −∞ at the true edges. But in practice, you may want to upweight the positive edges.

The Two GraphRNN Variants

GraphRNN-S (Simple)
Uses a single GRU for both levels — just a flat sequence over the upper triangle entries. The hidden state must track both "which row" and "which column within the row" simultaneously.

Parameters: fewer, faster to train. Baseline.

GraphRNN (Full)
Two separate GRUs: h-GRU for the graph level and e-GRU for the edge level. Cleaner decomposition. The h-GRU uses all edge decisions from e-GRU as context.

Parameters: more, better structure capture.

Data Augmentation via Multiple BFS Orderings

Since BFS ordering is not unique (depends on the start node and tie-breaking), each training graph can be converted into several different sequences by running BFS from different start nodes. This provides natural data augmentation and makes the model more robust to ordering choices at inference time.

python
import networkx as nx
import random

def graph_to_bfs_sequence(G):
    # Pick random BFS root
    start = random.choice(list(G.nodes()))
    bfs_order = [n for n, _ in nx.bfs_edges(G, start)]
    bfs_order = [start] + bfs_order

    # Re-index nodes by BFS order
    mapping = {node: i for i, node in enumerate(bfs_order)}
    G_reindexed = nx.relabel_nodes(G, mapping)

    # Extract adjacency rows (upper triangle only)
    sequences = []
    for i in range(len(bfs_order)):
        row = [1 if G_reindexed.has_edge(i, j) else 0
               for j in range(i)]
        sequences.append(row)

    return sequences  # List of binary rows

During GraphRNN training, what is fed as input to the e-RNN at each step (rather than the model's own sample)?

The model's sampled edge (0 or 1) — to simulate inference The ground-truth edge from the training graph — this is called teacher forcing The edge probability θ as a continuous value

Chapter 5: Evaluation with MMD

How do you tell if generated graphs are "realistic"? You can't use standard metrics like accuracy — there's no single correct answer. The paper uses Maximum Mean Discrepancy (MMD), a statistical test for comparing two distributions without requiring explicit density estimation.

What MMD Measures

Given a set of real graphs and a set of generated graphs, MMD measures how different their distributions are. It works by comparing graph-level statistics: degree distributions, clustering coefficient distributions, and orbit count distributions.

MMD²(P, Q) = E_x,x'~P[k(x,x')] − 2E_x~P,y~Q[k(x,y)] + E_y,y'~Q[k(y,y')]

where k is a kernel function (Gaussian in the paper). A lower MMD means the generated distribution is closer to the real one. MMD = 0 would mean perfect replication.

The three statistics used: (1) Degree distribution — does the generated graph have similar degree sequences? (2) Clustering coefficient distribution — how cliquey are nodes? (3) Orbit count distribution — frequency of small subgraph patterns (4-node graphlets). These capture structure at increasing scales.

Why Not Just Visual Inspection?

Graphs are hard to compare visually — two graphs with very different aesthetics can have identical statistical properties. And vice versa: graphs that look similar might have very different clustering or motif structures. MMD over statistics is the honest way to benchmark.

MMD Intuition — Comparing Degree Distributions

Two sets of graphs (real vs generated). Their degree distributions are shown as histograms. A good model has overlapping histograms — low MMD. Adjust the "generation bias" to see how MMD changes.

Degree bias 0.0

MMD ≈ 0.00

What does a lower MMD score between real and generated graphs indicate?

The generated graphs are more statistically similar to the real graphs The generated graphs have fewer edges on average The model was trained for more epochs

Chapter 6: Results

GraphRNN was evaluated on four graph families: Ladder graphs (synthetic, highly regular), 2D grids (synthetic), Barabási-Albert (BA) graphs (scale-free), and protein graphs from the DD dataset (real biological data, up to 500 nodes).

Baselines Compared

Kronecker Graph Model (KronEM) — fits a probabilistic Kronecker product to the adjacency matrix. Assumes a specific parametric form; can't capture arbitrary local structure.

MMSB — Mixed-Membership Stochastic Block Model. Assumes a latent community structure. Good for block-like graphs; struggles with irregular structure.

DeepGMG — concurrent deep generative model using graph neural networks. Also sequential, but uses a GNN to compute state instead of an RNN.

GraphVAE — VAE over adjacency matrices. Limited to small fixed-size graphs. Can't scale to 500-node proteins.

Key Numbers

Method	Ladder (Deg MMD)	Grid (Clust MMD)	Protein (Orbit MMD)
Kronecker	0.058	0.231	0.426
GraphVAE	0.350	0.061	—
DeepGMG	0.040	0.220	—
GraphRNN	0.001	0.003	0.040

BFS ordering wins decisively: The full GraphRNN model with BFS ordering outperforms GraphRNN-S (the simple flat-sequence version) and all baselines on nearly every metric. The BFS trick is not just an efficiency optimization — it actually improves quality by making the learning problem easier (shorter effective sequences, clearer local structure).

Scalability

GraphRNN trains on graphs with up to 500 nodes (protein graphs). GraphVAE is limited to ~40 nodes (O(N²) parameters in decoder). KronEM fits a fixed-size template. GraphRNN scales because sequence length grows linearly in N (with BFS truncation) rather than quadratically.

On which graph type does GraphRNN show the most dramatic improvement over baselines?

Ladder and Grid graphs — GraphRNN achieves near-zero MMD while baselines have 0.05–0.35 Barabási-Albert graphs — where scale-free structure is hardest to capture All methods perform similarly; GraphRNN is just more scalable

Chapter 7: Connections & Beyond

GraphRNN established the autoregressive paradigm for graph generation. Many subsequent papers build on or react to it.

Limitations of GraphRNN

No node features: GraphRNN generates only graph topology — it doesn't produce feature vectors for nodes. For molecules, you need atom types, charges, bond types. The paper doesn't address this.

No validity constraints: A generated graph might violate domain rules (e.g., valence limits in molecules). GraphRNN doesn't know or enforce these — pure statistical generation.

Ordering sensitivity: Despite BFS, the model still depends on the ordering. Different orderings give different conditional factorizations, and averaging over all orderings during training would be intractable.

Slow sampling: O(N²) sequential steps to generate one graph. Each step requires an RNN forward pass. For large N, this is slow.

What Came Next

Paper	Key Idea	Addresses
GCPN (You et al., 2018)	RL-guided generation with GCN policy	Property optimization, validity
JT-VAE (Jin et al., 2018)	Hierarchical junction tree decomposition	Validity (always), node features
GRAN (Liao et al., 2019)	Graph-recurrent attention network, faster	Speed, scalability
DiGress (Vignac et al., 2022)	Discrete diffusion over graphs	Non-sequential, parallel generation

The lasting contribution: GraphRNN proved that graph generation can be reduced to sequence generation with the right decomposition — and that this gives dramatically better empirical results than previous approaches. The BFS ordering insight is independently useful and has been adopted or adapted in many follow-up works.

Related Lessons

This paper connects to the following GNN papers in Veanors:

GCN — the foundational graph convolution that GCPN builds on
GCPN — adds RL and chemical validity to graph generation
JT-VAE — hierarchical molecular generation with guaranteed validity
GAT — attention-based GNN, used in later graph generators

"To generate complex structured objects, decompose into sequential decisions — then the rich machinery of sequence models is at your disposal."
— Paraphrase of the GraphRNN design philosophy

GraphRNN: Generating Realistic Graphs with Auto-Regressive Models