You et al. — Stanford University, ICML 2018

GraphRNN: Generating Realistic Graphs with Auto-Regressive Models

Two nested RNNs that grow a graph node-by-node, edge-by-edge — treating graph generation as a sequence modeling problem.

Prerequisites: RNNs / LSTMs + Graph basics + Probability distributions
8
Chapters
4+
Simulations

Chapter 0: The Problem

Imagine you've studied a thousand protein-protein interaction networks from biology, or a thousand road networks from different cities, or a thousand molecules from a drug database. You want a model that generates new graphs that look like they came from the same distribution — plausible proteins, realistic road layouts, valid molecules.

This is the graph generation problem: given examples of graphs drawn from some unknown distribution, learn that distribution so you can sample new graphs from it.

Why it's hard: Unlike images (grids of pixels) or text (sequences of tokens), graphs have variable size and no canonical ordering. A graph with 10 nodes can be represented as 10! different adjacency matrices depending on how you label the nodes. A generative model must somehow be invariant to this relabeling.

The Landscape Before GraphRNN

Random graph models (Erdős-Rényi, Barabási-Albert): Simple rules like "each edge exists with probability p" or "new nodes prefer high-degree neighbors." These produce graphs with specific properties but can't capture complex real-world structure learned from data.

Variational Autoencoders (VGAE): Encode a graph into a fixed-size latent vector, then decode. But decoding an N×N adjacency matrix all at once is O(N²) parameters — and generating graphs with varying N is awkward. These also struggle with validity constraints.

Generative Adversarial Networks (MolGAN, GraphGAN): Output entire adjacency matrices at once. Training instability plus mode collapse. For large graphs, generating all N² edges simultaneously requires enormous networks.

You et al.'s key insight: treat graph generation as a sequential decision process. Add one node at a time. For each new node, decide which existing nodes to connect it to. This turns graph generation into sequence generation — something RNNs are excellent at.

The Graph Generation Challenge

These 4 graphs all have 6 nodes and 7 edges, but they look completely different. A good generative model must capture which pattern of connectivity is realistic — not just node count.

What is the fundamental challenge that makes graph generation harder than text generation?

Chapter 1: Autoregressive Decomposition

GraphRNN's first contribution is showing that graph generation can be framed as a standard sequence modeling problem — if you choose the right ordering.

Turning a Graph into a Sequence

Fix a node ordering: label the nodes 1, 2, 3, ..., N in some order. Now build the graph incrementally: first add node 1, then add node 2 (and decide whether it connects to node 1), then add node 3 (and decide whether it connects to nodes 1 and 2), and so on.

At step i, you add node i and observe a binary vector Si of length i−1: Si[j] = 1 if there is an edge from node i to node j. After all N steps, you've fully specified the graph. The adjacency sequence is (S1, S2, ..., SN).

p(G) = p(S1, S2, ..., SN) = ∏i=1N p(Si | S1, ..., Si-1)

This is just the chain rule of probability. Any valid probability distribution over graphs can be decomposed this way. The generative model just needs to learn the conditional distribution p(Si | S1, ..., Si-1) — the probability of the next node's connections given everything added so far.

Si is itself a sequence: For node i, you need to decide whether it connects to node 1, then node 2, ..., then node i-1. That's i-1 binary decisions. As the graph grows, each new node's adjacency row gets longer. This nested structure — one sequence per row — is what leads to the two-level RNN design.

The Two Sequences

Outer sequence
(S1, S2, ..., SN) — one element per node added, N total steps
↓ each Si is itself a sequence
Inner sequence
Si = (si,1, si,2, ..., si,i-1) — one binary bit per potential edge, i−1 total steps
↓ treat each with an RNN
Two-level RNN
Outer RNN tracks "graph so far", Inner RNN generates adjacency row for new node

Why Not a Single Flat Sequence?

You could flatten everything into one long binary sequence: the upper triangle of the adjacency matrix, read row by row. This gives N(N−1)/2 binary decisions. A single RNN over this flat sequence is GraphRNN-S (the simpler baseline in the paper).

The problem: the RNN must maintain a hidden state that captures "which node row am I in?" and "which column within the row?" simultaneously. The two-level architecture makes these responsibilities explicit — the outer RNN handles rows, the inner RNN handles columns within a row.

When GraphRNN generates node 5 (the 5th node added), how many binary decisions does it make for that node's adjacency row?

Chapter 2: Two-Level RNN (Showcase)

The architecture has two RNNs: a graph-level RNN (h-RNN) that tracks the state of the graph as a whole, and an edge-level RNN (e-RNN) that generates the adjacency row for each new node.

Graph-Level RNN (h-RNN)

The h-RNN maintains a hidden state hi that summarizes all nodes and edges added so far. After generating node i's full adjacency row Si, the h-RNN updates: hi+1 = ftrans(hi, Si). The final hidden state of the e-RNN (for step i) becomes the input to the h-RNN for the next transition.

Edge-Level RNN (e-RNN)

To generate node i's adjacency row, the e-RNN is initialized with hi and then runs for i−1 steps. At each step j, it outputs a probability θi,j = p(edge from i to j | graph so far, edges i→1, ..., i→(j-1)). You sample si,j ∈ {0,1} from Bernoulli(θi,j) and feed it as input to the next step.

h(e)i,0 = hi (initialized from h-RNN) θi,j = σ(MLP(h(e)i,j)) si,j ~ Bernoulli(θi,j) h(e)i,j+1 = e-RNN(h(e)i,j, si,j)
The information highway: hi is both the initialization for e-RNN AND the summary that h-RNN reads after e-RNN finishes. All information flows through these hidden states. The e-RNN's final hidden state becomes the new "Si summary" fed into h-RNN, making hi+1 aware of which edges were actually added.
Two-Level RNN — Step-by-Step Generation

Watch GraphRNN build a graph node-by-node. The h-RNN state (top bar) updates after each node. The e-RNN (bottom) generates each adjacency row. Click Next Node to add a node.

Press "Next Node" to start generating a graph.

Stopping Criterion

How does the model know when to stop adding nodes? The paper adds a special end-of-sequence (EOS) token. After generating each adjacency row Si, the h-RNN also outputs a probability pstop. If sampled as stop, generation ends. During training, this is supervised by the actual graph size.

What does the h-RNN (graph-level RNN) receive as input when transitioning from node i to node i+1?

Chapter 3: BFS Ordering

A graph with N nodes has N! possible orderings. The autoregressive decomposition requires fixing one ordering — but which? This choice has massive practical consequences.

The Sequence Length Problem

With random ordering, node i's adjacency row has length i−1. For a graph with 1,000 nodes, the last row has 999 entries. Total sequence length: N(N−1)/2 ≈ 500,000 steps. This is far too long for an RNN to handle well — gradients vanish, memory explodes.

But here's the key observation: in most real graphs, nodes have low degree. In a social network with 1,000 nodes, each person is friends with maybe 20 others — not 500. If node i's neighbors all appear before it in the ordering, then most of the i−1 decisions in Si are trivially 0. We only need to track the entries up to the last 1.

BFS gives locality: In a BFS ordering, a node's neighbors tend to appear nearby in the sequence. This means the last edge in Si is never far from position i. The "look-back window" needed is small — bounded by the maximum BFS bandwidth M. Instead of sequences of length N(N−1)/2, we get sequences of total length roughly N × M.

The BFS Trick

Run a Breadth-First Search (BFS) from a random start node. This produces a node ordering where nodes tend to be adjacent to nearby nodes in the ordering. Formally, let π be the BFS ordering. For node π(i), the adjacency row only needs to extend back M positions — where M is the BFS "bandwidth" (the maximum number of nodes at any BFS frontier).

For many real-world graphs, M is O(√N) or smaller. This reduces sequence length from O(N²) to O(N√N) — a huge saving that makes training feasible.

BFS vs Random Ordering — Sequence Length

For the same graph, BFS ordering (orange) produces far shorter effective sequences than random ordering (teal). The window M is the key compression factor.

Graph size N 40
Avg degree k 4
Why does BFS ordering reduce the effective sequence length for GraphRNN?

Chapter 4: Training

Training GraphRNN is teacher forcing: at each step, feed the ground-truth edge (0 or 1) as input — not the model's own sample. This makes training stable. At inference, you switch to sampling from the model's output distribution.

The Loss Function

Each step of the e-RNN outputs a probability θi,j ∈ (0,1). The ground-truth is the actual edge si,j ∈ {0,1}. The loss is binary cross-entropy summed over all steps:

L = −∑ij<i [ si,j log θi,j + (1 − si,j) log(1 − θi,j) ]

This is equivalent to maximizing the log-likelihood of the training graphs under the autoregressive model. Because most entries in Si are 0 (sparse graphs), the model quickly learns to predict 0 most of the time — the hard part is getting the 1s right.

Class imbalance: In a sparse graph with N=100 nodes and average degree 4, there are ~200 edges out of ~4,950 possible. The model sees ~97.5% zeros. The BCE loss naturally handles this — predicting all-zeros gives a finite (bad) loss because log(1−1) = −∞ at the true edges. But in practice, you may want to upweight the positive edges.

The Two GraphRNN Variants

GraphRNN-S (Simple)
Uses a single GRU for both levels — just a flat sequence over the upper triangle entries. The hidden state must track both "which row" and "which column within the row" simultaneously.

Parameters: fewer, faster to train. Baseline.
GraphRNN (Full)
Two separate GRUs: h-GRU for the graph level and e-GRU for the edge level. Cleaner decomposition. The h-GRU uses all edge decisions from e-GRU as context.

Parameters: more, better structure capture.

Data Augmentation via Multiple BFS Orderings

Since BFS ordering is not unique (depends on the start node and tie-breaking), each training graph can be converted into several different sequences by running BFS from different start nodes. This provides natural data augmentation and makes the model more robust to ordering choices at inference time.

python
import networkx as nx
import random

def graph_to_bfs_sequence(G):
    # Pick random BFS root
    start = random.choice(list(G.nodes()))
    bfs_order = [n for n, _ in nx.bfs_edges(G, start)]
    bfs_order = [start] + bfs_order

    # Re-index nodes by BFS order
    mapping = {node: i for i, node in enumerate(bfs_order)}
    G_reindexed = nx.relabel_nodes(G, mapping)

    # Extract adjacency rows (upper triangle only)
    sequences = []
    for i in range(len(bfs_order)):
        row = [1 if G_reindexed.has_edge(i, j) else 0
               for j in range(i)]
        sequences.append(row)

    return sequences  # List of binary rows
During GraphRNN training, what is fed as input to the e-RNN at each step (rather than the model's own sample)?

Chapter 5: Evaluation with MMD

How do you tell if generated graphs are "realistic"? You can't use standard metrics like accuracy — there's no single correct answer. The paper uses Maximum Mean Discrepancy (MMD), a statistical test for comparing two distributions without requiring explicit density estimation.

What MMD Measures

Given a set of real graphs and a set of generated graphs, MMD measures how different their distributions are. It works by comparing graph-level statistics: degree distributions, clustering coefficient distributions, and orbit count distributions.

MMD2(P, Q) = Ex,x'~P[k(x,x')] − 2Ex~P,y~Q[k(x,y)] + Ey,y'~Q[k(y,y')]

where k is a kernel function (Gaussian in the paper). A lower MMD means the generated distribution is closer to the real one. MMD = 0 would mean perfect replication.

The three statistics used: (1) Degree distribution — does the generated graph have similar degree sequences? (2) Clustering coefficient distribution — how cliquey are nodes? (3) Orbit count distribution — frequency of small subgraph patterns (4-node graphlets). These capture structure at increasing scales.

Why Not Just Visual Inspection?

Graphs are hard to compare visually — two graphs with very different aesthetics can have identical statistical properties. And vice versa: graphs that look similar might have very different clustering or motif structures. MMD over statistics is the honest way to benchmark.

MMD Intuition — Comparing Degree Distributions

Two sets of graphs (real vs generated). Their degree distributions are shown as histograms. A good model has overlapping histograms — low MMD. Adjust the "generation bias" to see how MMD changes.

Degree bias 0.0
MMD ≈ 0.00
What does a lower MMD score between real and generated graphs indicate?

Chapter 6: Results

GraphRNN was evaluated on four graph families: Ladder graphs (synthetic, highly regular), 2D grids (synthetic), Barabási-Albert (BA) graphs (scale-free), and protein graphs from the DD dataset (real biological data, up to 500 nodes).

Baselines Compared

Kronecker Graph Model (KronEM) — fits a probabilistic Kronecker product to the adjacency matrix. Assumes a specific parametric form; can't capture arbitrary local structure.

MMSB — Mixed-Membership Stochastic Block Model. Assumes a latent community structure. Good for block-like graphs; struggles with irregular structure.
DeepGMG — concurrent deep generative model using graph neural networks. Also sequential, but uses a GNN to compute state instead of an RNN.

GraphVAE — VAE over adjacency matrices. Limited to small fixed-size graphs. Can't scale to 500-node proteins.

Key Numbers

MethodLadder (Deg MMD)Grid (Clust MMD)Protein (Orbit MMD)
Kronecker0.0580.2310.426
GraphVAE0.3500.061
DeepGMG0.0400.220
GraphRNN0.0010.0030.040
BFS ordering wins decisively: The full GraphRNN model with BFS ordering outperforms GraphRNN-S (the simple flat-sequence version) and all baselines on nearly every metric. The BFS trick is not just an efficiency optimization — it actually improves quality by making the learning problem easier (shorter effective sequences, clearer local structure).

Scalability

GraphRNN trains on graphs with up to 500 nodes (protein graphs). GraphVAE is limited to ~40 nodes (O(N²) parameters in decoder). KronEM fits a fixed-size template. GraphRNN scales because sequence length grows linearly in N (with BFS truncation) rather than quadratically.

On which graph type does GraphRNN show the most dramatic improvement over baselines?

Chapter 7: Connections & Beyond

GraphRNN established the autoregressive paradigm for graph generation. Many subsequent papers build on or react to it.

Limitations of GraphRNN

No node features: GraphRNN generates only graph topology — it doesn't produce feature vectors for nodes. For molecules, you need atom types, charges, bond types. The paper doesn't address this.

No validity constraints: A generated graph might violate domain rules (e.g., valence limits in molecules). GraphRNN doesn't know or enforce these — pure statistical generation.

Ordering sensitivity: Despite BFS, the model still depends on the ordering. Different orderings give different conditional factorizations, and averaging over all orderings during training would be intractable.

Slow sampling: O(N²) sequential steps to generate one graph. Each step requires an RNN forward pass. For large N, this is slow.

What Came Next

PaperKey IdeaAddresses
GCPN (You et al., 2018)RL-guided generation with GCN policyProperty optimization, validity
JT-VAE (Jin et al., 2018)Hierarchical junction tree decompositionValidity (always), node features
GRAN (Liao et al., 2019)Graph-recurrent attention network, fasterSpeed, scalability
DiGress (Vignac et al., 2022)Discrete diffusion over graphsNon-sequential, parallel generation
The lasting contribution: GraphRNN proved that graph generation can be reduced to sequence generation with the right decomposition — and that this gives dramatically better empirical results than previous approaches. The BFS ordering insight is independently useful and has been adopted or adapted in many follow-up works.

Related Lessons

This paper connects to the following GNN papers in Veanors:

"To generate complex structured objects, decompose into sequential decisions — then the rich machinery of sequence models is at your disposal."
— Paraphrase of the GraphRNN design philosophy