Two nested RNNs that grow a graph node-by-node, edge-by-edge — treating graph generation as a sequence modeling problem.
Imagine you've studied a thousand protein-protein interaction networks from biology, or a thousand road networks from different cities, or a thousand molecules from a drug database. You want a model that generates new graphs that look like they came from the same distribution — plausible proteins, realistic road layouts, valid molecules.
This is the graph generation problem: given examples of graphs drawn from some unknown distribution, learn that distribution so you can sample new graphs from it.
Random graph models (Erdős-Rényi, Barabási-Albert): Simple rules like "each edge exists with probability p" or "new nodes prefer high-degree neighbors." These produce graphs with specific properties but can't capture complex real-world structure learned from data.
Variational Autoencoders (VGAE): Encode a graph into a fixed-size latent vector, then decode. But decoding an N×N adjacency matrix all at once is O(N²) parameters — and generating graphs with varying N is awkward. These also struggle with validity constraints.
Generative Adversarial Networks (MolGAN, GraphGAN): Output entire adjacency matrices at once. Training instability plus mode collapse. For large graphs, generating all N² edges simultaneously requires enormous networks.
You et al.'s key insight: treat graph generation as a sequential decision process. Add one node at a time. For each new node, decide which existing nodes to connect it to. This turns graph generation into sequence generation — something RNNs are excellent at.
These 4 graphs all have 6 nodes and 7 edges, but they look completely different. A good generative model must capture which pattern of connectivity is realistic — not just node count.
GraphRNN's first contribution is showing that graph generation can be framed as a standard sequence modeling problem — if you choose the right ordering.
Fix a node ordering: label the nodes 1, 2, 3, ..., N in some order. Now build the graph incrementally: first add node 1, then add node 2 (and decide whether it connects to node 1), then add node 3 (and decide whether it connects to nodes 1 and 2), and so on.
At step i, you add node i and observe a binary vector Si of length i−1: Si[j] = 1 if there is an edge from node i to node j. After all N steps, you've fully specified the graph. The adjacency sequence is (S1, S2, ..., SN).
This is just the chain rule of probability. Any valid probability distribution over graphs can be decomposed this way. The generative model just needs to learn the conditional distribution p(Si | S1, ..., Si-1) — the probability of the next node's connections given everything added so far.
You could flatten everything into one long binary sequence: the upper triangle of the adjacency matrix, read row by row. This gives N(N−1)/2 binary decisions. A single RNN over this flat sequence is GraphRNN-S (the simpler baseline in the paper).
The problem: the RNN must maintain a hidden state that captures "which node row am I in?" and "which column within the row?" simultaneously. The two-level architecture makes these responsibilities explicit — the outer RNN handles rows, the inner RNN handles columns within a row.
The architecture has two RNNs: a graph-level RNN (h-RNN) that tracks the state of the graph as a whole, and an edge-level RNN (e-RNN) that generates the adjacency row for each new node.
The h-RNN maintains a hidden state hi that summarizes all nodes and edges added so far. After generating node i's full adjacency row Si, the h-RNN updates: hi+1 = ftrans(hi, Si). The final hidden state of the e-RNN (for step i) becomes the input to the h-RNN for the next transition.
To generate node i's adjacency row, the e-RNN is initialized with hi and then runs for i−1 steps. At each step j, it outputs a probability θi,j = p(edge from i to j | graph so far, edges i→1, ..., i→(j-1)). You sample si,j ∈ {0,1} from Bernoulli(θi,j) and feed it as input to the next step.
Watch GraphRNN build a graph node-by-node. The h-RNN state (top bar) updates after each node. The e-RNN (bottom) generates each adjacency row. Click Next Node to add a node.
How does the model know when to stop adding nodes? The paper adds a special end-of-sequence (EOS) token. After generating each adjacency row Si, the h-RNN also outputs a probability pstop. If sampled as stop, generation ends. During training, this is supervised by the actual graph size.
A graph with N nodes has N! possible orderings. The autoregressive decomposition requires fixing one ordering — but which? This choice has massive practical consequences.
With random ordering, node i's adjacency row has length i−1. For a graph with 1,000 nodes, the last row has 999 entries. Total sequence length: N(N−1)/2 ≈ 500,000 steps. This is far too long for an RNN to handle well — gradients vanish, memory explodes.
But here's the key observation: in most real graphs, nodes have low degree. In a social network with 1,000 nodes, each person is friends with maybe 20 others — not 500. If node i's neighbors all appear before it in the ordering, then most of the i−1 decisions in Si are trivially 0. We only need to track the entries up to the last 1.
Run a Breadth-First Search (BFS) from a random start node. This produces a node ordering where nodes tend to be adjacent to nearby nodes in the ordering. Formally, let π be the BFS ordering. For node π(i), the adjacency row only needs to extend back M positions — where M is the BFS "bandwidth" (the maximum number of nodes at any BFS frontier).
For many real-world graphs, M is O(√N) or smaller. This reduces sequence length from O(N²) to O(N√N) — a huge saving that makes training feasible.
For the same graph, BFS ordering (orange) produces far shorter effective sequences than random ordering (teal). The window M is the key compression factor.
Training GraphRNN is teacher forcing: at each step, feed the ground-truth edge (0 or 1) as input — not the model's own sample. This makes training stable. At inference, you switch to sampling from the model's output distribution.
Each step of the e-RNN outputs a probability θi,j ∈ (0,1). The ground-truth is the actual edge si,j ∈ {0,1}. The loss is binary cross-entropy summed over all steps:
This is equivalent to maximizing the log-likelihood of the training graphs under the autoregressive model. Because most entries in Si are 0 (sparse graphs), the model quickly learns to predict 0 most of the time — the hard part is getting the 1s right.
Since BFS ordering is not unique (depends on the start node and tie-breaking), each training graph can be converted into several different sequences by running BFS from different start nodes. This provides natural data augmentation and makes the model more robust to ordering choices at inference time.
python import networkx as nx import random def graph_to_bfs_sequence(G): # Pick random BFS root start = random.choice(list(G.nodes())) bfs_order = [n for n, _ in nx.bfs_edges(G, start)] bfs_order = [start] + bfs_order # Re-index nodes by BFS order mapping = {node: i for i, node in enumerate(bfs_order)} G_reindexed = nx.relabel_nodes(G, mapping) # Extract adjacency rows (upper triangle only) sequences = [] for i in range(len(bfs_order)): row = [1 if G_reindexed.has_edge(i, j) else 0 for j in range(i)] sequences.append(row) return sequences # List of binary rows
How do you tell if generated graphs are "realistic"? You can't use standard metrics like accuracy — there's no single correct answer. The paper uses Maximum Mean Discrepancy (MMD), a statistical test for comparing two distributions without requiring explicit density estimation.
Given a set of real graphs and a set of generated graphs, MMD measures how different their distributions are. It works by comparing graph-level statistics: degree distributions, clustering coefficient distributions, and orbit count distributions.
where k is a kernel function (Gaussian in the paper). A lower MMD means the generated distribution is closer to the real one. MMD = 0 would mean perfect replication.
Graphs are hard to compare visually — two graphs with very different aesthetics can have identical statistical properties. And vice versa: graphs that look similar might have very different clustering or motif structures. MMD over statistics is the honest way to benchmark.
Two sets of graphs (real vs generated). Their degree distributions are shown as histograms. A good model has overlapping histograms — low MMD. Adjust the "generation bias" to see how MMD changes.
GraphRNN was evaluated on four graph families: Ladder graphs (synthetic, highly regular), 2D grids (synthetic), Barabási-Albert (BA) graphs (scale-free), and protein graphs from the DD dataset (real biological data, up to 500 nodes).
| Method | Ladder (Deg MMD) | Grid (Clust MMD) | Protein (Orbit MMD) |
|---|---|---|---|
| Kronecker | 0.058 | 0.231 | 0.426 |
| GraphVAE | 0.350 | 0.061 | — |
| DeepGMG | 0.040 | 0.220 | — |
| GraphRNN | 0.001 | 0.003 | 0.040 |
GraphRNN trains on graphs with up to 500 nodes (protein graphs). GraphVAE is limited to ~40 nodes (O(N²) parameters in decoder). KronEM fits a fixed-size template. GraphRNN scales because sequence length grows linearly in N (with BFS truncation) rather than quadratically.
GraphRNN established the autoregressive paradigm for graph generation. Many subsequent papers build on or react to it.
No node features: GraphRNN generates only graph topology — it doesn't produce feature vectors for nodes. For molecules, you need atom types, charges, bond types. The paper doesn't address this.
No validity constraints: A generated graph might violate domain rules (e.g., valence limits in molecules). GraphRNN doesn't know or enforce these — pure statistical generation.
Ordering sensitivity: Despite BFS, the model still depends on the ordering. Different orderings give different conditional factorizations, and averaging over all orderings during training would be intractable.
Slow sampling: O(N²) sequential steps to generate one graph. Each step requires an RNN forward pass. For large N, this is slow.
| Paper | Key Idea | Addresses |
|---|---|---|
| GCPN (You et al., 2018) | RL-guided generation with GCN policy | Property optimization, validity |
| JT-VAE (Jin et al., 2018) | Hierarchical junction tree decomposition | Validity (always), node features |
| GRAN (Liao et al., 2019) | Graph-recurrent attention network, faster | Speed, scalability |
| DiGress (Vignac et al., 2022) | Discrete diffusion over graphs | Non-sequential, parallel generation |
This paper connects to the following GNN papers in Veanors:
"To generate complex structured objects, decompose into sequential decisions — then the rich machinery of sequence models is at your disposal."
— Paraphrase of the GraphRNN design philosophy