You can't just enumerate all valid drug molecules — there are 1060 of them. Generative models learn the distribution of real graphs, then sample new ones with desired properties: potent, soluble, synthesizable.
Drug discovery costs $2.6 billion and 12 years to bring one drug to market. The bottleneck isn't synthesis or trials — it's finding the right molecule to test. The chemical space has an estimated 1060 drug-like molecules. We can't enumerate them. We can't randomly sample them — only 1 in 1014 random molecules is drug-like at all. We need to generate molecules with desired properties on demand.
This is the core problem: given a distribution over "good" graphs (molecules, social networks, protein interaction networks), learn a model that can sample new graphs from that distribution — and ideally, sample graphs with a specific target property (high binding affinity, low toxicity, synthesizable).
Generating images is hard but well-studied: pixels form a 2D grid, output is a fixed-size matrix, order is canonical (row by row). Generating graphs is harder on every axis:
Visualizing the chemical space. Random sampling almost never hits valid drug-like molecules. Generative models learn to sample from the valid, drug-like region directly.
The most natural approach: build the graph one piece at a time. Add a node. Add its edges to existing nodes. Add another node. Repeat until done. This converts graph generation into a sequence generation problem — and we have powerful models (RNNs, Transformers) for sequences.
The key design choice is the generation order: in what order do we add nodes and edges? Unlike images (natural row-by-row order), graphs have no inherent order. We must choose one — and the choice affects how complex the dependencies become.
Sequential graph generation (pseudocode) def generate_graph(model): G = Graph() # start with empty graph h = model.initial_state() # RNN hidden state while not model.stop(h): # Decide whether to add a new node node_type = model.sample_node_type(h) v = G.add_node(node_type) # For each existing node, decide edge existence for u in G.nodes: p_edge = model.edge_prob(h, u, v) if sample(p_edge): G.add_edge(u, v) h = model.update(h, G) # update state with new graph return G
Watch how different strategies add nodes and edges. Notice how node-by-node requires quadratic edge decisions while edge-by-edge is linear in edges.
You et al. (2018) built the first scalable deep graph generator. The insight: use a node-level RNN to generate one node at a time, and for each new node, use an edge-level RNN to generate its connections to all previous nodes. Two RNNs — one for the node sequence, one for each node's edge row — nested inside each other.
During training, the model sees real graphs. At each step, the ground-truth adjacency row is fed as the "observed sequence" for the edge RNN. The loss is binary cross-entropy between predicted edge probabilities and ground-truth edges. At generation time, the model samples from its own predictions — a process called autoregressive sampling.
Where at,i ∈ {0,1} is the ground-truth edge between node t and node t-i, and pt,i is the predicted probability. This is exactly binary cross-entropy summed over all (node, position) pairs.
Click "Add Node" to add one node at a time. For each new node, the edge RNN decides which previous nodes to connect. Watch the graph build up step by step. Use the density slider to control connectivity.
How do you know if your generated graphs are any good? You can't compare them graph-by-graph — the generated graphs are new, not reconstructions of training graphs. You need to compare distributions of graphs. This is the graph evaluation problem, and it's surprisingly hard.
The standard approach: compute a set of graph-level statistics on a large sample of generated graphs, and compare their distributions to the same statistics on real graphs. If the distributions match, the generative model has learned the real distribution.
MMD measures the distance between two distributions by comparing their mean embeddings in a kernel feature space. For graph evaluation, you compute a graph statistic (e.g., the degree distribution) for each generated graph, embed these statistics in a kernel space, and compute MMD against the real graph statistics.
Where k is a kernel function (Gaussian RBF is common) and P, Q are the real/generated distributions over graph statistics. MMD = 0 means the distributions are identical; larger MMD means larger distributional gap.
For molecules specifically, FCD (Fréchet ChemNet Distance) embeds molecules using a pretrained chemical neural network, computes Fréchet distance between the real and generated molecule embeddings. This captures chemical validity and similarity simultaneously — not just structural properties but chemical semantics.
| Metric | What it measures | Range | Better = |
|---|---|---|---|
| MMD (degree) | Degree distribution match | [0, ∞) | Lower |
| MMD (clustering) | Triangle density match | [0, ∞) | Lower |
| Validity | Fraction of valid molecules | [0, 1] | Higher |
| Uniqueness | Fraction of non-duplicate valid molecules | [0, 1] | Higher |
| Novelty | Fraction not in training set | [0, 1] | Higher |
| FCD | Chemical distribution distance | [0, ∞) | Lower |
Comparing degree distributions of real graphs (warm) vs. generated graphs (teal). A good generator should overlay almost perfectly. Click to see different model quality levels.
GraphRNN learns to generate graphs that look like training graphs. But drug discovery needs more: generate a molecule with specific target properties — high QED (drug-likeness score), low toxicity, high binding affinity. This is goal-directed generation, and it requires optimization, not just sampling.
GCPN (Graph Convolutional Policy Network, You et al. 2018) frames molecule generation as an MDP and uses RL to optimize molecular properties while maintaining chemical validity.
GCPN reward function (simplified) def compute_reward(mol, step, final): r = 0.0 # Step-level: penalize valence violations if has_valence_violation(mol): r -= 0.1 if final: # Molecular property reward (e.g., penalized logP) r += 1.0 * penalized_logp(mol) # Adversarial reward (discriminator output) r += 0.5 * discriminator.score(mol) # Similarity constraint (don't stray too far from valid space) if is_valid(mol): r += 0.5 return r
Watch how the molecule's penalized logP score improves as GCPN trains over episodes. Early molecules are random; later ones are specifically optimized. The adversarial reward prevents runaway optimization.
GraphRNN and GCPN generate atoms one at a time. But chemists don't think about molecules atom by atom — they think about scaffolds and functional groups. Benzene rings. Carboxyl groups. Amide bonds. These substructures appear repeatedly across drug-like molecules. Junction Tree VAE (JT-VAE, Jin et al. 2018) generates molecules at this higher level of abstraction.
Every molecule can be decomposed into a junction tree — a tree where each node is a ring or bond (a "motif"), and edges represent how motifs are fused. The molecule is assembled by combining motifs according to this tree structure.
Where T is the junction tree, G is the full molecule, and z is the shared latent vector. The first term rewards accurate tree reconstruction, the second rewards accurate graph assembly, and the KL term regularizes the latent space to be smooth (so we can interpolate between molecules).
Click a molecule to see how it decomposes into motifs (rings and bonds) forming a junction tree. The tree structure is much simpler than the full atom graph.
Score-based diffusion models dominated image generation (DDPM, Stable Diffusion). The same idea applies to graphs: add noise to a real graph until it becomes a random graph, then learn to reverse the process — starting from random noise and gradually denoising into a valid graph.
For images, noise means adding Gaussian noise to pixel values. For graphs, there are two noise models:
The reverse process pθ(Gt-1 | Gt) is learned by a GNN that sees the noisy graph Gt and predicts either the denoised graph G0 (x-prediction) or the noise ε (noise-prediction). Training minimizes the expected denoising error over all noise levels.
GDSS (Jo et al. 2022) uses continuous stochastic differential equations for both node features and adjacency simultaneously. DiGress (Vignac et al. 2022) uses discrete diffusion — categorical node/edge type distributions that transition via a Markov matrix at each step. DiGress achieves state-of-the-art on molecular generation benchmarks.
DiGress: discrete diffusion forward process import torch # Q_t: transition matrix for discrete edge types at timestep t # Q_t[i,j] = probability of type i transitioning to type j def get_Qt(t, T=1000): alpha_t = 1.0 - t / T # noise schedule (linear) # Absorbing state: with prob (1-alpha_t), jump to "noise" type Qt = alpha_t * torch.eye(num_edge_types) + \ (1 - alpha_t) * torch.ones(num_edge_types, num_edge_types) / num_edge_types return Qt # Marginal at time t: E_t = E_0 @ Qt_bar (product of Qt's) # Denoising: GNN predicts E_0 given noisy graph E_t
Drag the slider to see a graph at different noise levels. Left = clean molecular graph. Right = fully noised (Erdős–Rényi). Generation runs right to left.
Molecular generation has standardized benchmarks that measure whether generated molecules are actually useful for drug discovery — not just structurally valid, but chemically desirable. The two canonical property scores are QED and penalized logP.
QED scores a molecule from 0 to 1 based on how closely it matches properties of known oral drugs: molecular weight (200-500 Da), lipophilicity (logP 0-5), polar surface area, number of hydrogen bond donors/acceptors, and absence of known problematic groups. QED = 1 means the molecule looks exactly like known drugs.
Where di(mol) is the desirability score for property i (using predefined bell curves matching known drug distributions) and wi are fixed weights. This is a weighted geometric mean of desirability scores.
Raw logP (octanol-water partition coefficient) measures lipophilicity — how much a drug can cross cell membranes. But high-logP molecules tend to be too greasy, poorly soluble, and toxic. Penalized logP subtracts two penalty terms:
Where SA is the synthetic accessibility score (0-10, lower is easier to synthesize), and ringPenalty penalizes macrocycles (large rings that are hard to synthesize). Goal-directed generation optimizes for high plogP — but early methods found pathological solutions (giant rings, unusual atoms) that were structurally valid but not synthesizable.
| Model | QED (top-3 avg) | plogP (top-3 avg) | Validity | Key innovation |
|---|---|---|---|---|
| GraphRNN | 0.89 | 3.1 | 98.1% | Two-RNN autoregressive |
| GCPN | 0.948 | 7.98 | 100% | RL-guided with GCN policy |
| JT-VAE | 0.925 | 5.30 | 100% | Hierarchical motif generation |
| DiGress | 0.952 | 9.30 | 100% | Discrete graph diffusion |
| GDSS | 0.923 | 7.62 | 95.7% | SDE-based continuous diffusion |
Comparing plogP distributions of real drug molecules (warm) vs. GCPN-generated (teal). GCPN successfully shifts the distribution toward higher plogP. But does it go too far?
Graph generation is not just a toy problem. Three domains have demonstrated production-scale impact: drug discovery, material design, and protein engineering.
Graph generation sits at the intersection of graph learning, generative modeling, and chemistry. Understanding where it connects to adjacent topics reveals the full scope of what you've learned.
| Topic | Connection to Graph Generation | Learn more |
|---|---|---|
| GNN Basics | GraphRNN, GCPN, JT-VAE all use GCN/GGNN as the neural backbone for encoding partial graphs. Message passing = reading the current graph state. | Lec 3: GNN |
| Graph Theory | WL isomorphism test bounds what GNNs can distinguish — relevant to whether the model can differentiate valid from invalid subgraphs during generation. | Lec 6: Theory |
| RL Algorithms | GCPN uses PPO for goal-directed generation. The molecule-building MDP is a classic RL setting with sparse reward at episode end. | RL Algorithms |
| Agents + Graphs | GCPN's policy is essentially a graph-grounded agent: at each step, it reads the current graph state, reasons about which action improves properties, and executes. | Lec 17: Agents |
| VAE/VQVAE | JT-VAE directly extends the VAE framework to graph-structured latent spaces. The reparameterization trick and ELBO apply identically. | VAE/VQVAE |
| Diffusion Models | DiGress and GDSS extend DDPM's forward-reverse process to discrete/continuous graph spaces. The score function is now a GNN, not a U-Net. | Diffusion |
"Making drugs is hard. Making drug molecules is easy — once you have a good generator. The hard part is learning what 'good' means."