You et al. — Stanford University, NeurIPS 2018

GCPN: Graph Convolutional Policy Network for Molecular Generation

Grow molecules atom-by-atom using a GCN that reads the current molecular graph and outputs which atom or bond to add next — then train it with RL to maximize drug-likeness.

Prerequisites: Graph neural networks (GCN) + Reinforcement learning basics (policy, reward) + Basic chemistry (atoms, bonds)
8
Chapters
4+
Simulations

Chapter 0: The Problem

Drug discovery is about finding molecules with specific properties: high bioactivity against a target protein, drug-like behavior in the body (the molecule should be absorbable, stable, non-toxic), and good binding affinity. The chemical space of possible drug-like molecules is estimated at 1060 — exploring it by experiment alone would take the lifetime of the universe.

This is the goal-directed molecular generation problem: design a model that generates novel molecules specifically optimized for desired properties. "Novel" means not just memorized from training data. "Optimized" means the model actively steers toward high-property molecules.

Why pure generative models aren't enough: A VAE or GraphRNN learns the distribution of known molecules — it generates things that look like training data. But drugs require optimization, not just plausible generation. You need a model that can explore regions of chemical space that are rare or absent in training data, guided by a property oracle.

The Two Challenges

Validity: A randomly assembled sequence of atoms and bonds is almost certainly not a valid molecule. Most combinations violate valence rules (carbon can't have 5 bonds), ring closure constraints, or basic chemistry. Any generation process must respect these constraints.

Optimization: Even among valid molecules, the space is enormous. You need a way to optimize over this discrete, structured space where you can't take gradients directly (the oracle — a chemical property function like QED or DRD2 binding score — is not differentiable).

GCPN's answer: frame molecular construction as a Markov Decision Process (MDP). The state is the current partially-built molecule. The action is which atom or bond to add next. The reward is the property score at the end. Train a Graph Convolutional Policy Network with Proximal Policy Optimization (PPO) to maximize expected reward — with validity constraints baked in as action masks.

Molecular Generation as Sequential Decisions

Molecules are built step by step. Each step: select an atom type and a scaffold attachment point. Chemical rules constrain which actions are valid. The final molecule receives a property score as reward.

Start building a molecule one atom at a time.
What makes goal-directed molecular generation different from standard generative models like GraphRNN?

Chapter 1: Action Space

The MDP formulation requires a precise definition of what "actions" the policy can take when building a molecule. GCPN defines two types of actions.

Action Type 1: Add Atom

Select an atom type from a vocabulary (C, N, O, F, S, Cl, Br, ...) and attach it to an existing atom in the molecule. This is the "growth" action: the molecule gains one atom.

Formally: action = (anew, aexisting, b) where anew is the new atom type, aexisting is which existing atom to connect to, and b is the bond type (single, double, triple, aromatic).

Action Type 2: Connect (Add Bond)

Add a bond between two existing atoms in the molecule. This is the "ring closure" action: it creates cycles without adding new atoms. Ring closures are crucial for creating the ring systems common in drugs (benzene rings, piperidine, etc.).

Formally: action = (a1, a2, b) where a1 and a2 are existing atom indices and b is the bond type.

The action space size: For a molecule with N atoms, there are roughly |atom_vocab| × N × |bond_types| + N² × |bond_types| possible actions. As the molecule grows, the action space grows too. But at any given step, most actions are chemically invalid (violate valence). The policy must learn to select valid actions — or the system masks out invalid ones.

A Scaffold as the Starting State

GCPN can start from a molecular scaffold — a known pharmacophore or fragment — rather than a single atom. This is useful in real drug discovery: medicinal chemists often want to optimize around a known core structure. The RL policy then learns which atoms to attach to the scaffold to maximize the target property.

State st
Current partial molecule Gt — a graph with atom features and bond types
↓ GCN policy π(a|s)
Action at
Either (add atom: type, attach to, bond) or (add bond: atom1, atom2, bond)
↓ apply action + validity check
State st+1
Updated molecule Gt+1 with new atom or bond
↓ terminal? compute reward
Reward rT
Property score (QED, logP, binding affinity) + step penalty + validity bonus
What is the "connect" action in GCPN's action space?

Chapter 2: GCN Policy Network (Showcase)

The policy network must take the current molecule graph as input and output a probability distribution over actions. A flat feedforward network can't do this — the input size changes as the molecule grows, and the network must be permutation-equivariant (the output shouldn't depend on the arbitrary numbering of atoms).

The answer: a Graph Convolutional Network (GCN). It takes node features (atom types) and edge features (bond types) and computes a representation for each atom via neighborhood aggregation. These representations are then used to score potential actions.

Input Representation

Each atom i has a feature vector xi encoding: atom type (one-hot over C, N, O, F, S, Cl, Br, H), degree, formal charge, number of radical electrons, whether it's in a ring, whether it's aromatic. Concatenated to a ~39-dimensional vector.

Each bond is represented by bond type (single=0, double=1, triple=2, aromatic=3) encoded as a relation type in the GCN layers.

GCN Forward Pass

h(l+1)i = ReLU( W(l) · AGGREGATEj∈N(i)( h(l)j ) + b(l) )

After L=3 GCN layers, each atom has a 64-dimensional embedding. A global pooling (mean) gives a molecule-level embedding for the value function.

From Embeddings to Actions

To score the action "connect atom i to atom j with bond type b": concatenate embeddings hi and hj, pass through an MLP, apply softmax. To score "add new atom of type t to atom i": embed t as a learnable vector, concatenate with hi, MLP, softmax.

GCN Policy — Visualizing Atom Embeddings

A partial benzene ring. Each atom has a GCN embedding (color = embedding magnitude). The policy scores each possible next action — highlighted in orange. Watch embeddings update as you add atoms.

Building benzene (6 carbons in a ring). Add atoms, then close the ring.

Computational Cost

The GCN runs once per action selection step. For a molecule with N atoms and K GCN layers with hidden size H, this costs O(N × H² × K) per step. For drug-like molecules (N typically 20–40 atoms), this is fast — milliseconds per forward pass. The bottleneck is the property oracle (DFT calculations, docking) not the policy network.

Why is a GCN (rather than a feedforward MLP) used as the policy network in GCPN?

Chapter 3: RL Training with PPO

GCPN is trained with Proximal Policy Optimization (PPO) — a policy gradient method that prevents the policy from updating too aggressively in a single step. Combined with adversarial training (a GAN discriminator for distribution matching), GCPN optimizes both property scores and chemical realism.

The Reward Signal

Rewards are designed to balance three objectives simultaneously:

r = αprop · rproperty + αvalid · rvalidity + αadv · radversarial

rproperty: The actual drug-likeness score (QED for drug-likeness, penalized logP for lipophilicity optimization, or DRD2 binding score from a pretrained classifier). This is the main optimization target. It's only observed at the terminal state (complete molecule).

rvalidity: A small reward (+1) for each step that results in a chemically valid intermediate molecule. This dense signal helps the policy learn fast — rather than only getting feedback at the end of a 30-step trajectory.

radversarial: A GAN-style discriminator distinguishes generated molecules from training set molecules. This reward encourages the policy to generate molecules that look like real drugs — preventing it from finding high-property but chemically weird structures that game the oracle.

The discriminator trick: Without the adversarial reward, PPO might find molecules that score high on QED or logP but are structurally bizarre — unlikely to actually work as drugs. The discriminator acts as a learned prior: "does this molecule look like something a medicinal chemist would make?" This prevents reward hacking.

PPO: Why Not Vanilla Policy Gradient?

Vanilla REINFORCE has high variance and can make catastrophically large policy updates that push the agent into a bad regime it can't recover from. PPO clips the probability ratio r(θ) = πθ(a|s) / πθold(a|s) to [1−ε, 1+ε], preventing any single update from moving the policy too far.

LPPO(θ) = Et [ min( rt(θ) At, clip(rt(θ), 1−ε, 1+ε) At ) ]

where At is the advantage: how much better the action at step t was compared to what the value function expected. The value function (a separate MLP head on top of the GCN) is trained simultaneously to predict expected cumulative reward.

What is the purpose of the adversarial reward component radversarial in GCPN's training?

Chapter 4: Chemical Validity

One of GCPN's claims is that it generates chemically valid molecules. How is this guaranteed?

Valence Rules as Action Masks

In chemistry, each atom type has a maximum valence — the number of bonds it can form. Carbon: 4. Nitrogen: 3. Oxygen: 2. Fluorine: 1. Before the policy selects an action, GCPN computes which actions would violate these rules and masks them out (sets their logit to −∞ before softmax). This is a hard constraint: the policy literally cannot select an invalid action.

Validity rates: GCPN reports 100% chemical validity on generated molecules. This is because of the valence masking — every action selected is guaranteed not to create an overvalent atom. The final molecule might still be "unusual" (rare ring system, unusual functional group), but it will always be a valid graph with correct valence.

What "Valid" Doesn't Mean

Chemical validity (correct valence) is necessary but not sufficient for a useful drug. Additional filters in practice include:

GCPN's property optimization uses penalized logP which subtracts a ring-counting penalty and SA score from logP — incorporating synthetic accessibility directly into the reward.

Valence Masking — Which Actions Are Legal?

For each atom in a partial molecule, the available valence determines which bonds can still be added. Green = has free valence (can bond). Red = fully saturated (action masked out).

Green atoms have free valence and can accept new bonds. Red atoms are saturated.
How does GCPN guarantee that generated molecules are chemically valid (correct valence)?

Chapter 5: Property Optimization

GCPN is evaluated on two drug-discovery tasks: optimizing QED (drug-likeness) and optimizing penalized logP (lipophilicity adjusted for rings and synthetic accessibility).

QED: Drug-Likeness

QED (Quantitative Estimate of Druglikeness) is a composite score in [0, 1] that combines 8 molecular properties: molecular weight, logP, number of H-bond donors, H-bond acceptors, aromatic rings, rotatable bonds, polar surface area, and number of structural alerts. QED = 1 is perfectly drug-like; QED = 0 is completely non-drug-like. Most approved drugs have QED > 0.6.

QED = exp( (1/8) ∑i=18 log(di) )

where di ∈ [0,1] is a desirability function for property i, derived from the distribution of approved drugs.

Penalized logP

logP (partition coefficient) measures how lipophilic a molecule is — high logP means it dissolves in fat rather than water, which affects how the drug distributes in the body. The penalized version:

Penalized logP = logP − SA − ring_penalty

SA is the synthetic accessibility score (1=easy to make, 10=impossible), and ring_penalty counts number of rings with >6 atoms (these are usually non-drug-like). This prevents the optimizer from finding high-logP molecules that are unsynthesizable or have bizarre ring systems.

Property Score Explorer

Adjust molecular properties and see how QED and penalized logP respond. This simulates the landscape GCPN is navigating during RL training.

Mol. weight 300
logP 2.0
SA score 3.0
Why is the SA (synthetic accessibility) score subtracted in the penalized logP objective?

Chapter 6: Results

Baselines

Junction Tree VAE (JT-VAE): Encodes the molecule into a latent space, then does Bayesian optimization over the latent space to find high-property molecules. Slow (requires many forward/backward passes per query), but produces valid molecules.

ORGAN: RNN-based generator with RL objectives and adversarial training. Generates SMILES strings sequentially. Not graph-based — must parse SMILES for validity, which can fail.

REINVENT: RNN over SMILES with REINFORCE. The standard pharmaceutical industry baseline.

QED Optimization Results

MethodTop-1 QEDTop-3 QEDTop-10 QEDNovelty %
ORGAN0.8960.8880.876100%
JT-VAE0.9250.9110.896100%
GCPN0.9480.9470.946100%

Penalized logP Optimization Results

MethodTop-1 PlogPTop-3 PlogPTop-10 PlogP
JT-VAE (BO)5.304.934.49
GCPN7.987.857.80
The key takeaway: GCPN significantly outperforms JT-VAE+BO on penalized logP (7.98 vs 5.30 top-1). This is because RL directly optimizes the property, while Bayesian optimization over the latent space is limited by the quality of the VAE's latent structure. For QED, both are near ceiling — GCPN is marginally better.

Constrained Optimization

An additional experiment: modify molecules from the test set to improve their property by at most δ = 0.6 Tanimoto distance (so the generated molecule must be "close" to the input molecule). GCPN achieves 100% success rate vs. JT-VAE's 46.4% at δ = 0.6. Scaffold-constrained generation is where graph-based RL really shines.

On which property does GCPN show the largest improvement over JT-VAE?

Chapter 7: Connections & Beyond

Limitations

3D structure ignored: GCPN works on 2D molecular graphs — atoms and bonds — with no 3D geometry. Drug activity often depends critically on 3D shape (stereochemistry, 3D pharmacophore matching). More recent methods (CVAE, TargetDiff) include 3D coordinates.

Property oracle assumed accessible: GCPN assumes you can evaluate the property score for any molecule during training. For binding affinity against a protein, this might require expensive docking simulations or even wet-lab experiments. Sample efficiency matters.

Long-range structure: GCNs have limited depth (3 layers = 3-hop radius). Long-range interactions in molecules (conformational effects, allosteric binding) may not be captured.

The Molecular Generation Landscape

PaperApproachValidityOptimization
GraphRNN (2018)Sequential AR, two RNNsStatisticalNone (distribution match)
GCPN (2018)RL + GCN, atom-by-atomHard (valence mask)RL (PPO)
JT-VAE (2018)Hierarchical VAE, substructuresHard (valid fragments)Bayesian optimization
MolDQN (2019)DQN, atom-by-atomHard (valence mask)RL (Q-learning)
TargetDiff (2023)3D diffusion, structure-basedStatisticalStructure-conditioned
The RL formulation is general: The idea of using RL to guide graph construction isn't limited to molecules. The same approach has been applied to knowledge graph completion, program synthesis, and network design. GCPN's main contribution is showing that a GCN-based policy can efficiently navigate the combinatorial action space of molecular graph construction.

Related Lessons

"Drug discovery is a combinatorial optimization problem. Reinforcement learning is a framework for combinatorial optimization. The match is too natural to ignore."
— Paraphrase of the GCPN design rationale