Grow molecules atom-by-atom using a GCN that reads the current molecular graph and outputs which atom or bond to add next — then train it with RL to maximize drug-likeness.
Drug discovery is about finding molecules with specific properties: high bioactivity against a target protein, drug-like behavior in the body (the molecule should be absorbable, stable, non-toxic), and good binding affinity. The chemical space of possible drug-like molecules is estimated at roughly $10^{60}$ structures; exploring it by experiment alone would take the lifetime of the universe.
This is the goal-directed molecular generation problem: design a model that generates novel molecules specifically optimized for desired properties. "Novel" means not just memorized from training data. "Optimized" means the model actively steers toward high-property molecules.
Validity: A randomly assembled sequence of atoms and bonds is almost certainly not a valid molecule. Most combinations violate valence rules (carbon can't have 5 bonds), ring closure constraints, or basic chemistry. Any generation process must respect these constraints.
Optimization: Even among valid molecules, the space is enormous. You need a way to optimize over this discrete, structured space where you can't take gradients directly (the oracle — a chemical property function like QED or DRD2 binding score — is not differentiable).
GCPN's answer: frame molecular construction as a Markov Decision Process (MDP). The state is the current partially-built molecule. The action is which atom or bond to add next. The reward is the property score at the end. Train a Graph Convolutional Policy Network with Proximal Policy Optimization (PPO) to maximize expected reward — with validity constraints baked in as action masks.
Molecules are built step by step. Each step: select an atom type and a scaffold attachment point. Chemical rules constrain which actions are valid. The final molecule receives a property score as reward.
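The MDP framing can be made concrete with a toy episode loop. This is only a sketch under loud assumptions: the state is a linear chain of atom symbols rather than a real molecular graph, `toy_property` is a made-up stand-in for an oracle such as QED, and `random_policy` replaces the learned policy network. All names here are hypothetical.

```python
import random

ATOM_TYPES = ["C", "N", "O"]
STOP = "STOP"

def toy_property(atoms):
    """Stand-in oracle (NOT a chemical property): fraction of heteroatoms."""
    return sum(a != "C" for a in atoms) / len(atoms)

def rollout(policy, max_steps=10):
    """Run one episode: grow the molecule until STOP or the step limit."""
    state = ["C"]                        # start from a single carbon atom
    trajectory = []                      # (state, action) pairs for training
    for _ in range(max_steps):
        action = policy(state)
        if action == STOP:
            break
        trajectory.append((list(state), action))
        state = state + [action]         # chain growth only, for brevity
    reward = toy_property(state)         # reward observed only at the end
    return state, trajectory, reward

rng = random.Random(0)

def random_policy(state):
    """Placeholder for the learned policy: picks uniformly at random."""
    return rng.choice(ATOM_TYPES + [STOP])

final_state, trajectory, reward = rollout(random_policy)
```

An RL algorithm would consume `trajectory` and `reward` to update the policy; here they only show what the agent observes in one episode.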
The MDP formulation requires a precise definition of what "actions" the policy can take when building a molecule. GCPN defines two types of actions.
Select an atom type from a vocabulary (C, N, O, F, S, Cl, Br, ...) and attach it to an existing atom in the molecule. This is the "growth" action: the molecule gains one atom.
Formally: action = $(a_{\text{new}}, a_{\text{existing}}, b)$ where $a_{\text{new}}$ is the new atom type, $a_{\text{existing}}$ is which existing atom to connect to, and $b$ is the bond type (single, double, triple, aromatic).
Add a bond between two existing atoms in the molecule. This is the "ring closure" action: it creates cycles without adding new atoms. Ring closures are crucial for creating the ring systems common in drugs (benzene rings, piperidine, etc.).
Formally: action = $(a_1, a_2, b)$ where $a_1$ and $a_2$ are existing atom indices and $b$ is the bond type.
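The two action types map naturally onto two small record types. The sketch below is one possible encoding, not the paper's data structures: `MolGraph`, `AddAtom`, `AddBond`, and `apply_action` are hypothetical names, and no validity checking is done at this stage.

```python
from dataclasses import dataclass, field

@dataclass
class MolGraph:
    """Minimal molecular graph: atom symbols plus a bond dictionary."""
    atoms: list = field(default_factory=list)   # e.g. ["C", "O"]
    bonds: dict = field(default_factory=dict)   # {(i, j): bond_type}, i < j

@dataclass(frozen=True)
class AddAtom:
    new_atom: str    # atom type to add, e.g. "N"
    existing: int    # index of the atom it attaches to
    bond: str        # bond type, e.g. "single"

@dataclass(frozen=True)
class AddBond:
    a1: int          # first existing atom index
    a2: int          # second existing atom index
    bond: str        # bond type

def apply_action(mol, action):
    """Return a new graph with the action applied (no validity checks)."""
    atoms = list(mol.atoms)
    bonds = dict(mol.bonds)
    if isinstance(action, AddAtom):
        atoms.append(action.new_atom)
        i, j = action.existing, len(atoms) - 1   # new atom gets the last index
    else:
        i, j = sorted((action.a1, action.a2))
    bonds[(i, j)] = action.bond
    return MolGraph(atoms, bonds)

mol = MolGraph(atoms=["C"])
mol = apply_action(mol, AddAtom("C", 0, "single"))   # growth
mol = apply_action(mol, AddAtom("C", 1, "single"))   # growth
mol = apply_action(mol, AddBond(0, 2, "single"))     # ring closure
```

The last call closes a three-membered ring without changing the atom count, which is exactly what distinguishes the second action type from the first.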
GCPN can start from a molecular scaffold — a known pharmacophore or fragment — rather than a single atom. This is useful in real drug discovery: medicinal chemists often want to optimize around a known core structure. The RL policy then learns which atoms to attach to the scaffold to maximize the target property.
The policy network must take the current molecule graph as input and output a probability distribution over actions. A flat feedforward network can't do this — the input size changes as the molecule grows, and the network must be permutation-equivariant (the output shouldn't depend on the arbitrary numbering of atoms).
The answer: a Graph Convolutional Network (GCN). It takes node features (atom types) and edge features (bond types) and computes a representation for each atom via neighborhood aggregation. These representations are then used to score potential actions.
Each atom i has a feature vector $x_i$ encoding: atom type (one-hot over C, N, O, F, S, Cl, Br, H), degree, formal charge, number of radical electrons, whether it's in a ring, and whether it's aromatic. These features are concatenated into a ~39-dimensional vector.
Each bond is represented by bond type (single=0, double=1, triple=2, aromatic=3) encoded as a relation type in the GCN layers.
After L=3 GCN layers, each atom has a 64-dimensional embedding. A global pooling (mean) gives a molecule-level embedding for the value function.
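A simplified version of this computation fits in a few lines of plain Python. The sketch assumes mean aggregation over neighbors, one weight matrix per bond type, a ReLU nonlinearity, and random untrained weights; the hidden size is 4 instead of 64 to keep it readable. The function names are hypothetical and the exact layer used in GCPN may differ.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def gcn_layer(h, edges, W_self, W_rel):
    """One message-passing layer with a separate weight matrix per bond type.

    h:      list of per-atom feature vectors
    edges:  list of (i, j, bond_type) tuples, treated as undirected
    W_self: weight matrix applied to the atom's own features
    W_rel:  dict bond_type -> weight matrix applied to neighbor features
    """
    out_dim = len(W_self)
    msgs = [[0.0] * out_dim for _ in h]
    deg = [0] * len(h)
    for i, j, b in edges:
        for src, dst in ((i, j), (j, i)):
            m = matvec(W_rel[b], h[src])
            msgs[dst] = [a + c for a, c in zip(msgs[dst], m)]
            deg[dst] += 1
    out = []
    for v, hv in enumerate(h):
        own = matvec(W_self, hv)
        agg = [m / max(deg[v], 1) for m in msgs[v]]              # neighbor mean
        out.append([max(0.0, o + a) for o, a in zip(own, agg)])  # ReLU
    return out

def mean_pool(h):
    """Average the atom embeddings into one molecule-level vector."""
    return [sum(col) / len(h) for col in zip(*h)]

# Toy molecule C-C=O with one-hot atom features over (C, N, O).
rng = random.Random(0)
x = [[1, 0, 0], [1, 0, 0], [0, 0, 1]]
edges = [(0, 1, 0), (1, 2, 1)]         # bond types: 0 = single, 1 = double
dims = [3, 4, 4, 4]                    # input size, then three hidden layers
layers = [(rand_matrix(rng, dims[k + 1], dims[k]),
           {b: rand_matrix(rng, dims[k + 1], dims[k]) for b in (0, 1)})
          for k in range(3)]

h = x
for W_self, W_rel in layers:
    h = gcn_layer(h, edges, W_self, W_rel)
mol_embedding = mean_pool(h)
```

Because aggregation and pooling are both averages, renumbering the atoms permutes the rows of `h` but leaves `mol_embedding` unchanged, which is the property the policy needs.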
To score the action "connect atom i to atom j with bond type b": concatenate embeddings $h_i$ and $h_j$, pass them through an MLP, and apply a softmax over the candidate actions. To score "add new atom of type t to atom i": embed t as a learnable vector, concatenate it with $h_i$, and apply an MLP followed by a softmax.
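The pairwise scoring step might look like the following. This is a sketch with hypothetical 2-dimensional embeddings and hand-picked weights standing in for a trained MLP; bond types are left out so that each candidate is just an atom pair.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mlp(x, W1, b1, w2):
    """Tiny two-layer scorer: ReLU hidden layer, scalar output."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * hv for w, hv in zip(w2, hidden))

# Hypothetical 2-d atom embeddings for a three-atom molecule.
h = [[0.2, 0.1], [0.9, 0.4], [0.3, 0.8]]
# Hypothetical fixed weights; a trained model would learn these.
W1 = [[1.0, 0.0, -1.0, 0.5], [0.0, 1.0, 0.5, -1.0], [0.5, 0.5, 0.5, 0.5]]
b1 = [0.0, 0.1, -0.2]
w2 = [1.0, -0.5, 0.8]

# Candidate "connect atom i to atom j" actions (bond type omitted).
candidates = [(i, j) for i in range(len(h)) for j in range(i + 1, len(h))]
# h[i] + h[j] is list concatenation, giving a 4-d input to the MLP.
logits = [mlp(h[i] + h[j], W1, b1, w2) for i, j in candidates]
probs = softmax(logits)
best = candidates[max(range(len(probs)), key=probs.__getitem__)]
```

During training the action would be sampled from `probs` rather than taken greedily, so that the policy keeps exploring.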
Figure: a partial benzene ring. Each atom is colored by the magnitude of its GCN embedding, and the candidate next actions scored by the policy are highlighted in orange.
The GCN runs once per action selection step. For a molecule with N atoms and K GCN layers of hidden size H, the node-wise transformations cost O(N × H² × K) per step; neighborhood aggregation adds a term proportional to the number of bonds, which is small because molecular graphs are sparse. For drug-like molecules (N typically 20–40 atoms), this is fast: milliseconds per forward pass. The bottleneck is the property oracle (DFT calculations, docking), not the policy network.
GCPN is trained with Proximal Policy Optimization (PPO) — a policy gradient method that prevents the policy from updating too aggressively in a single step. Combined with adversarial training (a GAN discriminator for distribution matching), GCPN optimizes both property scores and chemical realism.
Rewards are designed to balance three objectives simultaneously:
$r_{\text{property}}$: The target property score (QED for drug-likeness, penalized logP for lipophilicity optimization, or DRD2 binding score from a pretrained classifier). This is the main optimization target. It is only observed at the terminal state (the complete molecule).
$r_{\text{validity}}$: A small reward (+1) for each step that results in a chemically valid intermediate molecule. This dense signal helps the policy learn faster than it would if feedback arrived only at the end of a 30-step trajectory.
$r_{\text{adversarial}}$: A GAN-style discriminator distinguishes generated molecules from training set molecules. This reward encourages the policy to generate molecules that look like real drugs, preventing it from finding high-property but chemically weird structures that game the oracle.
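To see how the three terms combine along a trajectory, here is a small worked sketch. The weights `w_valid`, `w_prop`, and `w_adv`, the discount `gamma`, and the function names are assumptions for illustration; the text above does not specify how the terms are weighted.

```python
def step_rewards(valid_steps, r_property, r_adversarial,
                 w_valid=1.0, w_prop=1.0, w_adv=1.0):
    """Per-step rewards: a validity bonus at every step, with the property
    and adversarial terms added only at the final step."""
    rewards = [w_valid * (1.0 if ok else 0.0) for ok in valid_steps]
    rewards[-1] += w_prop * r_property + w_adv * r_adversarial
    return rewards

def returns_to_go(rewards, gamma=1.0):
    """Discounted sum of future rewards from each step onward."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

# Four construction steps, the third one produced an invalid intermediate.
rewards = step_rewards([True, True, False, True],
                       r_property=0.75, r_adversarial=0.25)
returns = returns_to_go(rewards)
```

With these numbers the per-step rewards are `[1.0, 1.0, 0.0, 2.0]` and the returns are `[4.0, 3.0, 2.0, 2.0]`: early steps receive credit for the terminal score, while the validity bonus gives feedback at every step.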
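Vanilla REINFORCE has high variance and can make catastrophically large policy updates that push the agent into a bad regime it can't recover from. PPO clips the probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ to $[1-\epsilon, 1+\epsilon]$, preventing any single update from moving the policy too far. The clipped surrogate objective is
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,A_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right]$$
where $A_t$ is the advantage: how much better the action at step t was compared to what the value function expected. The value function (a separate MLP head on top of the GCN) is trained simultaneously to predict expected cumulative reward.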
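The clipping rule is easy to check numerically. The sketch below computes the standard PPO clipped surrogate from log-probabilities and advantages; `eps=0.2` is a commonly used default and an assumption here, not a value taken from the text.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate objective (to be maximized)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                        # pi_new / pi_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)         # pessimistic bound
    return total / len(advantages)

# Good action whose probability grew by 1.5x: the gain is capped at 1 + eps.
up = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
# Bad action whose probability halved: the objective is capped at -(1 - eps).
down = ppo_clip_objective([math.log(0.5)], [0.0], [-1.0])
```

In both cases the ratio has already moved past the clip boundary in the direction the advantage favors, so the objective is flat there and the gradient gives no incentive to push further.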
One of GCPN's claims is that it generates chemically valid molecules. How is this guaranteed?
In chemistry, each atom type has a maximum valence — the number of bonds it can form. Carbon: 4. Nitrogen: 3. Oxygen: 2. Fluorine: 1. Before the policy selects an action, GCPN computes which actions would violate these rules and masks them out (sets their logit to −∞ before softmax). This is a hard constraint: the policy literally cannot select an invalid action.
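The masking step described above can be sketched as follows. The valence table covers only the four elements named in the text, formal charges and aromatic bonds are ignored, and the helper names are hypothetical. The sketch also assumes at least one action is valid, since otherwise the softmax is undefined.

```python
import math

MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}
BOND_ORDER = {"single": 1, "double": 2, "triple": 3}

def used_valence(atoms, bonds):
    """Sum of bond orders currently attached to each atom."""
    used = [0] * len(atoms)
    for (i, j), bond in bonds.items():
        used[i] += BOND_ORDER[bond]
        used[j] += BOND_ORDER[bond]
    return used

def is_valid(atoms, used, action):
    """Check an 'add atom' action (new_atom, existing_index, bond_type)."""
    new_atom, existing, bond = action
    order = BOND_ORDER[bond]
    return (used[existing] + order <= MAX_VALENCE[atoms[existing]]
            and order <= MAX_VALENCE[new_atom])

def masked_softmax(logits, valid):
    """Softmax with invalid actions forced to probability zero."""
    masked = [z if ok else -math.inf for z, ok in zip(logits, valid)]
    m = max(masked)
    exps = [math.exp(z - m) for z in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Partial molecule C=O: carbon has used 2 of 4 valences, oxygen 2 of 2.
atoms = ["C", "O"]
bonds = {(0, 1): "double"}
used = used_valence(atoms, bonds)
actions = [("F", 0, "single"),   # valid: carbon has room for one more bond
           ("F", 1, "single"),   # invalid: oxygen is saturated
           ("O", 0, "double"),   # valid: brings carbon to exactly 4
           ("N", 0, "triple")]   # invalid: carbon would reach 5
logits = [0.3, 2.0, 0.1, 1.5]
valid = [is_valid(atoms, used, a) for a in actions]
probs = masked_softmax(logits, valid)
```

Note that the two invalid actions had the highest raw logits; after masking they receive exactly zero probability regardless of how strongly the network preferred them.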
Chemical validity (correct valence) is necessary but not sufficient for a useful drug. In practice, additional filters such as synthetic accessibility are applied on top of it.
GCPN's property optimization uses penalized logP, which subtracts a ring-counting penalty and the SA score from logP, incorporating synthetic accessibility directly into the reward.
Figure: for each atom in a partial molecule, the available valence determines which bonds can still be added. Green = has free valence (can bond). Red = fully saturated (action masked out).
GCPN is evaluated on two drug-discovery tasks: optimizing QED (drug-likeness) and optimizing penalized logP (lipophilicity adjusted for rings and synthetic accessibility).
QED (Quantitative Estimate of Druglikeness) is a composite score in [0, 1] that combines 8 molecular properties: molecular weight, logP, number of H-bond donors, H-bond acceptors, aromatic rings, rotatable bonds, polar surface area, and number of structural alerts. QED = 1 is perfectly drug-like; QED = 0 is completely non-drug-like. Most approved drugs have QED > 0.6.
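The score is the weighted geometric mean of the eight desirabilities:
$$\text{QED} = \exp\left(\frac{\sum_{i=1}^{8} w_i \ln d_i}{\sum_{i=1}^{8} w_i}\right)$$
where $d_i \in [0,1]$ is a desirability function for property i, derived from the distribution of approved drugs, and $w_i$ is the weight of property i (equal weights give the plain geometric mean).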
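QED is commonly defined as a weighted geometric mean of the per-property desirabilities, and that aggregation step is simple to reproduce. The sketch below takes the eight desirability values as given (computing them from a structure requires fitted desirability curves, which are not shown) and uses equal weights by default; the function name and the example values are made up for illustration.

```python
import math

def qed_from_desirabilities(d, w=None):
    """Weighted geometric mean of desirabilities d_i, each in (0, 1]."""
    if w is None:
        w = [1.0] * len(d)               # equal weights: plain geometric mean
    log_sum = sum(wi * math.log(di) for wi, di in zip(w, d))
    return math.exp(log_sum / sum(w))

balanced = qed_from_desirabilities([0.5] * 8)
# One very poor property drags the score down even if the rest are ideal.
one_bad = qed_from_desirabilities([1.0] * 7 + [0.01])
```

Because the mean is geometric rather than arithmetic, a single very undesirable property pulls the score down much more than it would in a simple average. For real molecules, RDKit ships an implementation in its `rdkit.Chem.QED` module.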
logP (partition coefficient) measures how lipophilic a molecule is — high logP means it dissolves in fat rather than water, which affects how the drug distributes in the body. The penalized version:
$$\text{PlogP}(m) = \log P(m) - \text{SA}(m) - \text{ring\_penalty}(m)$$
where SA is the synthetic accessibility score (1 = easy to make, 10 = very difficult to make), and ring_penalty counts the number of rings with more than 6 atoms (these are usually non-drug-like). This prevents the optimizer from finding high-logP molecules that are unsynthesizable or have bizarre ring systems.
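As arithmetic, the penalized score is just logP minus the two penalties. The sketch below follows the definition given in the text and takes logP, the SA score, and the ring sizes as precomputed inputs; the function name and example numbers are hypothetical. Published implementations differ in details such as normalization and the exact form of the ring penalty.

```python
def penalized_logp(logp, sa_score, ring_sizes):
    """logP minus the SA score minus the count of rings larger than 6 atoms."""
    ring_penalty = sum(1 for size in ring_sizes if size > 6)
    return logp - sa_score - ring_penalty

# Easy-to-make molecule with a benzene ring and a five-membered ring.
a = penalized_logp(logp=3.0, sa_score=2.5, ring_sizes=[6, 5])
# Higher raw logP, but harder to make and with an eight-membered ring.
b = penalized_logp(logp=5.0, sa_score=4.0, ring_sizes=[6, 8])
```

Here `a` is 0.5 and `b` is 0.0: the second molecule has the higher raw logP but scores lower once the penalties are applied.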
Interactive demo: QED and penalized logP as functions of adjustable molecular properties, illustrating the landscape GCPN navigates during RL training.
Junction Tree VAE (JT-VAE): Encodes the molecule into a latent space, then does Bayesian optimization over the latent space to find high-property molecules. Slow (requires many forward/backward passes per query), but produces valid molecules.
ORGAN: RNN-based generator with RL objectives and adversarial training. Generates SMILES strings sequentially. Not graph-based — must parse SMILES for validity, which can fail.
REINVENT: RNN over SMILES with REINFORCE. The standard pharmaceutical industry baseline.
| Method | 1st-best QED | 2nd-best QED | 3rd-best QED | Novelty % |
|---|---|---|---|---|
| ORGAN | 0.896 | 0.888 | 0.876 | 100% |
| JT-VAE | 0.925 | 0.911 | 0.896 | 100% |
| GCPN | 0.948 | 0.947 | 0.946 | 100% |
| Method | 1st-best PlogP | 2nd-best PlogP | 3rd-best PlogP |
|---|---|---|---|
| JT-VAE (BO) | 5.30 | 4.93 | 4.49 |
| GCPN | 7.98 | 7.85 | 7.80 |
An additional experiment tests constrained optimization: modify molecules from the test set to improve their property while keeping the Tanimoto similarity between the original and the modified molecule at or above a threshold δ (so the generated molecule must stay "close" to the input molecule). GCPN achieves a 100% success rate vs. JT-VAE's 46.4% at δ = 0.6. Scaffold-constrained generation is where graph-based RL really shines.
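The closeness criterion can be illustrated with fingerprint bit sets. This sketch assumes δ is a lower bound on the Tanimoto similarity between the input and output molecules, and it represents each fingerprint as a set of "on" bit indices; the bit sets, scores, and the `is_success` helper are made up for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity: shared bits divided by bits set in either."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def is_success(score_before, score_after, similarity, delta):
    """Success: the property improved and the molecule stayed close enough."""
    return score_after > score_before and similarity >= delta

original = {1, 4, 7, 9, 12, 15, 18, 21}
modified = {1, 4, 7, 9, 12, 15, 18, 30, 33}
sim = tanimoto(original, modified)     # 7 shared bits out of 10 in total
ok = is_success(score_before=-2.0, score_after=1.5, similarity=sim, delta=0.6)
```

Raising δ shrinks the set of allowed modifications, which is why success rates for latent-space methods tend to fall as the threshold tightens. With real molecules, fingerprints would typically come from a cheminformatics toolkit such as RDKit.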
3D structure ignored: GCPN works on 2D molecular graphs — atoms and bonds — with no 3D geometry. Drug activity often depends critically on 3D shape (stereochemistry, 3D pharmacophore matching). More recent methods (CVAE, TargetDiff) include 3D coordinates.
Property oracle assumed accessible: GCPN assumes you can evaluate the property score for any molecule during training. For binding affinity against a protein, this might require expensive docking simulations or even wet-lab experiments. Sample efficiency matters.
Long-range structure: GCNs have limited depth (3 layers = 3-hop radius). Long-range interactions in molecules (conformational effects, allosteric binding) may not be captured.
| Paper | Approach | Validity | Optimization |
|---|---|---|---|
| GraphRNN (2018) | Sequential AR, two RNNs | Statistical | None (distribution match) |
| GCPN (2018) | RL + GCN, atom-by-atom | Hard (valence mask) | RL (PPO) |
| JT-VAE (2018) | Hierarchical VAE, substructures | Hard (valid fragments) | Bayesian optimization |
| MolDQN (2019) | DQN, atom-by-atom | Hard (valence mask) | RL (Q-learning) |
| TargetDiff (2023) | 3D diffusion, structure-based | Statistical | Structure-conditioned |
"Drug discovery is a combinatorial optimization problem. Reinforcement learning is a framework for combinatorial optimization. The match is too natural to ignore."
— Paraphrase of the GCPN design rationale