GCPN — Veanors

Chapter 0: The Problem

Drug discovery is about finding molecules with specific properties: high bioactivity against a target protein, drug-like behavior in the body (the molecule should be absorbable, stable, non-toxic), and good binding affinity. The chemical space of possible drug-like molecules is estimated at 10⁶⁰ — exploring it by experiment alone would take the lifetime of the universe.

This is the goal-directed molecular generation problem: design a model that generates novel molecules specifically optimized for desired properties. "Novel" means not just memorized from training data. "Optimized" means the model actively steers toward high-property molecules.

Why pure generative models aren't enough: A VAE or GraphRNN learns the distribution of known molecules — it generates things that look like training data. But drugs require optimization, not just plausible generation. You need a model that can explore regions of chemical space that are rare or absent in training data, guided by a property oracle.

The Two Challenges

Validity: A randomly assembled sequence of atoms and bonds is almost certainly not a valid molecule. Most combinations violate valence rules (carbon can't have 5 bonds), ring closure constraints, or basic chemistry. Any generation process must respect these constraints.

Optimization: Even among valid molecules, the space is enormous. You need a way to optimize over this discrete, structured space where you can't take gradients directly (the oracle — a chemical property function like QED or DRD2 binding score — is not differentiable).

GCPN's answer: frame molecular construction as a Markov Decision Process (MDP). The state is the current partially-built molecule. The action is which atom or bond to add next. The reward is the property score at the end. Train a Graph Convolutional Policy Network with Proximal Policy Optimization (PPO) to maximize expected reward — with validity constraints baked in as action masks.

Molecular Generation as Sequential Decisions

Molecules are built step by step. Each step: select an atom type and a scaffold attachment point. Chemical rules constrain which actions are valid. The final molecule receives a property score as reward.

Start building a molecule one atom at a time.

What makes goal-directed molecular generation different from standard generative models like GraphRNN?

Goal-directed models generate larger molecules Goal-directed models use different graph architectures Goal-directed models actively optimize toward specific molecular properties, not just model the training distribution

Chapter 1: Action Space

The MDP formulation requires a precise definition of what "actions" the policy can take when building a molecule. GCPN defines two types of actions.

Action Type 1: Add Atom

Select an atom type from a vocabulary (C, N, O, F, S, Cl, Br, ...) and attach it to an existing atom in the molecule. This is the "growth" action: the molecule gains one atom.

Formally: action = (a_new, a_existing, b) where a_new is the new atom type, a_existing is which existing atom to connect to, and b is the bond type (single, double, triple, aromatic).

Action Type 2: Connect (Add Bond)

Add a bond between two existing atoms in the molecule. This is the "ring closure" action: it creates cycles without adding new atoms. Ring closures are crucial for creating the ring systems common in drugs (benzene rings, piperidine, etc.).

Formally: action = (a₁, a₂, b) where a₁ and a₂ are existing atom indices and b is the bond type.

The action space size: For a molecule with N atoms, there are roughly |atom_vocab| × N × |bond_types| + N² × |bond_types| possible actions. As the molecule grows, the action space grows too. But at any given step, most actions are chemically invalid (violate valence). The policy must learn to select valid actions — or the system masks out invalid ones.

A Scaffold as the Starting State

GCPN can start from a molecular scaffold — a known pharmacophore or fragment — rather than a single atom. This is useful in real drug discovery: medicinal chemists often want to optimize around a known core structure. The RL policy then learns which atoms to attach to the scaffold to maximize the target property.

State s_t

Current partial molecule G_t — a graph with atom features and bond types

↓ GCN policy π(a|s)

Action a_t

Either (add atom: type, attach to, bond) or (add bond: atom1, atom2, bond)

↓ apply action + validity check

State s_t+1

Updated molecule G_t+1 with new atom or bond

↓ terminal? compute reward

Reward r_T

Property score (QED, logP, binding affinity) + step penalty + validity bonus

What is the "connect" action in GCPN's action space?

Connecting the completed molecule to a protein target for docking Connecting two separate molecular fragments into one Adding a bond between two existing atoms in the current partial molecule (ring closure)

Chapter 2: GCN Policy Network (Showcase)

The policy network must take the current molecule graph as input and output a probability distribution over actions. A flat feedforward network can't do this — the input size changes as the molecule grows, and the network must be permutation-equivariant (the output shouldn't depend on the arbitrary numbering of atoms).

The answer: a Graph Convolutional Network (GCN). It takes node features (atom types) and edge features (bond types) and computes a representation for each atom via neighborhood aggregation. These representations are then used to score potential actions.

Input Representation

Each atom i has a feature vector x_i encoding: atom type (one-hot over C, N, O, F, S, Cl, Br, H), degree, formal charge, number of radical electrons, whether it's in a ring, whether it's aromatic. Concatenated to a ~39-dimensional vector.

Each bond is represented by bond type (single=0, double=1, triple=2, aromatic=3) encoded as a relation type in the GCN layers.

GCN Forward Pass

h^(l+1)_i = ReLU( W^(l) · AGGREGATE_j∈N(i)( h^(l)_j ) + b^(l) )

After L=3 GCN layers, each atom has a 64-dimensional embedding. A global pooling (mean) gives a molecule-level embedding for the value function.

From Embeddings to Actions

To score the action "connect atom i to atom j with bond type b": concatenate embeddings h_i and h_j, pass through an MLP, apply softmax. To score "add new atom of type t to atom i": embed t as a learnable vector, concatenate with h_i, MLP, softmax.

GCN Policy — Visualizing Atom Embeddings

A partial benzene ring. Each atom has a GCN embedding (color = embedding magnitude). The policy scores each possible next action — highlighted in orange. Watch embeddings update as you add atoms.

Building benzene (6 carbons in a ring). Add atoms, then close the ring.

Computational Cost

The GCN runs once per action selection step. For a molecule with N atoms and K GCN layers with hidden size H, this costs O(N × H² × K) per step. For drug-like molecules (N typically 20–40 atoms), this is fast — milliseconds per forward pass. The bottleneck is the property oracle (DFT calculations, docking) not the policy network.

Why is a GCN (rather than a feedforward MLP) used as the policy network in GCPN?

GCNs are faster than MLPs for all input types The input molecule graph changes size as atoms are added, and GCNs are permutation-equivariant — they handle variable-size graphs regardless of atom numbering GCNs can directly enforce chemical valence rules

Chapter 3: RL Training with PPO

GCPN is trained with Proximal Policy Optimization (PPO) — a policy gradient method that prevents the policy from updating too aggressively in a single step. Combined with adversarial training (a GAN discriminator for distribution matching), GCPN optimizes both property scores and chemical realism.

The Reward Signal

Rewards are designed to balance three objectives simultaneously:

r = α_prop · r_property + α_valid · r_validity + α_adv · r_adversarial

r_property: The actual drug-likeness score (QED for drug-likeness, penalized logP for lipophilicity optimization, or DRD2 binding score from a pretrained classifier). This is the main optimization target. It's only observed at the terminal state (complete molecule).

r_validity: A small reward (+1) for each step that results in a chemically valid intermediate molecule. This dense signal helps the policy learn fast — rather than only getting feedback at the end of a 30-step trajectory.

r_adversarial: A GAN-style discriminator distinguishes generated molecules from training set molecules. This reward encourages the policy to generate molecules that look like real drugs — preventing it from finding high-property but chemically weird structures that game the oracle.

The discriminator trick: Without the adversarial reward, PPO might find molecules that score high on QED or logP but are structurally bizarre — unlikely to actually work as drugs. The discriminator acts as a learned prior: "does this molecule look like something a medicinal chemist would make?" This prevents reward hacking.

PPO: Why Not Vanilla Policy Gradient?

Vanilla REINFORCE has high variance and can make catastrophically large policy updates that push the agent into a bad regime it can't recover from. PPO clips the probability ratio r(θ) = π_θ(a|s) / π_θold(a|s) to [1−ε, 1+ε], preventing any single update from moving the policy too far.

L^PPO(θ) = E_t [ min( r_t(θ) A_t, clip(r_t(θ), 1−ε, 1+ε) A_t ) ]

where A_t is the advantage: how much better the action at step t was compared to what the value function expected. The value function (a separate MLP head on top of the GCN) is trained simultaneously to predict expected cumulative reward.

What is the purpose of the adversarial reward component r_adversarial in GCPN's training?

To speed up PPO convergence by providing gradient information To prevent the policy from finding high-property but structurally unrealistic molecules that game the property oracle To ensure the generated molecules match the size distribution of training data

Chapter 4: Chemical Validity

One of GCPN's claims is that it generates chemically valid molecules. How is this guaranteed?

Valence Rules as Action Masks

In chemistry, each atom type has a maximum valence — the number of bonds it can form. Carbon: 4. Nitrogen: 3. Oxygen: 2. Fluorine: 1. Before the policy selects an action, GCPN computes which actions would violate these rules and masks them out (sets their logit to −∞ before softmax). This is a hard constraint: the policy literally cannot select an invalid action.

Validity rates: GCPN reports 100% chemical validity on generated molecules. This is because of the valence masking — every action selected is guaranteed not to create an overvalent atom. The final molecule might still be "unusual" (rare ring system, unusual functional group), but it will always be a valid graph with correct valence.

What "Valid" Doesn't Mean

Chemical validity (correct valence) is necessary but not sufficient for a useful drug. Additional filters in practice include:

Lipinski's Rule of Five: MW ≤ 500 Da, HBD ≤ 5, HBA ≤ 10, logP ≤ 5
PAINS filters: Alert patterns that cause false positives in assays (reactive groups, fluorescent compounds)
Synthetic accessibility: SA score estimates how hard a molecule is to synthesize (1=easy, 10=impossible)

GCPN's property optimization uses penalized logP which subtracts a ring-counting penalty and SA score from logP — incorporating synthetic accessibility directly into the reward.

Valence Masking — Which Actions Are Legal?

For each atom in a partial molecule, the available valence determines which bonds can still be added. Green = has free valence (can bond). Red = fully saturated (action masked out).

Green atoms have free valence and can accept new bonds. Red atoms are saturated.

How does GCPN guarantee that generated molecules are chemically valid (correct valence)?

By penalizing invalid molecules with negative reward during training By post-processing generated molecules to fix valence violations By masking out actions that would violate valence rules before the softmax — the policy literally cannot select invalid actions

Chapter 5: Property Optimization

GCPN is evaluated on two drug-discovery tasks: optimizing QED (drug-likeness) and optimizing penalized logP (lipophilicity adjusted for rings and synthetic accessibility).

QED: Drug-Likeness

QED (Quantitative Estimate of Druglikeness) is a composite score in [0, 1] that combines 8 molecular properties: molecular weight, logP, number of H-bond donors, H-bond acceptors, aromatic rings, rotatable bonds, polar surface area, and number of structural alerts. QED = 1 is perfectly drug-like; QED = 0 is completely non-drug-like. Most approved drugs have QED > 0.6.

QED = exp( (1/8) ∑_i=1⁸ log(d_i) )

where d_i ∈ [0,1] is a desirability function for property i, derived from the distribution of approved drugs.

Penalized logP

logP (partition coefficient) measures how lipophilic a molecule is — high logP means it dissolves in fat rather than water, which affects how the drug distributes in the body. The penalized version:

Penalized logP = logP − SA − ring_penalty

SA is the synthetic accessibility score (1=easy to make, 10=impossible), and ring_penalty counts number of rings with >6 atoms (these are usually non-drug-like). This prevents the optimizer from finding high-logP molecules that are unsynthesizable or have bizarre ring systems.

Property Score Explorer

Adjust molecular properties and see how QED and penalized logP respond. This simulates the landscape GCPN is navigating during RL training.

Mol. weight 300

logP 2.0

SA score 3.0

Why is the SA (synthetic accessibility) score subtracted in the penalized logP objective?

Because higher SA scores indicate the molecule is already known to be a drug Because SA score correlates with molecular weight, which should be penalized To prevent the optimizer from finding high-logP molecules that are extremely hard or impossible to synthesize in a lab

Chapter 6: Results

Baselines

Junction Tree VAE (JT-VAE): Encodes the molecule into a latent space, then does Bayesian optimization over the latent space to find high-property molecules. Slow (requires many forward/backward passes per query), but produces valid molecules.

ORGAN: RNN-based generator with RL objectives and adversarial training. Generates SMILES strings sequentially. Not graph-based — must parse SMILES for validity, which can fail.

REINVENT: RNN over SMILES with REINFORCE. The standard pharmaceutical industry baseline.

QED Optimization Results

Method	Top-1 QED	Top-3 QED	Top-10 QED	Novelty %
ORGAN	0.896	0.888	0.876	100%
JT-VAE	0.925	0.911	0.896	100%
GCPN	0.948	0.947	0.946	100%

Penalized logP Optimization Results

Method	Top-1 PlogP	Top-3 PlogP	Top-10 PlogP
JT-VAE (BO)	5.30	4.93	4.49
GCPN	7.98	7.85	7.80

The key takeaway: GCPN significantly outperforms JT-VAE+BO on penalized logP (7.98 vs 5.30 top-1). This is because RL directly optimizes the property, while Bayesian optimization over the latent space is limited by the quality of the VAE's latent structure. For QED, both are near ceiling — GCPN is marginally better.

Constrained Optimization

An additional experiment: modify molecules from the test set to improve their property by at most δ = 0.6 Tanimoto distance (so the generated molecule must be "close" to the input molecule). GCPN achieves 100% success rate vs. JT-VAE's 46.4% at δ = 0.6. Scaffold-constrained generation is where graph-based RL really shines.

On which property does GCPN show the largest improvement over JT-VAE?

QED — GCPN achieves 0.948 vs JT-VAE's 0.925 Penalized logP — GCPN achieves 7.98 vs JT-VAE's 5.30 (50% higher) Novelty — GCPN generates more novel molecules

Chapter 7: Connections & Beyond

Limitations

3D structure ignored: GCPN works on 2D molecular graphs — atoms and bonds — with no 3D geometry. Drug activity often depends critically on 3D shape (stereochemistry, 3D pharmacophore matching). More recent methods (CVAE, TargetDiff) include 3D coordinates.

Property oracle assumed accessible: GCPN assumes you can evaluate the property score for any molecule during training. For binding affinity against a protein, this might require expensive docking simulations or even wet-lab experiments. Sample efficiency matters.

Long-range structure: GCNs have limited depth (3 layers = 3-hop radius). Long-range interactions in molecules (conformational effects, allosteric binding) may not be captured.

The Molecular Generation Landscape

Paper	Approach	Validity	Optimization
GraphRNN (2018)	Sequential AR, two RNNs	Statistical	None (distribution match)
GCPN (2018)	RL + GCN, atom-by-atom	Hard (valence mask)	RL (PPO)
JT-VAE (2018)	Hierarchical VAE, substructures	Hard (valid fragments)	Bayesian optimization
MolDQN (2019)	DQN, atom-by-atom	Hard (valence mask)	RL (Q-learning)
TargetDiff (2023)	3D diffusion, structure-based	Statistical	Structure-conditioned

The RL formulation is general: The idea of using RL to guide graph construction isn't limited to molecules. The same approach has been applied to knowledge graph completion, program synthesis, and network design. GCPN's main contribution is showing that a GCN-based policy can efficiently navigate the combinatorial action space of molecular graph construction.

Related Lessons

GraphRNN — the pure generative (no RL) counterpart for graph generation
JT-VAE — hierarchical VAE approach to the same molecular generation problem
GCN — the graph convolutional layer GCPN uses as its backbone
PPO — the RL algorithm used to train the policy

"Drug discovery is a combinatorial optimization problem. Reinforcement learning is a framework for combinatorial optimization. The match is too natural to ignore."
— Paraphrase of the GCPN design rationale

GCPN: Graph Convolutional Policy Network for Molecular Generation