JT-VAE — Veanors

Chapter 0: The Problem

Picture a molecular generation model that works on SMILES strings — textual representations like "C1CCCCC1" for cyclohexane. You train an RNN or VAE to generate these strings token by token. What's the problem?

Generating "C1CCCCC" and then forgetting to close the ring (no final "1") produces an invalid SMILES. Generating "C(C)(C)(C)(C)C" gives carbon with 5 bonds — impossible. For models operating on SMILES strings, the vast majority of generated strings are chemically invalid. CharVAE, a SMILES-based VAE, has only ~35% validity rate on decoded samples.

The core problem: Molecular validity is a global constraint. Whether a molecule is valid depends on the entire structure simultaneously — you can't check validity locally while building token by token. By the time you know the ring was never closed, you've already committed to the preceding 10 tokens.

JT-VAE's Answer: Build From Valid Pieces

Jin et al.'s insight: instead of generating atoms one at a time (which can create invalid combinations), generate molecules from a vocabulary of known-valid substructures. Every piece in the vocabulary is a valid chemical fragment — a benzene ring, a piperidine ring, a simple chain of 2-3 atoms. Assembling valid pieces can only produce valid molecules.

This is how experienced medicinal chemists think: not "let me add atoms," but "let me combine a pyridine ring with a morpholine and a short alkyl chain." The junction tree VAE formalizes this hierarchical thinking.

Atom-by-atom generation
Works like: add C, add N, add C, close ring...
Easy to create invalid intermediate states. Global validity hard to enforce. ~35-65% validity.

Substructure-by-substructure (JT-VAE)
Works like: place benzene, attach piperidine, add methyl chain...
Each piece is valid. Assembly rules ensure the result is valid. 100% validity.

Valid vs Invalid: The Token-by-Token Problem

Building a molecule atom by atom. Green = current partial molecule is valid. Red = already invalid. Notice how easily a small mistake makes the whole molecule invalid.

Build a molecule step by step and see how quickly invalidity can arise.

Why do SMILES-based VAEs produce many invalid molecules?

SMILES strings are too long for VAEs to memorize The VAE latent space doesn't capture chemical properties Molecular validity is a global constraint — local token decisions can violate rules whose consequences only become visible many steps later (like unclosed rings)

Chapter 1: Tree Decomposition

The key operation that converts a molecular graph into a tree is called junction tree decomposition. It's borrowed from graph theory but applied to chemistry.

Identifying the Substructures

Any molecule can be broken into a set of overlapping subgraphs called clusters. JT-VAE defines clusters as:

Simple bonds: An edge between two atoms that don't form a ring. Cluster = the two atoms + connecting bond.
Rings: Any ring in the molecule (detected via SSSR — Smallest Set of Smallest Rings). Cluster = all atoms and bonds forming the ring.

Every atom appears in at least one cluster. Every bond appears in exactly one cluster (either as a simple bond or as part of a ring).

Building the Junction Tree

Now create a tree where each node represents one cluster. Two cluster-nodes in the tree are connected if the corresponding clusters share at least one atom. Because molecular graphs are close to trees themselves (most drug-like molecules have at most a few rings), this "cluster graph" is usually already a tree or close to one. If there are cycles in the cluster graph, they are contracted using a maximum spanning tree.

Why a tree and not a graph? Trees have the elegant property that there is exactly one path between any two nodes. This means you can process them with simple bottom-up then top-down message passing — no cycles to worry about. Trees are also much easier to generate sequentially: start at the root, expand children one at a time in any order.

The Cluster Vocabulary

Over the ZINC dataset (250,000 drug-like molecules), the vocabulary of unique clusters has only ~780 distinct types. This is the key compression: from an enormous atom-level space to a compact library of meaningful building blocks. At generation time, the decoder only needs to select from these 780 vocabulary items, not from the unbounded space of atoms + bonds.

Molecule → Junction Tree

A schematic showing how a molecule (atoms + bonds) is converted to a junction tree (cluster nodes + connections). Click a cluster node to highlight its atoms in the molecule.

In JT-VAE's junction tree decomposition, what does each node in the junction tree represent?

A single atom in the molecule A single bond between two atoms A cluster — either a simple bond (two atoms) or a complete ring (all atoms in the ring)

Chapter 2: The Junction Tree (Showcase)

Now we can see the full two-level representation that JT-VAE maintains for every molecule: the atom-level graph (the actual molecule) and the cluster-level tree (the junction tree).

Interactive Dual Representation

Left: the molecule graph (atoms = nodes, bonds = edges). Right: the junction tree (each tree node = a cluster). Click a tree node to highlight the corresponding atoms in the molecule. Build the tree step by step using "Add Cluster."

Click "Add Cluster" to grow the junction tree and corresponding molecule simultaneously.

Two Representations, One Molecule

For encoding, both representations are processed simultaneously — the atom graph and the junction tree are both read to produce the latent vector z. For decoding, the junction tree is generated first (which clusters to use and how to connect them), then the atom-level graph is assembled from the specified clusters (which atoms within each cluster to share with neighboring clusters).

The generation order: Tree first, then graph. This is the key insight. The tree is much easier to generate — it's a tree, not an arbitrary graph, and the vocabulary is only ~780 items. Once the tree is fixed, filling in the atom-level detail is a constrained local problem: for each edge in the tree (connecting cluster C_i to C_j), you must choose which atom(s) are shared between the two clusters. This choice is local and bounded.

Information Flow in the Junction Tree

Atoms + Bonds

Raw molecular graph — atom features, bond types, full connectivity

↓ SSSR ring detection + bond classification

Clusters

Each atom/ring assigned to a cluster; clusters get vocabulary labels

↓ Connect clusters sharing atoms

Cluster Graph

Nodes = clusters, edges = shared atoms. Usually already a tree.

↓ Maximum spanning tree (if cycles exist)

Junction Tree

Final tree structure. Root = any cluster. Processed bottom-up then top-down.

In JT-VAE, which representation is generated first during decoding?

The atom-level graph — then the junction tree is inferred from it The junction tree — then the atom-level graph is assembled from the clusters specified by the tree Both are generated simultaneously in a single forward pass

Chapter 3: Encoding

The encoder takes a molecule and produces a latent vector z. It processes both the atom-level graph and the junction tree, then combines them.

Graph-Level Encoder (Atom-Level)

A standard message-passing GNN runs over the molecular graph for T=3 steps. At each step, atoms aggregate messages from neighbors. The final atom representations are mean-pooled into a single graph embedding h_G.

m^(t)_uv = τ( W₁ x_v + W₂ x_u + W₃ e_uv ) h^(T)_v = U_v( x_v, {m^(T)_uv : u ∈ N(v)} ) h_G = mean_v( h^(T)_v )

Tree-Level Encoder (Cluster-Level)

A separate message-passing procedure runs over the junction tree. This is a two-phase process: bottom-up messages from leaves to root, then top-down messages from root to leaves. Each cluster node i gets a representation s_i. The root cluster's final state becomes the tree embedding h_T.

m_ij = GRU( x_i, {m_ki : k ∈ ch(i,j)} ) ← bottom-up s_i = GRU( x_i, {m_ji : j ∈ N(i)} ) ← top-down (all neighbors) h_T = s_root

Combining into Latent Vector

The final latent vector z is sampled from a Gaussian whose mean and variance are parameterized by concatenating both embeddings:

z ~ N( μ([h_G; h_T]), diag(σ²([h_G; h_T])) )

where μ(·) and σ²(·) are linear projections. This z is a continuous, ~56-dimensional latent representation of the molecule — smooth enough for Bayesian optimization or latent space interpolation.

Why encode both? The atom-level encoder captures fine-grained chemistry (atom environments, local bond patterns). The tree-level encoder captures coarse scaffold topology (which rings are present, how they're connected). Together, they give z a rich representation of the molecule at multiple scales.

What two sources of information are combined to produce the latent vector z in JT-VAE's encoder?

The SMILES string and the Morgan fingerprint The 3D coordinates and the 2D topology The atom-level graph embedding h_G (from GNN on molecule) and the tree-level embedding h_T (from GRU on junction tree)

Chapter 4: Decoding

Decoding is the hard part. Given z, we need to reconstruct a molecule. JT-VAE decodes in two phases: first the tree, then the graph.

Phase 1: Tree Decoding

A tree-structured RNN decodes the junction tree from z. Starting from the root, it predicts: (1) the cluster label (which of the ~780 vocabulary substructures is at this tree node?), and (2) whether to grow more children (is this cluster connected to more clusters?). It recurses depth-first until it generates a stop signal at every leaf.

Each prediction is a softmax over the vocabulary (for labels) or a binary decision (for topology). The decoder is conditioned on z throughout — z is concatenated to the GRU hidden state at every step.

h_t = f_tree(z, h_parent) p_label = softmax(W_L h_t) ← which cluster? (780-way) p_stop = σ(W_S h_t) ← stop growing? (binary)

Phase 2: Graph Assembly

Now the tree is fixed — we know which clusters are present and how they connect. The graph assembly step fills in the "attachment chemistry": for each edge in the junction tree (connecting cluster C_i to cluster C_j), we must decide which atoms in C_i bond to which atoms in C_j.

This is done greedily: enumerate all valid attachment configurations between C_i and C_j, score each with a GNN that reads the current partial graph + z, and pick the highest-scoring configuration. Because the number of attachment options per edge is small (bounded by cluster size), this is tractable.

Validity through construction: Phase 1 produces a tree of valid cluster types. Phase 2 attaches them using valid chemistry (each attachment shares an atom). The result is always a valid molecule — there are no post-hoc validity checks needed, because invalidity was structurally impossible from the start.

python
# Simplified JT-VAE decoding pseudocode
def decode(z, tree_decoder, graph_decoder, vocab):
    # Phase 1: decode the junction tree
    root_cluster = tree_decoder.predict_label(z)
    tree = JunctionTree(root=vocab[root_cluster])

    stack = [tree.root]
    while stack:
        node = stack.pop()
        h = tree_decoder.hidden_state(z, node)
        while tree_decoder.predict_continue(h):
            child_label = tree_decoder.predict_label(h)
            child = tree.add_child(node, vocab[child_label])
            stack.append(child)

    # Phase 2: assemble atom-level graph
    mol = init_from_root(tree.root)
    for edge in tree.edges_bfs():
        Ci, Cj = edge
        configs = enumerate_attachments(Ci, Cj)
        scores = graph_decoder.score_all(mol, configs, z)
        best = configs[scores.argmax()]
        mol = apply_attachment(mol, best)

    return mol  # guaranteed valid!

In Phase 2 (graph assembly), what specific decision does the decoder make for each edge in the junction tree?

Whether to include or exclude the edge (making it optional) What bond type (single/double/triple) to use for the inter-cluster bond Which specific atoms in the two clusters are shared (the attachment configuration) — choosing from a small set of valid options

Chapter 5: The Validity Guarantee

JT-VAE claims 100% validity on generated molecules. Let's understand precisely why this is mathematically guaranteed, not just empirically observed.

Why Validity Follows From the Architecture

Step 1 — Every cluster in the vocabulary is valid. The vocabulary is built by extracting clusters from actual molecules in the training set. Every vocabulary item is a chemically valid fragment (correct valence, valid ring system). This is not learned — it's guaranteed by construction from real chemistry.

Step 2 — Tree decoding only picks from the vocabulary. The tree decoder's label prediction is a softmax over the ~780 vocabulary items. It cannot generate a cluster type that isn't in the vocabulary, so every cluster in the generated tree is valid.

Step 3 — Graph assembly only considers valid attachments. When attaching cluster C_i to cluster C_j, the code enumerates all chemically valid ways to share an atom between the two clusters. The decoder picks one of these pre-validated options — it cannot construct an invalid attachment.

Validity Through Hierarchy

Comparison of validity rates across generation approaches. JT-VAE achieves 100% by guaranteeing validity at each hierarchical level. Hover over each bar to see the mechanism.

The Completeness Question

Does every valid molecule have a junction tree decomposition? Yes — any molecular graph can be decomposed into simple bonds and rings, which are exactly the cluster types JT-VAE uses. The only limitation is the vocabulary: if a ring structure appears in a test molecule but not in any training molecule, it won't be in the vocabulary and can't be generated. In practice, ZINC's ~780 cluster vocabulary covers the vast majority of drug-like structures.

Tradeoff: The validity guarantee comes at a cost — JT-VAE can only generate molecules whose substructures are all in the training vocabulary. Atom-by-atom methods can in principle generate any molecule (but with low validity). JT-VAE is more restricted but more reliable. For drug discovery in well-explored chemical spaces, this tradeoff is generally worth it.

What guarantees that every cluster in a JT-VAE-generated molecule is chemically valid?

The decoder learns to avoid invalid clusters during training via cross-entropy loss A post-processing step checks and fixes each cluster before assembly The vocabulary was built from real molecules, so every vocabulary item is a valid fragment by construction — the decoder can only select from valid options

Chapter 6: Results

Reconstruction and Sampling

On the ZINC dataset (250k molecules), JT-VAE achieves:

Metric	CVAE (SMILES)	GVAE (grammar)	JT-VAE
Reconstruction accuracy	44.6%	53.7%	76.7%
Validity (random sample)	0.7%	7.2%	100.0%
Uniqueness (of valid)	100%	100%	99.9%
Novelty (not in train)	90.0%	61.0%	99.9%

Bayesian Optimization in Latent Space

The continuous, smooth latent space enables Bayesian optimization (BO): fit a Gaussian process to observed (z, property) pairs, use the GP acquisition function to select promising z values, decode them to molecules, evaluate, and iterate. This is a classical optimization approach that works beautifully when z is well-structured.

Method	Penalized logP (top-3)	QED (top-3)
CVAE + BO	−3.01, −3.32, −3.49	0.734, 0.712, 0.706
GVAE + BO	−3.63, −3.91, −3.98	0.752, 0.738, 0.730
JT-VAE + BO	5.30, 4.93, 4.49	0.925, 0.911, 0.896

Dramatic improvement over SMILES-VAE + BO: JT-VAE achieves penalized logP of 5.30 vs CVAE's −3.01. The key factor: JT-VAE's latent space is smooth and well-structured (100% valid decodes), so the Gaussian process can model it accurately. CVAE's latent space is chaotic (99.3% of random samples are invalid), so the GP is fitting noise.

Why does JT-VAE's latent space work so much better for Bayesian optimization than SMILES-VAE's?

JT-VAE uses a larger latent space dimension (more dimensions = more expressive) JT-VAE's decoder produces valid molecules from every latent point, so the property is well-defined everywhere and the Gaussian process can model it accurately. SMILES-VAE mostly decodes to invalid molecules, making the property landscape a sea of noise. JT-VAE uses a different acquisition function in the Bayesian optimization loop

Chapter 7: Connections & Beyond

Limitations

Vocabulary bound: Can only generate molecules whose substructures appear in training vocabulary. Novel ring systems not in ZINC are unreachable.

3D ignored: Like GCPN, JT-VAE works on 2D topology. Stereochemistry (chirality, E/Z isomers) and 3D binding geometry are not modeled.

Slow decoding: The two-phase sequential decoding (tree first, then graph assembly with enumeration) is slower than a single forward pass. Tree RNN is O(N) steps, graph assembly involves per-edge enumeration.

Fixed vocabulary assumption: If you want to generate macrocycles or highly unusual heterocycles, you need a training set that contains them — the vocabulary-based approach constrains the generation space.

The Molecular Generation Hierarchy

Granularity	Method	Validity	Flexibility
Character (token)	CharVAE (SMILES chars)	~35%	Highest
Grammar rule	GVAE (context-free grammar)	~7%	High
Atom	GraphRNN, GCPN, MolGAN	~65–100%	Medium
Substructure	JT-VAE	100%	Vocab-bounded
Fragment (larger)	BRICS + ML, Fragment-VAE	~100%	Lower

The hierarchical generation principle is general: JT-VAE's insight — "generate at the level of valid building blocks, then assemble" — has been applied far beyond molecules. Code generation using valid AST nodes, protein generation using known secondary structure motifs, reaction pathway planning using known reaction types. The lesson: if you know valid primitive units, build a generative model in that space rather than the raw token space.

Related Lessons

GraphRNN — atom-by-atom graph generation without validity constraints
GCPN — RL approach to the same goal-directed molecular generation problem
GCN — the graph convolution used in JT-VAE's graph encoder
VAE/VQVAE — the latent variable model framework underlying JT-VAE

"The key to reliable generation is to work in a space where every point decodes to something valid. For molecules, that means working with known-valid substructures, not raw atoms."
— JT-VAE design philosophy

JT-VAE: Junction Tree Variational Autoencoder