Neural Message Passing for Quantum Chemistry

Chapter 0: The Problem

You are a chemist. You have a new molecule and you want to know its properties — how much energy holds its atoms together, how it distributes its electrons, whether it vibrates at certain frequencies. The gold-standard method is called Density Functional Theory (DFT), a quantum mechanical simulation. For a single molecule with 9 heavy atoms, DFT takes about an hour on a modern CPU. For a larger molecule, it can take eight hours.

Now imagine you have 134,000 molecules. That is over 15 CPU-years of computation.

What if a neural network could learn to approximate the DFT calculation, predicting those same properties in a fraction of a second? That is the promise of this paper: train a neural network on molecules, feed it new molecules it has never seen, and get quantum-accurate predictions 300,000 times faster than DFT.

The challenge: Molecules are not images or sequences. They are graphs — atoms connected by bonds, with no fixed ordering. Any neural network that operates on molecules must be invariant to graph isomorphism: relabeling the atoms should not change the prediction.

By 2017, at least eight different research groups had proposed neural network architectures for learning on graphs. Each paper introduced its own notation, its own framework, its own way of thinking about the problem. From the outside, they looked like eight different ideas.

Gilmer et al. looked at all eight and made a striking observation: they are all doing the same thing. Every single one can be described as nodes passing messages to their neighbors, aggregating those messages, and updating their state. The only differences are in which message function, which update function, and which readout function they use.

The paper calls this shared abstraction Message Passing Neural Networks (MPNNs). It is not a new model — it is a unifying lens. And once you have that lens, you can mix and match components to find combinations that nobody had tried.

Why can't we simply feed a molecule's atoms as a flat list into a standard neural network?

Because the prediction must be invariant to the ordering of atoms — shuffling the list should not change the output, and standard networks are sensitive to input order Because molecules have too many atoms for a fully connected network Because standard neural networks cannot handle numerical features

Chapter 1: Molecules as Graphs

A molecule is a natural graph. Each atom is a node, and each bond is an edge. Take ethanol (the alcohol in your drink): two carbons, one oxygen, six hydrogens, connected by single bonds. That is a graph with nine nodes and eight edges.

Each node carries features — information about the atom it represents:

Feature	Description	Encoding
Atom type	H, C, N, O, or F	One-hot (5 dims)
Atomic number	Number of protons	Integer
Acceptor	Accepts electrons?	Binary
Donor	Donates electrons?	Binary
Aromatic	In an aromatic ring?	Binary
Hybridization	sp, sp2, or sp3	One-hot
Num. Hydrogens	Attached H atoms	Integer

Each edge also carries features. At minimum, the bond type (single, double, triple, or aromatic). When spatial information is available, the edge also stores the Euclidean distance between the two atoms.

Two input settings. The paper considers both (1) the graph-only setting, where only atom types and bond types are known, and (2) the spatial setting, where 3D atom positions are also provided. The spatial setting is much easier — distances tell you a lot about molecular properties. The graph-only setting is harder but more broadly useful, since computing 3D geometry is itself expensive.

The paper uses the QM9 dataset: 134,000 small organic molecules (up to 9 heavy atoms each), with 13 quantum-chemical properties computed by DFT. These properties span four broad categories:

Energetics

Atomization energy (U₀, U, H, G) — how tightly bound the atoms are

↓

Vibrations

ZPVE and highest vibrational frequency ω₁

↓

Electronic

HOMO, LUMO, energy gap Δε

↓

Spatial

Dipole moment μ, polarizability α, electronic extent R²

Success is measured against two benchmarks: DFT error (matching the DFT simulation) and chemical accuracy (a tighter target established by chemists). Achieving chemical accuracy on all 13 targets would mean the neural network is accurate enough to replace DFT for practical purposes.

In the QM9 dataset, what does each "sample" represent?

A single image of a molecule A molecule with atom features, bond features, and 13 quantum-chemical target properties computed by DFT A time-series of molecular dynamics steps

Chapter 2: The MPNN Framework

Here is the core idea of the paper. Every graph neural network the authors surveyed does the same three things. They just implement them differently.

Phase 1: Message Passing (runs for T time steps)

Every node v in the graph has a hidden state h_v^t. At each time step, two things happen:

Step 1: Compute messages. For each node v, gather a message from every neighbor w:

m_v^t+1 = ∑_{w ∈ N(v)} M_t(h_v^t, h_w^t, e_vw)

where M_t is the message function, h_w^t is the neighbor's current state, and e_vw is the edge feature between v and w. The sum aggregates messages from all neighbors.

Step 2: Update states. Each node updates its hidden state using the aggregated message:

h_v^t+1 = U_t(h_v^t, m_v^t+1)

where U_t is the update function. This takes the node's old state and the incoming messages and produces a new state.

Phase 2: Readout

After T steps of message passing, we need a single prediction for the whole molecule. The readout function R takes all the final node states and produces a graph-level output:

ŷ = R({h_v^T | v ∈ G})

The unifying insight: M (message), U (update), R (readout). That is the entire framework. Every graph neural network the authors surveyed — GG-NN, Interaction Networks, Molecular Fingerprints, Deep Tensor, Laplacian methods, and more — is just a specific choice of M, U, and R. The paper shows this explicitly for eight existing models.

Why is this useful? Because once you realize all these models are instances of the same framework, you can mix and match. Take the message function from one model, the readout from another, and see if the combination works better. That is exactly what the paper does.

One critical requirement: the readout function R must be invariant to permutations of the node states. Shuffling the atoms should not change the molecular property. This rules out simple concatenation but allows operations like summation, attention-weighted pooling, and set2set.

What are the three components that define any MPNN?

Encoder, decoder, attention Convolution, pooling, fully-connected Message function M, update function U, and readout function R

Chapter 3: Message Functions

The message function M decides what information a neighbor sends. Different choices lead to different models. Let's walk through three key variants the paper explores.

1. Matrix Multiplication (from Gated Graph Neural Networks)

M(h_v, h_w, e_vw) = A_{e_vw} h_w

There is one learned matrix A for each discrete edge type (single bond, double bond, etc.). The message is just the neighbor's hidden state multiplied by the edge-type-specific matrix. Simple, but it requires edge types to be discrete — you cannot use continuous features like distance.

2. Edge Network (proposed in this paper)

M(h_v, h_w, e_vw) = A(e_vw) h_w

Instead of a fixed matrix per edge type, a small neural network A takes the edge feature vector e_vw and outputs a d × d matrix. This matrix then multiplies the neighbor's state. The edge network can handle continuous edge features like spatial distance, not just discrete bond types.

Why edge networks matter: With matrix multiplication, single bonds and double bonds get different learned matrices, but you cannot smoothly interpolate. With edge networks, the message transformation depends continuously on the edge features. A bond at 1.2 Angstroms and one at 1.3 Angstroms will produce slightly different transformation matrices.

3. Pair Message (from Interaction Networks)

m_wv = f(h_w^t, h_v^t, e_vw)

A neural network f takes the concatenation of both endpoint states plus the edge feature. This is the most expressive option — the message depends on both the sender and the receiver. However, the paper found it performs worse than the edge network in practice (Table 9). The extra expressiveness may make optimization harder.

Message Function	Depends on h_w?	Depends on h_v?	Continuous edges?
Matrix Multiply	Yes	No	No
Edge Network	Yes	No	Yes
Pair Message	Yes	Yes	Yes

The paper's best results use the edge network: expressive enough to handle continuous features, but not so expressive that it becomes hard to train.

What is the key advantage of the edge network message function over the matrix multiply approach?

It uses fewer parameters It makes the message depend on the receiver node It can handle continuous edge features like spatial distance, not just discrete bond types

Chapter 4: Update and Readout

After a node collects messages from its neighbors, it needs to combine them with its own current state to produce a new state. That is the update function U.

The paper uses a Gated Recurrent Unit (GRU) for the update, the same recurrent cell used in sequence models:

h_v^t+1 = GRU(h_v^t, m_v^t+1)

Think of the GRU as a learned gate that decides how much of the old state to keep and how much to overwrite with the new message. This is shared across all time steps (weight tying), so the same update function is applied at step 1, step 2, and so on.

Why a GRU? The message passing process runs for T steps. At each step, a node refines its state based on new messages. This is analogous to a recurrent process — the same operation applied repeatedly, with the state accumulating information over time. A GRU handles this gracefully because its gating mechanism prevents information from being washed away over many steps.

The initial hidden state h_v⁰ is simply the atom's feature vector x_v, padded with zeros up to dimension d.

Now for the readout function R. After T message passing steps, each node has a final state h_v^T that encodes information about its local neighborhood (and, through multiple hops, the broader molecular context). We need to collapse all node states into a single graph-level prediction.

The paper compares two readout approaches:

GG-NN Readout

R = ∑_v σ(i(h_v^T, h_v⁰)) ⊙ j(h_v^T)
Two neural networks i and j, element-wise gating, then sum

Set2Set Readout

An attention-based pooling mechanism that iterates M times over the node set, producing a permutation-invariant embedding

The set2set model (from Vinyals et al., 2015) is more powerful. Instead of a single pass over the nodes, it runs M steps of content-based attention, repeatedly querying the node states. At each step, it attends to different parts of the molecule, building up a richer graph-level representation. Think of it as reading the molecule multiple times, each time focusing on different atoms.

Set2Set is key. Switching from GG-NN readout to set2set improved results on nearly every target (Table 7 in the paper). Simple summation throws away information about which atoms contributed what. Set2set preserves more of the structural information by attending to the node set multiple times.

Why must the readout function R be invariant to the ordering of the nodes?

Because labeling the atoms 1,2,3 vs 3,1,2 should not change the predicted molecular property — the molecule is the same regardless of how we number its atoms Because it reduces the number of parameters Because node ordering is determined by atomic number

Chapter 5: Edge Networks

Let's look more carefully at the edge network, because it is the paper's strongest contribution beyond the framework itself.

In the matrix multiplication approach (GG-NN), each discrete edge type gets its own learned matrix. A single bond uses matrix A₁, a double bond uses A₂, and so on. But what if your edge features are continuous? What if you know the atoms are 1.54 Angstroms apart?

You could discretize: bin the distances into 10 buckets and treat each bucket as a separate edge type. The paper does this and calls it "distance bins." But binning is lossy. Two bonds at 1.51 and 1.59 Angstroms end up in the same bucket.

The edge network replaces the lookup table with a neural network:

A(e_vw) = NN(e_vw) ∈ R^{d × d}

The input e_vw is a 5-dimensional vector: the Euclidean distance plus a one-hot encoding of the bond type. The neural network outputs a d × d matrix. This matrix is then multiplied by the neighbor's hidden state to produce the message:

m_w→v = A(e_vw) · h_w^t

The edge network as a learned interaction law. In physics, the force between two particles depends on their distance. Different interaction laws (gravity, electrostatics, van der Waals) have different functional forms. The edge network learns a transformation that depends on distance and bond type — effectively learning an interaction law from data.

How big is this neural network? It takes a 5-dimensional input and outputs d² values (which are reshaped into a d × d matrix). For d = 200, that is 40,000 output values. The network is small (one or two hidden layers), but it is applied to every edge at every message passing step.

The results are clear. Comparing the edge network (enn) with the pair message function (pm) on joint training across all 13 targets, the edge network wins on nearly every target (Table 9). The average MAE across targets: edge network 1.53 vs pair message 3.98. The edge network is not just slightly better — it is dramatically better.

Input Features	Message Function	Avg. MAE (ratio to chem. accuracy)
Graph only	Matrix Multiply	2.57
+ Distance	Edge Network	0.98
+ Distance + explicit H	Edge Network	0.68

Adding distance information and using the edge network together cuts the error by more than 3x. Adding explicit hydrogen atoms as nodes (instead of encoding them as a feature) further reduces error. More nodes means a bigger graph and slower training (roughly 10x), but the accuracy gains are substantial.

What does the edge network output?

A d × d matrix that transforms the neighbor's hidden state into a message, where the matrix depends continuously on the edge features (distance + bond type) A scalar weight for attention A fixed embedding for each bond type

Chapter 6: Virtual Elements & Towers

Two practical problems emerge when you try to scale MPNNs. First: information travels slowly. After T steps of message passing, a node can only "see" atoms within T hops in the graph. If two atoms are on opposite ends of a molecule and T = 3, they never communicate directly.

Second: computation grows as O(n²d²) for dense graphs, where n is the number of atoms and d is the hidden dimension. For large molecules or large hidden states, this becomes expensive.

The paper proposes two solutions for the communication problem:

Virtual Edges

Add a special edge type between every pair of non-bonded atoms. Now every atom is one hop from every other atom. Information travels globally in a single step.

Master Node

Add a single "master" node connected to every atom with a special edge type. It acts as a global scratch space — every atom reads from and writes to it at every step.

Both approaches let distant atoms communicate without needing many message passing steps. The master node is particularly elegant: its complexity scales as O(|E|d² + nd²_master), so you can give it a large hidden dimension without blowing up the cost of edge-level computation.

For the scalability problem, the paper introduces towers. Instead of running one propagation with a d-dimensional hidden state, split each node's state into k copies of dimension d/k. Run message passing on each copy separately (with separate learned functions), then mix the k copies back together with a neural network:

h_v^t,1, h_v^t,2, …, h_v^t,k = g(h̃_v^t,1, h̃_v^t,2, …, h̃_v^t,k)

Each "tower" does O(n²(d/k)²) work, and there are k towers, so the total is O(n²d²/k). With k = 8, this is an 8x reduction in the dominant cost, with a small overhead for mixing.

Towers are a precursor to multi-head attention. The idea of splitting the representation into independent "heads," processing them separately, and recombining them should sound familiar. Transformers use the same trick in multi-head attention (Vaswani et al., 2017, published the same year). The towers approach foreshadows this pattern in the graph domain.

In experiments (Table 8), towers with k = 8 outperform the vanilla GG-NN on 12 out of 13 targets in both joint and individual training settings.

What problem does the master node solve in the MPNN?

It reduces the number of parameters It replaces the readout function It allows distant atoms to communicate quickly by serving as a global scratch space connected to every atom

Chapter 7: Showcase — Message Passing in Action

Below is a small molecule graph. Click Run to watch messages flow between atoms. At each time step, every node sends a message to its neighbors (colored pulses along edges), aggregates incoming messages, and updates its hidden state.

Use the T slider to control how many message passing steps to run. Use the message function toggle to switch between Matrix Multiply and Edge Network. Watch how each node's hidden state (shown by its color intensity) evolves as it accumulates information from an expanding neighborhood.

T = 3 Ready

What to notice: After T = 1 step, each node knows about its immediate neighbors. After T = 2, it knows about neighbors-of-neighbors. After T = 3, information from three hops away has arrived. With T = 6 on this small molecule, every node effectively "sees" the entire graph. The color intensity of each node shows how much information it has accumulated.

In the Edge Network mode, notice that the message pulses vary in color along different edges — the transformation depends on the edge features (bond type and distance). In Matrix Multiply mode, all single bonds use the same transformation, so messages along edges of the same type look identical.

After T = 2 message passing steps, how far can information travel from a given node?

Two hops — information from a node's neighbor's neighbors reaches it after two steps Only to the immediate neighbors To every node in the graph regardless of distance

Chapter 8: The Experiments

The paper's best model combines: edge network message function, GRU update, set2set readout, explicit hydrogen atoms, and distance features. How does it do?

The headline result: The MPNN achieves state of the art on all 13 QM9 targets and predicts DFT to within chemical accuracy on 11 out of 13 targets. The two exceptions are the dipole moment μ and the electronic spatial extent R².

Some key findings from the ablation studies:

Ablation	Effect
Graph only → + Distance	Dramatic improvement. Distance is the single most important input feature for quantum properties.
GG-NN readout → Set2Set	Consistent improvement across most targets. Attention-based pooling captures more structure than summation.
Sparse graph → + Virtual edges	Helps on most targets by enabling long-range communication.
+ Master node	Similar benefits to virtual edges with better scaling properties.
Vanilla → Towers (k=8)	Better on 12/13 targets with 2x speedup for d = 200.
Edge network → Pair message	Pair message is worse despite being more expressive. Edge network wins.
Implicit H → Explicit H	More atoms means slower (10x), but substantially more accurate.

The training setup: 50 random hyperparameter trials per model-target combination. SGD with Adam, batch size 20, 3 million steps (~540 epochs). Message passing steps T between 3 and 8 (any T ≥ 3 works). Set2set computation steps M between 1 and 12.

The graph-only results (no distance information) are also noteworthy. Even without knowing where atoms are in 3D space, the best graph-only MPNN achieves chemical accuracy on 5 out of 13 targets. This suggests the graph topology alone encodes a surprising amount of chemical information.

On the data efficiency front, Table 6 shows how performance scales with training set size. Going from 11k to 110k training examples roughly halves the error on most targets. The authors argue that future work should focus on datasets with larger molecules or more accurate labels, since the model is beginning to saturate on QM9.

What single input feature most dramatically improves MPNN performance on quantum property prediction?

The number of hydrogen atoms Spatial distance between atom pairs — switching from graph-only to distance-aware cuts average error from 2.57 to 0.98 The hybridization state of each atom

Chapter 9: Connections

The lineage of graph neural networks. Before this paper, the landscape of neural networks on graphs was fragmented. Kipf and Welling had GCN. Li et al. had GG-NN. Duvenaud et al. had neural fingerprints. Battaglia et al. had interaction networks. Each community used different notation, different terminology, different abstractions. The MPNN framework showed these are all the same thing.

Before MPNN (pre-2017)

8+ separate GNN architectures, each with its own formalism. Hard to compare or combine.

↓ unified by

MPNN Framework (2017)

One abstraction: M, U, R. All prior work is a special case. Mix and match components.

↓ influenced

Modern GNN Landscape

GAT, GraphSAGE, GIN, SchNet, DimeNet — all described in message passing terms. The framework became the standard language.

Weisfeiler-Leman and expressiveness. Xu et al. (2019, "How Powerful are Graph Neural Networks?") later proved that MPNNs are at most as powerful as the 1-dimensional Weisfeiler-Leman graph isomorphism test. This means there exist non-isomorphic graphs that no MPNN can distinguish. This theoretical limit has driven research into higher-order message passing (k-WL) and geometric GNNs that use 3D coordinates to break symmetry.

From molecules to proteins. The message passing idea extended naturally to proteins (AlphaFold 2 uses equivariant message passing on residue graphs), materials science (Crystal Graph Neural Networks), and drug discovery (molecular generation and property prediction). The MPNN framework provided the common language for all of this work.

Edge networks and geometric deep learning. The edge network idea — using a neural network to produce transformation matrices from edge features — foreshadows the continuous filters of SchNet (2017) and the directional message passing of DimeNet (2020). The insight that interactions should depend continuously on geometry became central to equivariant neural networks.

The lasting impact. This paper has over 7,000 citations. Its main contribution is not any single architecture but the framework itself. By showing that eight different models are all doing the same thing, it gave the community a common language, made it easy to combine ideas, and accelerated progress on graph learning by years. The MPNN framework is now the default way to think about neural networks on graphs.

Paper details. "Neural Message Passing for Quantum Chemistry," Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, George E. Dahl. ICML 2017. arXiv:1704.01212.

← Back to Veanors Hub

What is the theoretical expressiveness limit of MPNNs, as proven by later work?

MPNNs can distinguish all possible graphs MPNNs are at most as powerful as the 1-dimensional Weisfeiler-Leman test — there exist non-isomorphic graphs that no MPNN can tell apart MPNNs cannot learn any graph-level properties

Neural Message Passingfor Quantum Chemistry