Eight different graph neural networks look completely different on paper. This paper showed they are all doing the same thing — passing messages between atoms. Then it pushed the idea further than anyone had before.
You are a chemist. You have a new molecule and you want to know its properties — how much energy holds its atoms together, how it distributes its electrons, whether it vibrates at certain frequencies. The gold-standard method is called Density Functional Theory (DFT), a quantum mechanical simulation. For a single molecule with 9 heavy atoms, DFT takes about an hour on a modern CPU. For a larger molecule, it can take eight hours.
Now imagine you have 134,000 molecules. That is over 15 CPU-years of computation.
What if a neural network could learn to approximate the DFT calculation, predicting those same properties in a fraction of a second? That is the promise of this paper: train a neural network on molecules, feed it new molecules it has never seen, and get quantum-accurate predictions 300,000 times faster than DFT.
By 2017, at least eight different research groups had proposed neural network architectures for learning on graphs. Each paper introduced its own notation, its own framework, its own way of thinking about the problem. From the outside, they looked like eight different ideas.
Gilmer et al. looked at all eight and made a striking observation: they are all doing the same thing. Every single one can be described as nodes passing messages to their neighbors, aggregating those messages, and updating their state. The only differences are in which message function, which update function, and which readout function they use.
The paper calls this shared abstraction Message Passing Neural Networks (MPNNs). It is not a new model — it is a unifying lens. And once you have that lens, you can mix and match components to find combinations that nobody had tried.
A molecule is a natural graph. Each atom is a node, and each bond is an edge. Take ethanol (the alcohol in your drink): two carbons, one oxygen, six hydrogens, connected by single bonds. That is a graph with nine nodes and eight edges.
Each node carries features — information about the atom it represents:
| Feature | Description | Encoding |
|---|---|---|
| Atom type | H, C, N, O, or F | One-hot (5 dims) |
| Atomic number | Number of protons | Integer |
| Acceptor | Accepts electrons? | Binary |
| Donor | Donates electrons? | Binary |
| Aromatic | In an aromatic ring? | Binary |
| Hybridization | sp, sp2, or sp3 | One-hot |
| Num. Hydrogens | Attached H atoms | Integer |
Each edge also carries features. At minimum, the bond type (single, double, triple, or aromatic). When spatial information is available, the edge also stores the Euclidean distance between the two atoms.
The paper uses the QM9 dataset: 134,000 small organic molecules (up to 9 heavy atoms each), with 13 quantum-chemical properties computed by DFT. These properties span four broad categories:
Success is measured against two benchmarks: DFT error (matching the DFT simulation) and chemical accuracy (a tighter target established by chemists). Achieving chemical accuracy on all 13 targets would mean the neural network is accurate enough to replace DFT for practical purposes.
Here is the core idea of the paper. Every graph neural network the authors surveyed does the same three things. They just implement them differently.
Phase 1: Message Passing (runs for T time steps)
Every node v in the graph has a hidden state hvt. At each time step, two things happen:
Step 1: Compute messages. For each node v, gather a message from every neighbor w:
where Mt is the message function, hwt is the neighbor's current state, and evw is the edge feature between v and w. The sum aggregates messages from all neighbors.
Step 2: Update states. Each node updates its hidden state using the aggregated message:
where Ut is the update function. This takes the node's old state and the incoming messages and produces a new state.
Phase 2: Readout
After T steps of message passing, we need a single prediction for the whole molecule. The readout function R takes all the final node states and produces a graph-level output:
Why is this useful? Because once you realize all these models are instances of the same framework, you can mix and match. Take the message function from one model, the readout from another, and see if the combination works better. That is exactly what the paper does.
One critical requirement: the readout function R must be invariant to permutations of the node states. Shuffling the atoms should not change the molecular property. This rules out simple concatenation but allows operations like summation, attention-weighted pooling, and set2set.
The message function M decides what information a neighbor sends. Different choices lead to different models. Let's walk through three key variants the paper explores.
1. Matrix Multiplication (from Gated Graph Neural Networks)
There is one learned matrix A for each discrete edge type (single bond, double bond, etc.). The message is just the neighbor's hidden state multiplied by the edge-type-specific matrix. Simple, but it requires edge types to be discrete — you cannot use continuous features like distance.
2. Edge Network (proposed in this paper)
Instead of a fixed matrix per edge type, a small neural network A takes the edge feature vector evw and outputs a d × d matrix. This matrix then multiplies the neighbor's state. The edge network can handle continuous edge features like spatial distance, not just discrete bond types.
3. Pair Message (from Interaction Networks)
A neural network f takes the concatenation of both endpoint states plus the edge feature. This is the most expressive option — the message depends on both the sender and the receiver. However, the paper found it performs worse than the edge network in practice (Table 9). The extra expressiveness may make optimization harder.
| Message Function | Depends on hw? | Depends on hv? | Continuous edges? |
|---|---|---|---|
| Matrix Multiply | Yes | No | No |
| Edge Network | Yes | No | Yes |
| Pair Message | Yes | Yes | Yes |
The paper's best results use the edge network: expressive enough to handle continuous features, but not so expressive that it becomes hard to train.
After a node collects messages from its neighbors, it needs to combine them with its own current state to produce a new state. That is the update function U.
The paper uses a Gated Recurrent Unit (GRU) for the update, the same recurrent cell used in sequence models:
Think of the GRU as a learned gate that decides how much of the old state to keep and how much to overwrite with the new message. This is shared across all time steps (weight tying), so the same update function is applied at step 1, step 2, and so on.
The initial hidden state hv0 is simply the atom's feature vector xv, padded with zeros up to dimension d.
Now for the readout function R. After T message passing steps, each node has a final state hvT that encodes information about its local neighborhood (and, through multiple hops, the broader molecular context). We need to collapse all node states into a single graph-level prediction.
The paper compares two readout approaches:
The set2set model (from Vinyals et al., 2015) is more powerful. Instead of a single pass over the nodes, it runs M steps of content-based attention, repeatedly querying the node states. At each step, it attends to different parts of the molecule, building up a richer graph-level representation. Think of it as reading the molecule multiple times, each time focusing on different atoms.
Let's look more carefully at the edge network, because it is the paper's strongest contribution beyond the framework itself.
In the matrix multiplication approach (GG-NN), each discrete edge type gets its own learned matrix. A single bond uses matrix A1, a double bond uses A2, and so on. But what if your edge features are continuous? What if you know the atoms are 1.54 Angstroms apart?
You could discretize: bin the distances into 10 buckets and treat each bucket as a separate edge type. The paper does this and calls it "distance bins." But binning is lossy. Two bonds at 1.51 and 1.59 Angstroms end up in the same bucket.
The edge network replaces the lookup table with a neural network:
The input evw is a 5-dimensional vector: the Euclidean distance plus a one-hot encoding of the bond type. The neural network outputs a d × d matrix. This matrix is then multiplied by the neighbor's hidden state to produce the message:
How big is this neural network? It takes a 5-dimensional input and outputs d2 values (which are reshaped into a d × d matrix). For d = 200, that is 40,000 output values. The network is small (one or two hidden layers), but it is applied to every edge at every message passing step.
The results are clear. Comparing the edge network (enn) with the pair message function (pm) on joint training across all 13 targets, the edge network wins on nearly every target (Table 9). The average MAE across targets: edge network 1.53 vs pair message 3.98. The edge network is not just slightly better — it is dramatically better.
| Input Features | Message Function | Avg. MAE (ratio to chem. accuracy) |
|---|---|---|
| Graph only | Matrix Multiply | 2.57 |
| + Distance | Edge Network | 0.98 |
| + Distance + explicit H | Edge Network | 0.68 |
Adding distance information and using the edge network together cuts the error by more than 3x. Adding explicit hydrogen atoms as nodes (instead of encoding them as a feature) further reduces error. More nodes means a bigger graph and slower training (roughly 10x), but the accuracy gains are substantial.
Two practical problems emerge when you try to scale MPNNs. First: information travels slowly. After T steps of message passing, a node can only "see" atoms within T hops in the graph. If two atoms are on opposite ends of a molecule and T = 3, they never communicate directly.
Second: computation grows as O(n2d2) for dense graphs, where n is the number of atoms and d is the hidden dimension. For large molecules or large hidden states, this becomes expensive.
The paper proposes two solutions for the communication problem:
Both approaches let distant atoms communicate without needing many message passing steps. The master node is particularly elegant: its complexity scales as O(|E|d2 + nd2master), so you can give it a large hidden dimension without blowing up the cost of edge-level computation.
For the scalability problem, the paper introduces towers. Instead of running one propagation with a d-dimensional hidden state, split each node's state into k copies of dimension d/k. Run message passing on each copy separately (with separate learned functions), then mix the k copies back together with a neural network:
Each "tower" does O(n2(d/k)2) work, and there are k towers, so the total is O(n2d2/k). With k = 8, this is an 8x reduction in the dominant cost, with a small overhead for mixing.
In experiments (Table 8), towers with k = 8 outperform the vanilla GG-NN on 12 out of 13 targets in both joint and individual training settings.
Below is a small molecule graph. Click Run to watch messages flow between atoms. At each time step, every node sends a message to its neighbors (colored pulses along edges), aggregates incoming messages, and updates its hidden state.
Use the T slider to control how many message passing steps to run. Use the message function toggle to switch between Matrix Multiply and Edge Network. Watch how each node's hidden state (shown by its color intensity) evolves as it accumulates information from an expanding neighborhood.
In the Edge Network mode, notice that the message pulses vary in color along different edges — the transformation depends on the edge features (bond type and distance). In Matrix Multiply mode, all single bonds use the same transformation, so messages along edges of the same type look identical.
The paper's best model combines: edge network message function, GRU update, set2set readout, explicit hydrogen atoms, and distance features. How does it do?
Some key findings from the ablation studies:
| Ablation | Effect |
|---|---|
| Graph only → + Distance | Dramatic improvement. Distance is the single most important input feature for quantum properties. |
| GG-NN readout → Set2Set | Consistent improvement across most targets. Attention-based pooling captures more structure than summation. |
| Sparse graph → + Virtual edges | Helps on most targets by enabling long-range communication. |
| + Master node | Similar benefits to virtual edges with better scaling properties. |
| Vanilla → Towers (k=8) | Better on 12/13 targets with 2x speedup for d = 200. |
| Edge network → Pair message | Pair message is worse despite being more expressive. Edge network wins. |
| Implicit H → Explicit H | More atoms means slower (10x), but substantially more accurate. |
The training setup: 50 random hyperparameter trials per model-target combination. SGD with Adam, batch size 20, 3 million steps (~540 epochs). Message passing steps T between 3 and 8 (any T ≥ 3 works). Set2set computation steps M between 1 and 12.
The graph-only results (no distance information) are also noteworthy. Even without knowing where atoms are in 3D space, the best graph-only MPNN achieves chemical accuracy on 5 out of 13 targets. This suggests the graph topology alone encodes a surprising amount of chemical information.
On the data efficiency front, Table 6 shows how performance scales with training set size. Going from 11k to 110k training examples roughly halves the error on most targets. The authors argue that future work should focus on datasets with larger molecules or more accurate labels, since the model is beginning to saturate on QM9.
The lineage of graph neural networks. Before this paper, the landscape of neural networks on graphs was fragmented. Kipf and Welling had GCN. Li et al. had GG-NN. Duvenaud et al. had neural fingerprints. Battaglia et al. had interaction networks. Each community used different notation, different terminology, different abstractions. The MPNN framework showed these are all the same thing.
Weisfeiler-Leman and expressiveness. Xu et al. (2019, "How Powerful are Graph Neural Networks?") later proved that MPNNs are at most as powerful as the 1-dimensional Weisfeiler-Leman graph isomorphism test. This means there exist non-isomorphic graphs that no MPNN can distinguish. This theoretical limit has driven research into higher-order message passing (k-WL) and geometric GNNs that use 3D coordinates to break symmetry.
From molecules to proteins. The message passing idea extended naturally to proteins (AlphaFold 2 uses equivariant message passing on residue graphs), materials science (Crystal Graph Neural Networks), and drug discovery (molecular generation and property prediction). The MPNN framework provided the common language for all of this work.
Edge networks and geometric deep learning. The edge network idea — using a neural network to produce transformation matrices from edge features — foreshadows the continuous filters of SchNet (2017) and the directional message passing of DimeNet (2020). The insight that interactions should depend continuously on geometry became central to equivariant neural networks.
Paper details. "Neural Message Passing for Quantum Chemistry," Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, George E. Dahl. ICML 2017. arXiv:1704.01212.