Gilmer, Schoenholz, Riley, Vinyals & Dahl — Google Brain / DeepMind, ICML 2017 • arXiv:1704.01212

Neural Message Passing
for Quantum Chemistry

Eight different graph neural networks look completely different on paper. This paper showed they are all doing the same thing — passing messages between atoms. Then it pushed the idea further than anyone had before.

Prerequisites: Basic neural networks + Graph intuition (nodes and edges — we derive the rest)
10
Chapters
1
Simulation
10
Quizzes

Chapter 0: The Problem

You are a chemist. You have a new molecule and you want to know its properties — how much energy holds its atoms together, how it distributes its electrons, whether it vibrates at certain frequencies. The gold-standard method is called Density Functional Theory (DFT), a quantum mechanical simulation. For a single molecule with 9 heavy atoms, DFT takes about an hour on a modern CPU. For a larger molecule, it can take eight hours.

Now imagine you have 134,000 molecules. That is over 15 CPU-years of computation.

What if a neural network could learn to approximate the DFT calculation, predicting those same properties in a fraction of a second? That is the promise of this paper: train a neural network on molecules, feed it new molecules it has never seen, and get quantum-accurate predictions 300,000 times faster than DFT.

The challenge: Molecules are not images or sequences. They are graphs — atoms connected by bonds, with no fixed ordering. Any neural network that operates on molecules must be invariant to graph isomorphism: relabeling the atoms should not change the prediction.

By 2017, at least eight different research groups had proposed neural network architectures for learning on graphs. Each paper introduced its own notation, its own framework, its own way of thinking about the problem. From the outside, they looked like eight different ideas.

Gilmer et al. looked at all eight and made a striking observation: they are all doing the same thing. Every single one can be described as nodes passing messages to their neighbors, aggregating those messages, and updating their state. The only differences are in which message function, which update function, and which readout function they use.

The paper calls this shared abstraction Message Passing Neural Networks (MPNNs). It is not a new model — it is a unifying lens. And once you have that lens, you can mix and match components to find combinations that nobody had tried.

Why can't we simply feed a molecule's atoms as a flat list into a standard neural network?

Chapter 1: Molecules as Graphs

A molecule is a natural graph. Each atom is a node, and each bond is an edge. Take ethanol (the alcohol in your drink): two carbons, one oxygen, six hydrogens, connected by single bonds. That is a graph with nine nodes and eight edges.

Each node carries features — information about the atom it represents:

FeatureDescriptionEncoding
Atom typeH, C, N, O, or FOne-hot (5 dims)
Atomic numberNumber of protonsInteger
AcceptorAccepts electrons?Binary
DonorDonates electrons?Binary
AromaticIn an aromatic ring?Binary
Hybridizationsp, sp2, or sp3One-hot
Num. HydrogensAttached H atomsInteger

Each edge also carries features. At minimum, the bond type (single, double, triple, or aromatic). When spatial information is available, the edge also stores the Euclidean distance between the two atoms.

Two input settings. The paper considers both (1) the graph-only setting, where only atom types and bond types are known, and (2) the spatial setting, where 3D atom positions are also provided. The spatial setting is much easier — distances tell you a lot about molecular properties. The graph-only setting is harder but more broadly useful, since computing 3D geometry is itself expensive.

The paper uses the QM9 dataset: 134,000 small organic molecules (up to 9 heavy atoms each), with 13 quantum-chemical properties computed by DFT. These properties span four broad categories:

Energetics
Atomization energy (U0, U, H, G) — how tightly bound the atoms are
Vibrations
ZPVE and highest vibrational frequency ω1
Electronic
HOMO, LUMO, energy gap Δε
Spatial
Dipole moment μ, polarizability α, electronic extent R2

Success is measured against two benchmarks: DFT error (matching the DFT simulation) and chemical accuracy (a tighter target established by chemists). Achieving chemical accuracy on all 13 targets would mean the neural network is accurate enough to replace DFT for practical purposes.

In the QM9 dataset, what does each "sample" represent?

Chapter 2: The MPNN Framework

Here is the core idea of the paper. Every graph neural network the authors surveyed does the same three things. They just implement them differently.

Phase 1: Message Passing (runs for T time steps)

Every node v in the graph has a hidden state hvt. At each time step, two things happen:

Step 1: Compute messages. For each node v, gather a message from every neighbor w:

mvt+1 = ∑w ∈ N(v) Mt(hvt, hwt, evw)

where Mt is the message function, hwt is the neighbor's current state, and evw is the edge feature between v and w. The sum aggregates messages from all neighbors.

Step 2: Update states. Each node updates its hidden state using the aggregated message:

hvt+1 = Ut(hvt, mvt+1)

where Ut is the update function. This takes the node's old state and the incoming messages and produces a new state.

Phase 2: Readout

After T steps of message passing, we need a single prediction for the whole molecule. The readout function R takes all the final node states and produces a graph-level output:

ŷ = R({hvT | v ∈ G})
The unifying insight: M (message), U (update), R (readout). That is the entire framework. Every graph neural network the authors surveyed — GG-NN, Interaction Networks, Molecular Fingerprints, Deep Tensor, Laplacian methods, and more — is just a specific choice of M, U, and R. The paper shows this explicitly for eight existing models.

Why is this useful? Because once you realize all these models are instances of the same framework, you can mix and match. Take the message function from one model, the readout from another, and see if the combination works better. That is exactly what the paper does.

One critical requirement: the readout function R must be invariant to permutations of the node states. Shuffling the atoms should not change the molecular property. This rules out simple concatenation but allows operations like summation, attention-weighted pooling, and set2set.

What are the three components that define any MPNN?

Chapter 3: Message Functions

The message function M decides what information a neighbor sends. Different choices lead to different models. Let's walk through three key variants the paper explores.

1. Matrix Multiplication (from Gated Graph Neural Networks)

M(hv, hw, evw) = Aevw hw

There is one learned matrix A for each discrete edge type (single bond, double bond, etc.). The message is just the neighbor's hidden state multiplied by the edge-type-specific matrix. Simple, but it requires edge types to be discrete — you cannot use continuous features like distance.

2. Edge Network (proposed in this paper)

M(hv, hw, evw) = A(evw) hw

Instead of a fixed matrix per edge type, a small neural network A takes the edge feature vector evw and outputs a d × d matrix. This matrix then multiplies the neighbor's state. The edge network can handle continuous edge features like spatial distance, not just discrete bond types.

Why edge networks matter: With matrix multiplication, single bonds and double bonds get different learned matrices, but you cannot smoothly interpolate. With edge networks, the message transformation depends continuously on the edge features. A bond at 1.2 Angstroms and one at 1.3 Angstroms will produce slightly different transformation matrices.

3. Pair Message (from Interaction Networks)

mwv = f(hwt, hvt, evw)

A neural network f takes the concatenation of both endpoint states plus the edge feature. This is the most expressive option — the message depends on both the sender and the receiver. However, the paper found it performs worse than the edge network in practice (Table 9). The extra expressiveness may make optimization harder.

Message FunctionDepends on hw?Depends on hv?Continuous edges?
Matrix MultiplyYesNoNo
Edge NetworkYesNoYes
Pair MessageYesYesYes

The paper's best results use the edge network: expressive enough to handle continuous features, but not so expressive that it becomes hard to train.

What is the key advantage of the edge network message function over the matrix multiply approach?

Chapter 4: Update and Readout

After a node collects messages from its neighbors, it needs to combine them with its own current state to produce a new state. That is the update function U.

The paper uses a Gated Recurrent Unit (GRU) for the update, the same recurrent cell used in sequence models:

hvt+1 = GRU(hvt, mvt+1)

Think of the GRU as a learned gate that decides how much of the old state to keep and how much to overwrite with the new message. This is shared across all time steps (weight tying), so the same update function is applied at step 1, step 2, and so on.

Why a GRU? The message passing process runs for T steps. At each step, a node refines its state based on new messages. This is analogous to a recurrent process — the same operation applied repeatedly, with the state accumulating information over time. A GRU handles this gracefully because its gating mechanism prevents information from being washed away over many steps.

The initial hidden state hv0 is simply the atom's feature vector xv, padded with zeros up to dimension d.

Now for the readout function R. After T message passing steps, each node has a final state hvT that encodes information about its local neighborhood (and, through multiple hops, the broader molecular context). We need to collapse all node states into a single graph-level prediction.

The paper compares two readout approaches:

GG-NN Readout
R = ∑v σ(i(hvT, hv0)) ⊙ j(hvT)
Two neural networks i and j, element-wise gating, then sum
vs
Set2Set Readout
An attention-based pooling mechanism that iterates M times over the node set, producing a permutation-invariant embedding

The set2set model (from Vinyals et al., 2015) is more powerful. Instead of a single pass over the nodes, it runs M steps of content-based attention, repeatedly querying the node states. At each step, it attends to different parts of the molecule, building up a richer graph-level representation. Think of it as reading the molecule multiple times, each time focusing on different atoms.

Set2Set is key. Switching from GG-NN readout to set2set improved results on nearly every target (Table 7 in the paper). Simple summation throws away information about which atoms contributed what. Set2set preserves more of the structural information by attending to the node set multiple times.
Why must the readout function R be invariant to the ordering of the nodes?

Chapter 5: Edge Networks

Let's look more carefully at the edge network, because it is the paper's strongest contribution beyond the framework itself.

In the matrix multiplication approach (GG-NN), each discrete edge type gets its own learned matrix. A single bond uses matrix A1, a double bond uses A2, and so on. But what if your edge features are continuous? What if you know the atoms are 1.54 Angstroms apart?

You could discretize: bin the distances into 10 buckets and treat each bucket as a separate edge type. The paper does this and calls it "distance bins." But binning is lossy. Two bonds at 1.51 and 1.59 Angstroms end up in the same bucket.

The edge network replaces the lookup table with a neural network:

A(evw) = NN(evw) ∈ Rd × d

The input evw is a 5-dimensional vector: the Euclidean distance plus a one-hot encoding of the bond type. The neural network outputs a d × d matrix. This matrix is then multiplied by the neighbor's hidden state to produce the message:

mw→v = A(evw) · hwt
The edge network as a learned interaction law. In physics, the force between two particles depends on their distance. Different interaction laws (gravity, electrostatics, van der Waals) have different functional forms. The edge network learns a transformation that depends on distance and bond type — effectively learning an interaction law from data.

How big is this neural network? It takes a 5-dimensional input and outputs d2 values (which are reshaped into a d × d matrix). For d = 200, that is 40,000 output values. The network is small (one or two hidden layers), but it is applied to every edge at every message passing step.

The results are clear. Comparing the edge network (enn) with the pair message function (pm) on joint training across all 13 targets, the edge network wins on nearly every target (Table 9). The average MAE across targets: edge network 1.53 vs pair message 3.98. The edge network is not just slightly better — it is dramatically better.

Input FeaturesMessage FunctionAvg. MAE (ratio to chem. accuracy)
Graph onlyMatrix Multiply2.57
+ DistanceEdge Network0.98
+ Distance + explicit HEdge Network0.68

Adding distance information and using the edge network together cuts the error by more than 3x. Adding explicit hydrogen atoms as nodes (instead of encoding them as a feature) further reduces error. More nodes means a bigger graph and slower training (roughly 10x), but the accuracy gains are substantial.

What does the edge network output?

Chapter 6: Virtual Elements & Towers

Two practical problems emerge when you try to scale MPNNs. First: information travels slowly. After T steps of message passing, a node can only "see" atoms within T hops in the graph. If two atoms are on opposite ends of a molecule and T = 3, they never communicate directly.

Second: computation grows as O(n2d2) for dense graphs, where n is the number of atoms and d is the hidden dimension. For large molecules or large hidden states, this becomes expensive.

The paper proposes two solutions for the communication problem:

Virtual Edges
Add a special edge type between every pair of non-bonded atoms. Now every atom is one hop from every other atom. Information travels globally in a single step.
or
Master Node
Add a single "master" node connected to every atom with a special edge type. It acts as a global scratch space — every atom reads from and writes to it at every step.

Both approaches let distant atoms communicate without needing many message passing steps. The master node is particularly elegant: its complexity scales as O(|E|d2 + nd2master), so you can give it a large hidden dimension without blowing up the cost of edge-level computation.

For the scalability problem, the paper introduces towers. Instead of running one propagation with a d-dimensional hidden state, split each node's state into k copies of dimension d/k. Run message passing on each copy separately (with separate learned functions), then mix the k copies back together with a neural network:

hvt,1, hvt,2, …, hvt,k = g(h̃vt,1, h̃vt,2, …, h̃vt,k)

Each "tower" does O(n2(d/k)2) work, and there are k towers, so the total is O(n2d2/k). With k = 8, this is an 8x reduction in the dominant cost, with a small overhead for mixing.

Towers are a precursor to multi-head attention. The idea of splitting the representation into independent "heads," processing them separately, and recombining them should sound familiar. Transformers use the same trick in multi-head attention (Vaswani et al., 2017, published the same year). The towers approach foreshadows this pattern in the graph domain.

In experiments (Table 8), towers with k = 8 outperform the vanilla GG-NN on 12 out of 13 targets in both joint and individual training settings.

What problem does the master node solve in the MPNN?

Chapter 7: Showcase — Message Passing in Action

Below is a small molecule graph. Click Run to watch messages flow between atoms. At each time step, every node sends a message to its neighbors (colored pulses along edges), aggregates incoming messages, and updates its hidden state.

Use the T slider to control how many message passing steps to run. Use the message function toggle to switch between Matrix Multiply and Edge Network. Watch how each node's hidden state (shown by its color intensity) evolves as it accumulates information from an expanding neighborhood.

Ready
What to notice: After T = 1 step, each node knows about its immediate neighbors. After T = 2, it knows about neighbors-of-neighbors. After T = 3, information from three hops away has arrived. With T = 6 on this small molecule, every node effectively "sees" the entire graph. The color intensity of each node shows how much information it has accumulated.

In the Edge Network mode, notice that the message pulses vary in color along different edges — the transformation depends on the edge features (bond type and distance). In Matrix Multiply mode, all single bonds use the same transformation, so messages along edges of the same type look identical.

After T = 2 message passing steps, how far can information travel from a given node?

Chapter 8: The Experiments

The paper's best model combines: edge network message function, GRU update, set2set readout, explicit hydrogen atoms, and distance features. How does it do?

The headline result: The MPNN achieves state of the art on all 13 QM9 targets and predicts DFT to within chemical accuracy on 11 out of 13 targets. The two exceptions are the dipole moment μ and the electronic spatial extent R2.

Some key findings from the ablation studies:

AblationEffect
Graph only → + DistanceDramatic improvement. Distance is the single most important input feature for quantum properties.
GG-NN readout → Set2SetConsistent improvement across most targets. Attention-based pooling captures more structure than summation.
Sparse graph → + Virtual edgesHelps on most targets by enabling long-range communication.
+ Master nodeSimilar benefits to virtual edges with better scaling properties.
Vanilla → Towers (k=8)Better on 12/13 targets with 2x speedup for d = 200.
Edge network → Pair messagePair message is worse despite being more expressive. Edge network wins.
Implicit H → Explicit HMore atoms means slower (10x), but substantially more accurate.

The training setup: 50 random hyperparameter trials per model-target combination. SGD with Adam, batch size 20, 3 million steps (~540 epochs). Message passing steps T between 3 and 8 (any T ≥ 3 works). Set2set computation steps M between 1 and 12.

The graph-only results (no distance information) are also noteworthy. Even without knowing where atoms are in 3D space, the best graph-only MPNN achieves chemical accuracy on 5 out of 13 targets. This suggests the graph topology alone encodes a surprising amount of chemical information.

On the data efficiency front, Table 6 shows how performance scales with training set size. Going from 11k to 110k training examples roughly halves the error on most targets. The authors argue that future work should focus on datasets with larger molecules or more accurate labels, since the model is beginning to saturate on QM9.

What single input feature most dramatically improves MPNN performance on quantum property prediction?

Chapter 9: Connections

The lineage of graph neural networks. Before this paper, the landscape of neural networks on graphs was fragmented. Kipf and Welling had GCN. Li et al. had GG-NN. Duvenaud et al. had neural fingerprints. Battaglia et al. had interaction networks. Each community used different notation, different terminology, different abstractions. The MPNN framework showed these are all the same thing.

Before MPNN (pre-2017)
8+ separate GNN architectures, each with its own formalism. Hard to compare or combine.
↓ unified by
MPNN Framework (2017)
One abstraction: M, U, R. All prior work is a special case. Mix and match components.
↓ influenced
Modern GNN Landscape
GAT, GraphSAGE, GIN, SchNet, DimeNet — all described in message passing terms. The framework became the standard language.

Weisfeiler-Leman and expressiveness. Xu et al. (2019, "How Powerful are Graph Neural Networks?") later proved that MPNNs are at most as powerful as the 1-dimensional Weisfeiler-Leman graph isomorphism test. This means there exist non-isomorphic graphs that no MPNN can distinguish. This theoretical limit has driven research into higher-order message passing (k-WL) and geometric GNNs that use 3D coordinates to break symmetry.

From molecules to proteins. The message passing idea extended naturally to proteins (AlphaFold 2 uses equivariant message passing on residue graphs), materials science (Crystal Graph Neural Networks), and drug discovery (molecular generation and property prediction). The MPNN framework provided the common language for all of this work.

Edge networks and geometric deep learning. The edge network idea — using a neural network to produce transformation matrices from edge features — foreshadows the continuous filters of SchNet (2017) and the directional message passing of DimeNet (2020). The insight that interactions should depend continuously on geometry became central to equivariant neural networks.

The lasting impact. This paper has over 7,000 citations. Its main contribution is not any single architecture but the framework itself. By showing that eight different models are all doing the same thing, it gave the community a common language, made it easy to combine ideas, and accelerated progress on graph learning by years. The MPNN framework is now the default way to think about neural networks on graphs.

Paper details. "Neural Message Passing for Quantum Chemistry," Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, George E. Dahl. ICML 2017. arXiv:1704.01212.

← Back to Veanors Hub

What is the theoretical expressiveness limit of MPNNs, as proven by later work?