Molecules, social networks, knowledge graphs, road maps — data with no grid, just nodes and connections. How GNNs learn by message passing: each node repeatedly gathers information from its neighbors, until every node understands its place in the whole graph.
So much of the world is naturally a graph — a set of nodes connected by edges. A molecule is atoms (nodes) bonded together (edges). A social network is people connected by friendships. A knowledge graph links entities by relations. Road networks, citation graphs, protein interactions, recommendation systems — all graphs. The information isn’t just in the nodes; it’s in how they connect.
But like a point cloud, a graph has no grid and no fixed order. Nodes have different numbers of neighbors (an atom bonds to 1–4 others; a celebrity has millions of followers). There’s no “node to the right.” You cannot run a convolution, and you cannot flatten the nodes into a vector (there’s no canonical order). Graph Neural Networks are built for exactly this: they learn from graph-structured data by letting each node exchange information with its neighbors, respecting the connectivity. This lesson builds the GNN from its one core operation — message passing.
A molecule, a social network, a grid — the first two are irregular graphs (varying connections), the last is the special regular case a CNN handles. Toggle between examples and see the irregularity.
Let’s pin down the ingredients. A graph has nodes, each with a feature vector (an atom’s type and charge; a user’s profile). It has edges connecting nodes, optionally with their own features (a bond’s type; a friendship’s strength). And it has structure — the connectivity pattern, who’s linked to whom, captured by an adjacency list.
The defining difficulty is irregularity. In an image, every pixel has exactly the same neighborhood shape (a fixed grid around it). In a graph, node A might have 2 neighbors and node B might have 500. Whatever operation we design must handle any number of neighbors, and (like point clouds) must be permutation invariant — the neighbors of a node have no order, so processing them must give the same result regardless of how they’re listed. These two requirements — handle variable neighbor counts, ignore neighbor order — shape everything that follows.
Almost every GNN is an instance of one idea: message passing. To update a node, do three steps. 1. Message: each neighbor computes a message (a function of its own feature, and possibly the edge). 2. Aggregate: combine all the incoming messages with a permutation-invariant function (sum, mean, or max). 3. Update: combine the aggregated message with the node’s current feature (through an MLP) to get its new feature.
Do this for every node simultaneously, and you have one GNN layer. After one layer, each node’s representation reflects itself plus its immediate neighbors (its 1-hop neighborhood). Stack a second layer and each node now indirectly hears from neighbors-of-neighbors (2 hops away), because its neighbors already absorbed their neighbors. After L layers, every node’s representation summarizes its entire L-hop neighborhood. Information diffuses across the graph, hop by hop — nodes gradually understand their place in the structure.
Click a node: its neighbors send messages (arrows), which are aggregated and combined with the node’s own feature to update it. Every node does this at once — that’s one layer.
The aggregate step is where permutation invariance lives — and it’s the same lesson as point clouds. A node’s neighbors have no order, so combining their messages must use a symmetric function: sum, mean, or max. Shuffle the neighbor list and the aggregate is unchanged. (Concatenating them in some order would not be invariant — it would break the moment a neighbor is relisted.)
The choice of aggregator matters. Mean captures the average neighbor (good for distributions, loses count information). Sum preserves count and is provably more expressive (it can tell apart a node with 2 neighbors from one with 4 identical neighbors — mean cannot). Max captures the most salient neighbor. Theoretical work (the GIN paper) showed sum is the most powerful for distinguishing graph structures. The aggregator is a key design choice, but the principle is fixed: it must be symmetric, or your GNN gives different answers for the same graph drawn differently.
A node’s neighbor messages, combined three ways. Press shuffle: the neighbor order changes, but sum/mean/max are unchanged. Switch aggregator to compare what each captures.
The Graph Convolutional Network (Kipf & Welling, 2017) is the simplest, most famous GNN. Its message passing is: each node takes a normalized average of its neighbors’ features (including itself), then applies a linear layer and a nonlinearity. The “normalized” part divides by node degrees so that high-degree nodes don’t dominate — a message from a node with many connections is scaled down, since it’s “shouting to everyone.”
A node has feature value 1.0 and two neighbors with values 4.0 and 7.0. A simple GCN-style update (mean of self + neighbors, then a weight): mean = (1.0 + 4.0 + 7.0) / 3 = 4.0. Apply a learned weight of, say, 0.5: new feature = 0.5 × 4.0 = 2.0 (before the nonlinearity). The node’s value moved from 1.0 toward its neighborhood’s average — it has absorbed local context. Run this on every node, and the whole graph’s features smooth toward local consensus, layer by layer. GCN is essentially a learnable, structure-aware smoothing — spectral graph theory in disguise, but you can understand it entirely as “averaging your neighbors, normalized by degree.”
A node updates to the (degree-normalized) average of its neighbors plus itself. Drag the node’s starting value and watch it pulled toward its neighborhood’s mean — absorbing local context.
GCN treats all neighbors equally (just degree-normalized). But not all neighbors are equally relevant. The Graph Attention Network (GAT, 2018) lets each node learn how much to weight each neighbor. For each neighbor, it computes an attention score (from the two nodes’ features), softmaxes the scores over the neighborhood, and aggregates the neighbor messages as a weighted sum — paying more attention to the important neighbors.
This is exactly attention from transformers, applied over a node’s graph neighborhood instead of a sequence. It adapts to content: in a citation graph, a paper might weight a foundational reference heavily and an incidental one lightly. Attention is naturally permutation invariant (a weighted set operation) and handles variable neighbor counts. GAT often outperforms GCN because the learned weights let the model focus — and it reveals, interpretably, which neighbors mattered. It’s the same spectrum from the point-cloud lesson: GCN’s fixed averaging is the simple aggregator, GAT’s attention is the richer one.
A node attends to its neighbors with learned weights (arrow thickness). Unlike GCN’s equal averaging, GAT emphasizes the relevant neighbors. Drag to change which neighbor is most relevant.
More layers means a larger receptive field — each node hears from farther across the graph. So why not stack 20 layers? Because of over-smoothing. Each message-passing layer averages neighbors, pulling connected nodes’ features toward each other. Do this too many times and every node’s representation converges to nearly the same value — the whole graph blurs into mush, and nodes become indistinguishable. Useful signal is smoothed away.
So GNNs face a tension: enough depth to capture long-range structure, but not so much that everything collapses. In practice, most GNNs are shallow (2–4 layers). Fixes exist — residual/skip connections (let nodes keep their own signal), normalization, and architectures designed to resist smoothing — and there’s a related issue called over-squashing (too much information forced through too few edges, a bottleneck for long-range messages). But the core intuition is simple: averaging is smoothing, and too much smoothing erases the very distinctions you’re trying to learn.
Node colors (features) at increasing depth. A few layers spread useful local context; too many and every node collapses to the same color — over-smoothed mush. Drag the number of layers and watch distinctions vanish.
Watch information diffuse across a graph. A signal starts at one node; each layer of message passing spreads it one hop further. With the right depth, a node-classification signal propagates to label the graph; with too much, it over-smooths. Inject a signal, step through the layers, and see the L-hop receptive field grow — then collapse.
Click a node to inject a signal, then press Step to apply message-passing layers. Watch the signal spread one hop per layer (the receptive field). Keep stepping and it over-smooths — the whole graph homogenizes. The readout tracks reach and smoothing.
The sweet spot is a few layers: enough for the signal to reach the relevant neighborhood, not so many that everything blurs together. That balance — receptive field versus over-smoothing — is the central practical tension in GNN design, and watching it play out is the whole intuition.
GNNs handle three task levels, all from the same node representations:
Applications are everywhere: drug discovery (molecules as graphs — predicting properties, generating candidates), recommendation (user–item graphs), traffic/maps (road networks — Google Maps ETA uses GNNs), fraud detection, knowledge graphs. And here’s the beautiful unifying view: a transformer is a GNN on a fully-connected graph with attention as the aggregator; a point cloud net is a GNN on a nearest-neighbor graph; a CNN is a GNN on a grid graph. Message passing over neighbors, with a symmetric aggregator, is a deeply general template — GNNs just make the graph explicit.
The same message-passing idea on different graphs: a grid (= CNN), a nearest-neighbor graph (= point net), a fully-connected graph with attention (= transformer), and a general graph (= GNN). Drag to see them as one family.
| GNN | Aggregator | Note |
|---|---|---|
| GCN | degree-normalized mean | simplest; structure-aware smoothing |
| GraphSAGE | sample + mean/max/LSTM | scalable to huge graphs |
| GAT | attention-weighted sum | learns neighbor importance |
| GIN | sum + MLP | most expressive (distinguishes structures) |
→ Point Clouds — a nearest-neighbor graph; same symmetric aggregation
→ Attention Variants — the attention GAT borrows
→ The Transformer — a GNN on a fully-connected graph
→ Pooling & Aggregation — the symmetric readout for graph-level tasks