Modalities & Methods

Graph Neural Networks

Molecules, social networks, knowledge graphs, road maps — data with no grid, just nodes and connections. How GNNs learn by message passing: each node repeatedly gathers information from its neighbors, until every node understands its place in the whole graph.

Prerequisites: An MLP maps a vector to a vector + A graph is nodes connected by edges. That’s it.

Chapters

Simulations

Assumed Knowledge

Chapter 0: Data as Graphs

So much of the world is naturally a graph — a set of nodes connected by edges. A molecule is atoms (nodes) bonded together (edges). A social network is people connected by friendships. A knowledge graph links entities by relations. Road networks, citation graphs, protein interactions, recommendation systems — all graphs. The information isn’t just in the nodes; it’s in how they connect.

But like a point cloud, a graph has no grid and no fixed order. Nodes have different numbers of neighbors (an atom bonds to 1–4 others; a celebrity has millions of followers). There’s no “node to the right.” You cannot run a convolution, and you cannot flatten the nodes into a vector (there’s no canonical order). Graph Neural Networks are built for exactly this: they learn from graph-structured data by letting each node exchange information with its neighbors, respecting the connectivity. This lesson builds the GNN from its one core operation — message passing.

The trap: “just use the node features and ignore the edges.” Then you’ve thrown away the whole point — the structure. Two molecules can have identical atoms but different bonds and behave completely differently. A GNN’s power is that it makes each node’s representation depend on its neighborhood, so the connections shape the answer.

The world is full of graphs

A molecule, a social network, a grid — the first two are irregular graphs (varying connections), the last is the special regular case a CNN handles. Toggle between examples and see the irregularity.

Why can’t a standard CNN process a general graph?

Graphs are too small Graphs have no grid and no fixed order — nodes have varying numbers of neighbors, so there’s no kernel to slide CNNs can’t do classification

Chapter 1: Nodes, Edges, Features

Let’s pin down the ingredients. A graph has nodes, each with a feature vector (an atom’s type and charge; a user’s profile). It has edges connecting nodes, optionally with their own features (a bond’s type; a friendship’s strength). And it has structure — the connectivity pattern, who’s linked to whom, captured by an adjacency list.

The defining difficulty is irregularity. In an image, every pixel has exactly the same neighborhood shape (a fixed grid around it). In a graph, node A might have 2 neighbors and node B might have 500. Whatever operation we design must handle any number of neighbors, and (like point clouds) must be permutation invariant — the neighbors of a node have no order, so processing them must give the same result regardless of how they’re listed. These two requirements — handle variable neighbor counts, ignore neighbor order — shape everything that follows.

What two requirements must a graph operation satisfy?

Be fast and use little memory Handle any number of neighbors AND be permutation invariant (neighbor order must not matter) Only work on connected graphs

Chapter 2: Message Passing — the core operation

Almost every GNN is an instance of one idea: message passing. To update a node, do three steps. 1. Message: each neighbor computes a message (a function of its own feature, and possibly the edge). 2. Aggregate: combine all the incoming messages with a permutation-invariant function (sum, mean, or max). 3. Update: combine the aggregated message with the node’s current feature (through an MLP) to get its new feature.

Do this for every node simultaneously, and you have one GNN layer. After one layer, each node’s representation reflects itself plus its immediate neighbors (its 1-hop neighborhood). Stack a second layer and each node now indirectly hears from neighbors-of-neighbors (2 hops away), because its neighbors already absorbed their neighbors. After L layers, every node’s representation summarizes its entire L-hop neighborhood. Information diffuses across the graph, hop by hop — nodes gradually understand their place in the structure.

The mental picture: message passing is gossip. In round one, everyone tells their direct friends what they know. In round two, friends pass on what they heard, so you learn about friends-of-friends. After a few rounds, news from across the network reaches you — filtered and summarized through the connections. A GNN layer is one round of gossip; depth is how many rounds.

One round of message passing

Click a node: its neighbors send messages (arrows), which are aggregated and combined with the node’s own feature to update it. Every node does this at once — that’s one layer.

Now build it. Below is the gossip network as real code: 8 people, and only node 0 knows the news (feature 1.0; everyone else 0.0). The setup gives you the adjacency matrix A and the degrees; you write the two lines that are the layer — the aggregate and the update. Run the shipped version first: nothing spreads, because the placeholder aggregate is all zeros. Then make the gossip move.

What are the three steps of message passing?

Sort, filter, classify Message (each neighbor sends one), Aggregate (combine them symmetrically), Update (merge with the node’s own feature) Encode, decode, attend

Chapter 3: Aggregation Must Be Symmetric

The aggregate step is where permutation invariance lives — and it’s the same lesson as point clouds. A node’s neighbors have no order, so combining their messages must use a symmetric function: sum, mean, or max. Shuffle the neighbor list and the aggregate is unchanged. (Concatenating them in some order would not be invariant — it would break the moment a neighbor is relisted.)

The choice of aggregator matters. Mean captures the average neighbor (good for distributions, loses count information). Sum preserves count and is provably more expressive (it can tell apart a node with 2 neighbors from one with 4 identical neighbors — mean cannot). Max captures the most salient neighbor. Theoretical work (the GIN paper) showed sum is the most powerful for distinguishing graph structures. The aggregator is a key design choice, but the principle is fixed: it must be symmetric, or your GNN gives different answers for the same graph drawn differently.

Symmetric aggregators

A node’s neighbor messages, combined three ways. Press shuffle: the neighbor order changes, but sum/mean/max are unchanged. Switch aggregator to compare what each captures.

aggregatorsum

Why must the aggregation step be a symmetric function (sum/mean/max)?

To save computation Neighbors have no order, so only an order-independent function gives the same result regardless of how they’re listed (permutation invariance) Because graphs are always symmetric

Chapter 4: GCN — normalized neighbor averaging

The Graph Convolutional Network (Kipf & Welling, 2017) is the simplest, most famous GNN. Its message passing is: each node takes a normalized average of its neighbors’ features (including itself), then applies a linear layer and a nonlinearity. The “normalized” part divides by node degrees so that high-degree nodes don’t dominate — a message from a node with many connections is scaled down, since it’s “shouting to everyone.”

Worked example by hand

A node has feature value 1.0 and two neighbors with values 4.0 and 7.0. A simple GCN-style update (mean of self + neighbors, then a weight): mean = (1.0 + 4.0 + 7.0) / 3 = 4.0. Apply a learned weight of, say, 0.5: new feature = 0.5 × 4.0 = 2.0 (before the nonlinearity). The node’s value moved from 1.0 toward its neighborhood’s average — it has absorbed local context. Run this on every node, and the whole graph’s features smooth toward local consensus, layer by layer. GCN is essentially a learnable, structure-aware smoothing — spectral graph theory in disguise, but you can understand it entirely as “averaging your neighbors, normalized by degree.”

GCN: degree-normalized averaging

A node updates to the (degree-normalized) average of its neighbors plus itself. Drag the node’s starting value and watch it pulled toward its neighborhood’s mean — absorbing local context.

node starting value1.0

The “normalized” in Graph Convolutional Network is one specific matrix: Â = D^−1/2 (A + I) D^−1/2 — add self-loops, then divide each entry by the square roots of both endpoints’ degrees. That one matrix is the whole “shouting to everyone” fix: a hub’s messages get scaled down on the way out and on the way in. Build it on a 6-node hub graph and watch the hub’s row dim.

What is a GCN layer, in plain terms?

A node attends to all other nodes in the graph Each node takes a degree-normalized average of its neighbors (and itself), then a linear layer + nonlinearity — learnable, structure-aware smoothing A convolution on a pixel grid

Chapter 5: GAT — attention over neighbors

GCN treats all neighbors equally (just degree-normalized). But not all neighbors are equally relevant. The Graph Attention Network (GAT, 2018) lets each node learn how much to weight each neighbor. For each neighbor, it computes an attention score (from the two nodes’ features), softmaxes the scores over the neighborhood, and aggregates the neighbor messages as a weighted sum — paying more attention to the important neighbors.

This is exactly attention from transformers, applied over a node’s graph neighborhood instead of a sequence. It adapts to content: in a citation graph, a paper might weight a foundational reference heavily and an incidental one lightly. Attention is naturally permutation invariant (a weighted set operation) and handles variable neighbor counts. GAT often outperforms GCN because the learned weights let the model focus — and it reveals, interpretably, which neighbors mattered. It’s the same spectrum from the point-cloud lesson: GCN’s fixed averaging is the simple aggregator, GAT’s attention is the richer one.

GAT: learned neighbor importance

A node attends to its neighbors with learned weights (arrow thickness). Unlike GCN’s equal averaging, GAT emphasizes the relevant neighbors. Drag to change which neighbor is most relevant.

most-relevant neighbor0

How does GAT differ from GCN?

GAT ignores the graph structure GAT learns attention weights over neighbors (weighted aggregation) instead of GCN’s fixed degree-normalized average GAT only works on trees

Chapter 6: Depth & Over-Smoothing

More layers means a larger receptive field — each node hears from farther across the graph. So why not stack 20 layers? Because of over-smoothing. Each message-passing layer averages neighbors, pulling connected nodes’ features toward each other. Do this too many times and every node’s representation converges to nearly the same value — the whole graph blurs into mush, and nodes become indistinguishable. Useful signal is smoothed away.

So GNNs face a tension: enough depth to capture long-range structure, but not so much that everything collapses. In practice, most GNNs are shallow (2–4 layers). Fixes exist — residual/skip connections (let nodes keep their own signal), normalization, and architectures designed to resist smoothing — and there’s a related issue called over-squashing (too much information forced through too few edges, a bottleneck for long-range messages). But the core intuition is simple: averaging is smoothing, and too much smoothing erases the very distinctions you’re trying to learn.

Common misconception: “deeper GNN = better, like deep CNNs.” The opposite often holds. A deep CNN’s receptive field grows usefully; a deep GNN’s message passing homogenizes nodes. The right depth is roughly the number of hops of context a task needs — usually small — not “as deep as possible.”

Over-smoothing with depth

Node colors (features) at increasing depth. A few layers spread useful local context; too many and every node collapses to the same color — over-smoothed mush. Drag the number of layers and watch distinctions vanish.

layers2

Don’t take the widget’s word for it — cause the collapse yourself. Below, 14 nodes start with random, distinct features. You write the one line that is a smoothing layer (each node becomes the mean of itself + its neighbors), apply it 12 times, and measure the variance of the node features after every layer. Averaging is smoothing; the variance is the proof.

What is over-smoothing in GNNs?

The graph becomes disconnected Too many message-passing layers average node features until all nodes converge to nearly the same representation, erasing useful distinctions The model overfits the training graph

Chapter 7: Message Passing, Live (showcase)

Watch information diffuse across a graph. A signal starts at one node; each layer of message passing spreads it one hop further. With the right depth, a node-classification signal propagates to label the graph; with too much, it over-smooths. Inject a signal, step through the layers, and see the L-hop receptive field grow — then collapse.

Information diffusing through the graph

Click a node to inject a signal, then press Step to apply message-passing layers. Watch the signal spread one hop per layer (the receptive field). Keep stepping and it over-smooths — the whole graph homogenizes. The readout tracks reach and smoothing.

The sweet spot is a few layers: enough for the signal to reach the relevant neighborhood, not so many that everything blurs together. That balance — receptive field versus over-smoothing — is the central practical tension in GNN design, and watching it play out is the whole intuition.

Chapter 8: Tasks, Applications & a Unifying View

GNNs handle three task levels, all from the same node representations:

Node-level: classify each node (is this user a bot? what type is this atom?) — use the final node features directly.
Edge-level (link prediction): will these two nodes connect? (friend recommendation, drug interaction) — combine the two nodes’ features.
Graph-level: classify the whole graph (is this molecule toxic?) — pool all node features (a symmetric readout) into one graph embedding.

Applications are everywhere: drug discovery (molecules as graphs — predicting properties, generating candidates), recommendation (user–item graphs), traffic/maps (road networks — Google Maps ETA uses GNNs), fraud detection, knowledge graphs. And here’s the beautiful unifying view: a transformer is a GNN on a fully-connected graph with attention as the aggregator; a point cloud net is a GNN on a nearest-neighbor graph; a CNN is a GNN on a grid graph. Message passing over neighbors, with a symmetric aggregator, is a deeply general template — GNNs just make the graph explicit.

One template, many architectures

The same message-passing idea on different graphs: a grid (= CNN), a nearest-neighbor graph (= point net), a fully-connected graph with attention (= transformer), and a general graph (= GNN). Drag to see them as one family.

graph typegeneral

In the unifying view, a transformer is:

unrelated to GNNs a GNN on a fully-connected graph, with attention as the aggregator a GNN with no message passing

Chapter 9: Cheat Sheet & Connections

graph

nodes + edges + features; irregular, unordered neighbors

↓ message passing (one layer)

message → aggregate → update

neighbors send messages, combine symmetrically (sum/mean/max), merge with own feature

↓ choice of aggregator

GCN / GAT

GCN: degree-normalized average · GAT: learned attention weights

↓ stack L layers (mind over-smoothing)

L-hop representations

→ node / link / graph (pooled) predictions

GNN	Aggregator	Note
GCN	degree-normalized mean	simplest; structure-aware smoothing
GraphSAGE	sample + mean/max/LSTM	scalable to huge graphs
GAT	attention-weighted sum	learns neighbor importance
GIN	sum + MLP	most expressive (distinguishes structures)

Keep exploring

→ Point Clouds — a nearest-neighbor graph; same symmetric aggregation
→ Attention Variants — the attention GAT borrows
→ The Transformer — a GNN on a fully-connected graph
→ Pooling & Aggregation — the symmetric readout for graph-level tasks

“What I cannot create, I do not understand.” You just rebuilt the GNN from one operation: each node gathers messages from its neighbors, combines them with a symmetric function, and updates itself — repeated until every node understands its neighborhood. Mind the over-smoothing, pick your aggregator, and you can learn on molecules, maps, and networks alike.