How to learn node embeddings by aggregating neighbor information through neural network layers — and why this is more powerful than any lookup table.
In Lecture 2 you learned to embed nodes using DeepWalk and node2vec. The idea was beautiful: assign every node a learnable vector, train those vectors so similar nodes end up nearby in embedding space. But there's a catch — one that becomes obvious the moment you try to use these embeddings in the real world.
Suppose you train node2vec on a social network with 1 million users. You get a 128-dimensional vector for each user — 128 million numbers total. Now a new user signs up. What's their embedding? There is none. The lookup table has no row for a node it has never seen. You'd have to retrain the entire model from scratch.
Let's be precise about what goes wrong with shallow (lookup-table) embeddings:
1. O(|V| · d) parameters. Every node gets its own d-dimensional row in the embedding matrix Z ∈ R|V| × d. For a graph with 100 million nodes and 128 dimensions: 12.8 billion parameters. Just for the embeddings. Before any prediction head.
2. No parameter sharing. Node A's embedding is unrelated to node B's embedding, even if they play structurally identical roles. Two users who both have 10 friends in a tight clique should get similar embeddings — but a lookup table learns each independently, requiring enormous amounts of data to discover this pattern.
3. Ignores node features. Graphs in the real world come with rich features: protein graphs have amino acid properties, social graphs have user profiles, citation graphs have paper abstracts. A lookup table throws all of this away. Its embedding for node v is the same whether v represents a biologist named Alice or a retired plumber named Bob.
4. Transductive — can't generalize. The trained embedding only exists for training nodes. New nodes, new graphs, different graphs with the same structure — all require retraining.
We need an encoder function that takes a node and its graph context as INPUT and computes an embedding as OUTPUT. Not a lookup table that memorizes. A function that computes.
This encoder should look at the node's features AND its position in the graph structure — its neighbors, its neighbors' neighbors, and so on. If you train this encoder on one graph, it should be able to compute embeddings for nodes it's never seen before, because it reasons from features and structure, not from memorized lookup indices.
That's exactly what a Graph Neural Network does. The rest of this lesson builds it up piece by piece.
A lookup table has no embedding for new nodes. Click "Add New Node" to see what happens when a node appears that wasn't in training.
Here is the key idea of GNNs. Say you want to embed a node v. Instead of looking it up in a table, you compute its embedding by looking at who its neighbors are and what those neighbors look like.
But then how do you embed v's neighbors? By looking at THEIR neighbors. And v's neighbors' neighbors? By looking at THEIR neighbors. You recurse K times, where K is the number of "layers" in the GNN.
One GNN layer does two things: (1) collect information from all neighbors, (2) combine it with the node's own information using a neural network. Let hv(k) denote the embedding of node v after k layers. Initially (k=0), each node's embedding is just its feature vector xv:
After one layer of aggregation and transformation:
Breaking this down symbol by symbol:
After K layers, hv(K) is the final embedding zv for node v. It has "seen" its K-hop neighborhood.
With K = 1, each node sees its direct neighbors. With K = 2, each node sees its 2-hop neighborhood (neighbors of neighbors). With K = 3, it sees 3 hops. In practice, K = 2 or K = 3 works well for most tasks. Beyond K = 5 or 6, performance often degrades — a phenomenon called over-smoothing, where all nodes' embeddings converge to the same vector because information has spread too far.
Let's make the aggregation process concrete. When we compute node v's embedding with K GNN layers, we're really building a tree — the computation graph rooted at v. This tree records exactly what information flows into v's embedding.
At the bottom of the tree (layer 0) are the raw input features of every node in v's K-hop neighborhood. Each layer aggregates information one step "upward" toward the root, until we reach the root (layer K), which is v's final embedding.
Consider a small graph: A–B–C–D, a simple path. Let's compute node B's embedding with K = 2 layers.
Layer 0: Every node starts with its feature vector: xA, xB, xC, xD.
Layer 1: We need hB(1). B's neighbors are A and C. So we aggregate {hA(0), hC(0)} = {xA, xC}, transform with W1, combine with hB(0) = xB via B1, and activate. Similarly we compute hA(1) (needs xB) and hC(1) (needs xB, xD).
Layer 2: We need hB(2). B's neighbors are A and C. So we aggregate {hA(1), hC(1)}, transform, combine with hB(1). Notice that hA(1) already contains information about xB, and hC(1) contains information about xB and xD. So hB(2) contains information about the entire 2-hop neighborhood of B: {A, B, C, D}.
Unrolling this gives a tree rooted at B. The tree is NOT the graph — it's the computation graph. A node can appear multiple times in the tree (e.g., B appears as both the root and as a leaf in A's branch). But crucially, these different appearances share the SAME weight matrices W1, B1, W2, B2 — shared parameters, different computations.
Click a node to see its K=2 computation graph (the tree of neighbors that determines its embedding). The same weight matrices are shared at every tree node at the same layer.
Now let's look at the most famous concrete GNN: the Graph Convolutional Network (GCN), introduced by Kipf & Welling at ICLR 2017 (arXiv: 1609.02907). It's the simplest possible instantiation of the aggregate-and-transform idea, and it became the foundational model for nearly all graph deep learning that followed.
GCN uses mean aggregation — take the average of all neighbors' embeddings, then apply a linear transformation and nonlinearity:
Where |N(v)| is the number of neighbors of v. The mean normalizes by degree — a node with 100 neighbors doesn't produce a 100× larger signal than a node with 1 neighbor. This normalization is crucial for numerical stability.
Let's be concrete about shapes. Say we have a graph with 6 nodes, each with 3 input features. We use GCN with K=2 layers and hidden dimension d=4.
Input: X ∈ R6 × 3 — feature matrix. Row i = feature vector of node i.
Layer 1: W1 ∈ R3 × 4. After aggregation: H(1) ∈ R6 × 4.
Layer 2: W2 ∈ R4 × 4. After aggregation: H(2) ∈ R6 × 4.
Total parameters: W1 (12) + W2 (16) = 28. Completely independent of |V| = 6.
Compare to shallow embeddings: 6 × 4 = 24 parameters, but scaling to 6 million nodes would need 24 million. GCN still uses 28.
A small graph with 6 nodes, each with a scalar feature (shown as size/color). Watch how features propagate through 2 GCN layers. Press Step to advance one layer. Watch embeddings evolve.
python # GCN layer in PyTorch — exact implementation import torch import torch.nn as nn class GCNLayer(nn.Module): def __init__(self, in_dim, out_dim): super().__init__() # W and B combined into one linear layer (operates on [neighbor_agg || self]) self.linear = nn.Linear(in_dim, out_dim) self.activation = nn.ReLU() def forward(self, H, A_norm): # H: [N, d_in] — node embeddings # A_norm: [N, N] — normalized adjacency with self-loops # A_norm[i,j] = 1/sqrt(deg_i * deg_j) for (i,j) edge aggregated = A_norm @ H # [N, d_in] — weighted mean of neighbors return self.activation(self.linear(aggregated)) # [N, d_out] class GCN(nn.Module): def __init__(self, in_dim, hidden_dim, out_dim, num_classes): super().__init__() self.layer1 = GCNLayer(in_dim, hidden_dim) self.layer2 = GCNLayer(hidden_dim, out_dim) self.classifier = nn.Linear(out_dim, num_classes) def forward(self, X, A_norm): # X: [N, in_dim] — raw node features H1 = self.layer1(X, A_norm) # [N, hidden_dim] H2 = self.layer2(H1, A_norm) # [N, out_dim] return self.classifier(H2) # [N, num_classes]
The node-by-node update rule is conceptually clear, but for actual computation we want a matrix formula that lets us update all nodes simultaneously with efficient linear algebra. Kipf & Welling's paper derives this matrix form carefully.
Start with the adjacency matrix A (N × N). The (i,j) entry is 1 if there's an edge, 0 otherwise. Now add self-loops: Ã = A + IN. This means each node is its own neighbor, so the self-connection Bk term naturally merges into the aggregation.
Let D̃ be the diagonal degree matrix of Ã: D̃ii = ∑j Ãij = deg(i) + 1 (degree plus self-loop). Then the mean aggregation over all nodes simultaneously is:
This is the GCN equation from the paper. Let's unpack each piece:
à = A + I: Adjacency with self-loops. The self-loop is why we don't need a separate Bk term anymore — node v is its own neighbor.
D̃−½ Ã D̃−½: The symmetric normalized adjacency. Entry (i,j) becomes 1/√(deg(i)·deg(j)) for connected nodes. This is a more symmetric normalization than simple row-normalization.
H(k−1) Wk: Linear transformation applied to all node embeddings. Wk ∈ Rdk−1 × dk is the only learned parameter at layer k.
σ: Nonlinearity (ReLU between layers, softmax at the last layer for classification).
Simple row normalization D̃−1à would make row i sum to 1. But this is asymmetric — if node A has 10 neighbors and node B has 2 neighbors, information flowing from A to B gets divided by 10, while information flowing from B to A gets divided by 2. The symmetric normalization D̃−½Ã D̃−½ ensures that BOTH the sender and receiver degree are accounted for. Edge (i,j) gets weight 1/√(di·dj). This is more geometrically natural and produces better empirical results.
The normalized adjacency à = D̃−½Ã D̃−½ depends only on the graph structure, not on the learned parameters. You can compute it ONCE before training and reuse it throughout. This makes each training step extremely fast: just matrix multiplications à H(k−1) Wk.
python # Pre-compute normalized adjacency — done ONCE before training import torch import scipy.sparse as sp import numpy as np def normalize_adjacency(A): # A: scipy sparse adjacency matrix [N, N] N = A.shape[0] A_tilde = A + sp.eye(N) # Ã = A + I (self-loops) D_tilde = np.array(A_tilde.sum(1)).flatten() # D̃ diagonal D_inv_sqrt = sp.diags(D_tilde ** -0.5) # D̃^(-½) A_norm = D_inv_sqrt @ A_tilde @ D_inv_sqrt # D̃^(-½) Ã D̃^(-½) return torch.FloatTensor(A_norm.toarray()) # [N, N] dense tensor # Pre-compute once: A_norm = normalize_adjacency(A) # cached for the entire training run # Each forward pass is just: # H^(k) = relu(A_norm @ H^(k-1) @ W_k)
You have a GNN that maps input features X and graph structure A to node embeddings Z. Now how do you train it? The answer depends on how much labeled data you have.
You have a graph with N nodes. Some nodes have labels (e.g., "spam" or "real user"). The GNN produces an embedding zv for every node. A linear classifier on top predicts the label:
The loss is cross-entropy over all LABELED nodes:
Backprop flows through the classification head, through zv, through all K GNN layers. The weight matrices W1, ..., WK, Wout are all updated.
The Kipf & Welling paper's key setup: only a TINY fraction of nodes are labeled (e.g., 20 labeled nodes out of 2708 total in the Cora citation graph — less than 1%). The GNN still runs on ALL nodes and uses the ENTIRE graph structure. Even unlabeled nodes influence the embeddings of labeled nodes through the message passing. The model learns to propagate label information through the graph structure.
No labels at all? Use the graph structure itself as the training signal. The DeepWalk/node2vec objective from Lecture 2 works perfectly:
Nodes that co-occur on random walks should have high dot product similarity. The GNN is used as the encoder instead of a lookup table — same objective, better encoder.
GNNs aren't limited to node classification. For link prediction, compute zu and zv and predict whether edge (u,v) should exist (dot product or MLP). For graph classification, aggregate all node embeddings into a single graph embedding (mean/sum/max pooling over nodes), then classify the whole graph.
Here's a fact that makes the universe feel more unified: a CNN is just a GNN on a grid graph. If you don't believe me, let's build the connection.
Take a 32×32 image — 1024 pixels. Define a graph where each pixel is a node, and two pixels are connected if they are adjacent (up, down, left, right, or diagonally). You get a grid graph. Each node has a feature vector equal to the pixel's RGB values.
Now apply a GNN with mean aggregation on this grid graph. What does one layer do?
For pixel (i,j), the aggregation collects features from the 8 neighboring pixels and averages them, then applies W. This is EXACTLY a 3×3 convolution with uniform weights. The convolutional kernel IS the aggregation weight matrix W.
If graphs generalize CNNs, why don't we just run GNNs on images? Practical reasons:
But for data that is ACTUALLY a graph — molecules, social networks, citation graphs, protein interaction networks — there is no grid. The GNN's generality is not optional; it's required.
A 5x5 pixel grid with one node highlighted (orange). Its 8 connected neighbors are shown in teal. In a standard 3x3 conv, EXACTLY these neighbors' values are aggregated. Toggle "Graph View" to see the grid as an adjacency graph.
We said CNNs are GNNs on grid graphs. Now let's push further: Transformers are GNNs on fully connected graphs.
In a Transformer, self-attention computes, for each token i, a weighted sum of all other tokens' values:
Compare to GNN message passing at one layer:
The structure is identical: each node aggregates weighted messages from its neighbors. In a Transformer, the "graph" is the fully connected graph — every token is a neighbor of every other token. The edge weights are the attention scores αij, which are computed dynamically from the content of the tokens (unlike fixed GCN weights W).
Why not just use Transformers for all graph data? Two reasons:
Quadratic cost: For a graph with N nodes, a Transformer attention layer requires O(N2) operations — computing a score for every pair. For a citation graph with 2 million papers, that's 4 trillion operations per layer. GNNs on the actual sparse graph (say, average degree 10) require O(10N) = O(N) operations per layer — 400,000x cheaper.
Structural inductive bias: Real graphs are sparse because there's a reason edges exist — chemical bonds, friend connections, citation links. These edges encode domain knowledge. Giving the model ALL possible edges (Transformer) dilutes this signal with noise from pairs that have no meaningful relationship.
The design choice is clear: use GNNs for data with a meaningful sparse graph structure; use Transformers when you want the model to learn which relationships matter (at quadratic cost).
| Model | Graph | Edge Weights | Cost |
|---|---|---|---|
| CNN | Grid (fixed) | Kernel (static) | O(N) |
| GNN | Sparse (given) | Uniform / learned | O(E) |
| Transformer | Complete | Attention (dynamic) | O(N²) |
We've now seen enough to close the loop on Chapter 0's problem: why do shallow embeddings fail for new nodes, and how do GNNs fix it?
A lookup-table embedding is transductive: it can only make predictions for nodes that were present during training. The embedding for node v is the v-th row of the weight matrix Z. If v wasn't in the training graph, there is no v-th row.
Think of it like a phone book: you look up "Alice" and get her number. But a phone book can't tell you the number for someone who moved to town after it was printed.
A GNN is inductive: it can compute embeddings for nodes it has never seen before. Here's why:
The GNN's parameters are the weight matrices W1, ..., WK. These are NOT node-specific — they're shared functions. To embed a new node v, you simply RUN the GNN: collect v's features, collect v's neighbors' features, apply the message-passing layers. The GNN has never seen v before, but it can still compute zv using the same learned aggregation function it applied to training nodes.
Inductivity means GNNs can even generalize across DIFFERENT graphs. Train a GNN for drug discovery on one set of molecules. Test it on a completely new molecule. The same aggregation weights apply — they've learned general chemistry patterns, not molecule-specific memories.
This is the critical distinction for deployed systems. In a real social network, thousands of new users sign up every hour. A transductive method is useless — you can't retrain from scratch every hour. An inductive GNN embeds new users in milliseconds by running the forward pass.
| Property | Shallow (Transductive) | GNN (Inductive) |
|---|---|---|
| New nodes | Fail — no entry | Works — run forward pass |
| New graphs | Fail — retrain needed | Works — same weights |
| Uses features | No | Yes (natural) |
| Parameters | O(|V|·d) | O(K·d²) — fixed! |
| Example | DeepWalk, node2vec | GCN, GraphSAGE, GAT |
Let's see where GNNs fit in the broader landscape — what they build on, what they unlock, and where to go next.
We started from the failure modes of shallow embeddings (Chapter 0) and derived the GNN update rule (Chapter 1) from first principles. We saw how computation graphs (Chapter 2) make the aggregation concrete, instantiated it as GCN (Chapter 3), derived the matrix form (Chapter 4), learned how to train it (Chapter 5), and connected it to CNNs (Chapter 6) and Transformers (Chapter 7). Chapter 8 showed why GNNs can generalize while lookup tables cannot.
The foundation is Kipf & Welling's GCN (ICLR 2017, arXiv:1609.02907). The full derivation — from spectral convolutions through Chebyshev approximation to the final layer rule — is covered in the Veanors paper lesson. Reading the paper after this lesson will feel completely natural.
Lecture 1 covered the graph ML landscape and why graphs matter. Lecture 2 built shallow embeddings (DeepWalk, node2vec). This lecture replaced the lookup-table encoder with a neural network. Next:
| Limitation | Fixed By |
|---|---|
| Mean aggregation loses information about neighbor distribution | GraphSAGE (concat), GAT (attention), GIN (sum — maximally expressive) |
| Over-smoothing with many layers (embeddings converge) | Skip connections, DropEdge, JK-networks |
| Can't distinguish certain graph structures (1-WL test bound) | Higher-order GNNs, subgraph GNNs |
| Slow on very large graphs (full-batch message passing) | Mini-batch sampling (GraphSAGE), cluster-GCN, neighbor sampling |
| Directed / heterogeneous graphs need special handling | R-GCN (relation-type-specific weights), HAN (heterogeneous attention) |
The unsupervised training signal (random walk co-occurrence) comes from Lecture 2: Node Embeddings. The Transformer connection explored in Chapter 7 ties back to the Attention & Transformers lesson. For molecular GNNs specifically, see how this connects to drug discovery via the GCN paper walkthrough.
"The key insight of deep learning is that you can learn representations. GNNs extend this to graphs: instead of hand-crafting graph features, let the network learn to aggregate neighborhood information in a task-optimal way."
— Jure Leskovec, CS224W