CS224W Lecture 3

Graph Neural Networks

How to learn node embeddings by aggregating neighbor information through neural network layers — and why this is more powerful than any lookup table.

Prerequisites: Graph basics (Lec 1) + Node embeddings (Lec 2) + Basic neural networks. That's it.
10
Chapters
4+
Simulations
0
Assumed Knowledge

Chapter 0: Why Shallow Embeddings Fail

In Lecture 2 you learned to embed nodes using DeepWalk and node2vec. The idea was beautiful: assign every node a learnable vector, train those vectors so similar nodes end up nearby in embedding space. But there's a catch — one that becomes obvious the moment you try to use these embeddings in the real world.

Suppose you train node2vec on a social network with 1 million users. You get a 128-dimensional vector for each user — 128 million numbers total. Now a new user signs up. What's their embedding? There is none. The lookup table has no row for a node it has never seen. You'd have to retrain the entire model from scratch.

The fundamental problem with lookup tables: Shallow embeddings are transductive — they can only embed nodes that existed during training. They cannot generalize to new nodes, new graphs, or new domains. And they don't use node features at all. Every node's embedding is learned from scratch, independent of what that node actually represents.

The Three Failures

Let's be precise about what goes wrong with shallow (lookup-table) embeddings:

1. O(|V| · d) parameters. Every node gets its own d-dimensional row in the embedding matrix Z ∈ R|V| × d. For a graph with 100 million nodes and 128 dimensions: 12.8 billion parameters. Just for the embeddings. Before any prediction head.

2. No parameter sharing. Node A's embedding is unrelated to node B's embedding, even if they play structurally identical roles. Two users who both have 10 friends in a tight clique should get similar embeddings — but a lookup table learns each independently, requiring enormous amounts of data to discover this pattern.

3. Ignores node features. Graphs in the real world come with rich features: protein graphs have amino acid properties, social graphs have user profiles, citation graphs have paper abstracts. A lookup table throws all of this away. Its embedding for node v is the same whether v represents a biologist named Alice or a retired plumber named Bob.

4. Transductive — can't generalize. The trained embedding only exists for training nodes. New nodes, new graphs, different graphs with the same structure — all require retraining.

What We Actually Need

We need an encoder function that takes a node and its graph context as INPUT and computes an embedding as OUTPUT. Not a lookup table that memorizes. A function that computes.

ENC(v, G) → zv

This encoder should look at the node's features AND its position in the graph structure — its neighbors, its neighbors' neighbors, and so on. If you train this encoder on one graph, it should be able to compute embeddings for nodes it's never seen before, because it reasons from features and structure, not from memorized lookup indices.

That's exactly what a Graph Neural Network does. The rest of this lesson builds it up piece by piece.

The Shallow Embedding Problem

A lookup table has no embedding for new nodes. Click "Add New Node" to see what happens when a node appears that wasn't in training.

Lookup table: 5 nodes, 5 embedding vectors learned. All good.
What is the key limitation that makes shallow embeddings "transductive"?

Chapter 1: Aggregate & Transform

Here is the key idea of GNNs. Say you want to embed a node v. Instead of looking it up in a table, you compute its embedding by looking at who its neighbors are and what those neighbors look like.

But then how do you embed v's neighbors? By looking at THEIR neighbors. And v's neighbors' neighbors? By looking at THEIR neighbors. You recurse K times, where K is the number of "layers" in the GNN.

The core intuition: A node's embedding should capture its role in the local graph structure. A node surrounded by high-degree hubs should embed differently from a node surrounded by leaves. To know the neighborhood, you must aggregate information from neighbors.

One Layer of a GNN

One GNN layer does two things: (1) collect information from all neighbors, (2) combine it with the node's own information using a neural network. Let hv(k) denote the embedding of node v after k layers. Initially (k=0), each node's embedding is just its feature vector xv:

hv(0) = xv

After one layer of aggregation and transformation:

hv(k) = σ( Wk · AGGREGATE( {hu(k-1) : u ∈ N(v)} ) + Bk · hv(k-1) )

Breaking this down symbol by symbol:

After K layers, hv(K) is the final embedding zv for node v. It has "seen" its K-hop neighborhood.

The critical insight about parameter sharing: The weight matrices Wk and Bk are the same for all nodes at layer k. Node A and node B use the same W1 to aggregate their respective neighbors. This is like how a CNN uses the same filter kernel at every spatial location — the parameters are shared across the graph, regardless of graph size. This is WHY GNNs scale and WHY they can generalize to new nodes.

How Many Layers?

With K = 1, each node sees its direct neighbors. With K = 2, each node sees its 2-hop neighborhood (neighbors of neighbors). With K = 3, it sees 3 hops. In practice, K = 2 or K = 3 works well for most tasks. Beyond K = 5 or 6, performance often degrades — a phenomenon called over-smoothing, where all nodes' embeddings converge to the same vector because information has spread too far.

Layer 0 (Input)
hv(0) = xv — raw feature vector for each node. Shape: [|V|, din]
↓ aggregate neighbors + transform
Layer 1
hv(1) encodes 1-hop neighborhood. Each node "knows about" its immediate neighbors. Shape: [|V|, d]
↓ aggregate neighbors + transform
Layer 2
hv(2) encodes 2-hop neighborhood. Each node "knows about" its neighbors' neighbors. Shape: [|V|, d]
↓ (K times total)
Output
zv = hv(K) — final embedding, ready for downstream tasks.
In the GNN update equation hv(k) = σ(Wk · AGGREGATE(...) + Bk · hv(k-1)), what do Wk and Bk represent, and how are they applied across nodes?

Chapter 2: Computation Graphs

Let's make the aggregation process concrete. When we compute node v's embedding with K GNN layers, we're really building a tree — the computation graph rooted at v. This tree records exactly what information flows into v's embedding.

At the bottom of the tree (layer 0) are the raw input features of every node in v's K-hop neighborhood. Each layer aggregates information one step "upward" toward the root, until we reach the root (layer K), which is v's final embedding.

A computation graph is different for every node. Node A may have 3 neighbors, so its layer-1 computation graph has 3 branches. Node B may have 7 neighbors, so its layer-1 computation graph has 7 branches. Same model — but different computational shapes depending on the neighborhood topology. This is a key difference from standard neural networks, which have fixed, uniform computation graphs for every input.

An Example

Consider a small graph: A–B–C–D, a simple path. Let's compute node B's embedding with K = 2 layers.

Layer 0: Every node starts with its feature vector: xA, xB, xC, xD.

Layer 1: We need hB(1). B's neighbors are A and C. So we aggregate {hA(0), hC(0)} = {xA, xC}, transform with W1, combine with hB(0) = xB via B1, and activate. Similarly we compute hA(1) (needs xB) and hC(1) (needs xB, xD).

Layer 2: We need hB(2). B's neighbors are A and C. So we aggregate {hA(1), hC(1)}, transform, combine with hB(1). Notice that hA(1) already contains information about xB, and hC(1) contains information about xB and xD. So hB(2) contains information about the entire 2-hop neighborhood of B: {A, B, C, D}.

The Tree Structure

Unrolling this gives a tree rooted at B. The tree is NOT the graph — it's the computation graph. A node can appear multiple times in the tree (e.g., B appears as both the root and as a leaf in A's branch). But crucially, these different appearances share the SAME weight matrices W1, B1, W2, B2 — shared parameters, different computations.

Computation Graph Explorer

Click a node to see its K=2 computation graph (the tree of neighbors that determines its embedding). The same weight matrices are shared at every tree node at the same layer.

Click a node to expand its K=2 computation graph.
Why is node v's computation graph a TREE and not the original graph?

Chapter 3: GCN — The Simplest GNN

Now let's look at the most famous concrete GNN: the Graph Convolutional Network (GCN), introduced by Kipf & Welling at ICLR 2017 (arXiv: 1609.02907). It's the simplest possible instantiation of the aggregate-and-transform idea, and it became the foundational model for nearly all graph deep learning that followed.

The GCN Update Rule

GCN uses mean aggregation — take the average of all neighbors' embeddings, then apply a linear transformation and nonlinearity:

hv(k) = σ( Wk · 1|N(v)|u ∈ N(v) hu(k-1) + Bk · hv(k-1) )

Where |N(v)| is the number of neighbors of v. The mean normalizes by degree — a node with 100 neighbors doesn't produce a 100× larger signal than a node with 1 neighbor. This normalization is crucial for numerical stability.

Often simplified by folding the self-loop in: The most common GCN formulation adds a self-loop to the graph (Ã = A + I) and combines Wk and Bk into a single matrix. Then the update is just: hv(k) = σ(Wk · mean({hu(k-1) : u ∈ Ñ(v)})) where Ñ(v) = N(v) ∪ {v}. Simpler and often slightly better.

Data Flow: What Goes In, What Comes Out

Let's be concrete about shapes. Say we have a graph with 6 nodes, each with 3 input features. We use GCN with K=2 layers and hidden dimension d=4.

Input: X ∈ R6 × 3 — feature matrix. Row i = feature vector of node i.

Layer 1: W1 ∈ R3 × 4. After aggregation: H(1) ∈ R6 × 4.

Layer 2: W2 ∈ R4 × 4. After aggregation: H(2) ∈ R6 × 4.

Total parameters: W1 (12) + W2 (16) = 28. Completely independent of |V| = 6.

Compare to shallow embeddings: 6 × 4 = 24 parameters, but scaling to 6 million nodes would need 24 million. GCN still uses 28.

The Three Steps of One GCN Layer

1. SEND — Neighbors Broadcast
Each neighbor u sends its current embedding hu(k-1) to node v.
2. AGGREGATE — Mean Pool
Node v averages all received messages: mv = (1/|N(v)|) ∑u ∈ N(v) hu(k-1).
3. UPDATE — Neural Network
hv(k) = ReLU(Wk · mv + Bk · hv(k-1)).
GCN Message Passing — Step by Step

A small graph with 6 nodes, each with a scalar feature (shown as size/color). Watch how features propagate through 2 GCN layers. Press Step to advance one layer. Watch embeddings evolve.

Layer 0: Raw input features. Each node shows its scalar feature value.
python
# GCN layer in PyTorch — exact implementation
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # W and B combined into one linear layer (operates on [neighbor_agg || self])
        self.linear = nn.Linear(in_dim, out_dim)
        self.activation = nn.ReLU()

    def forward(self, H, A_norm):
        # H: [N, d_in] — node embeddings
        # A_norm: [N, N] — normalized adjacency with self-loops
        # A_norm[i,j] = 1/sqrt(deg_i * deg_j) for (i,j) edge
        aggregated = A_norm @ H        # [N, d_in] — weighted mean of neighbors
        return self.activation(self.linear(aggregated))  # [N, d_out]

class GCN(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, num_classes):
        super().__init__()
        self.layer1 = GCNLayer(in_dim, hidden_dim)
        self.layer2 = GCNLayer(hidden_dim, out_dim)
        self.classifier = nn.Linear(out_dim, num_classes)

    def forward(self, X, A_norm):
        # X: [N, in_dim] — raw node features
        H1 = self.layer1(X, A_norm)    # [N, hidden_dim]
        H2 = self.layer2(H1, A_norm)   # [N, out_dim]
        return self.classifier(H2)     # [N, num_classes]
In GCN, why does the mean aggregation divide by |N(v)| (the number of neighbors)?

Chapter 4: Matrix Form of GCN

The node-by-node update rule is conceptually clear, but for actual computation we want a matrix formula that lets us update all nodes simultaneously with efficient linear algebra. Kipf & Welling's paper derives this matrix form carefully.

Building Up to the Formula

Start with the adjacency matrix A (N × N). The (i,j) entry is 1 if there's an edge, 0 otherwise. Now add self-loops: Ã = A + IN. This means each node is its own neighbor, so the self-connection Bk term naturally merges into the aggregation.

Let D̃ be the diagonal degree matrix of Ã: D̃ii = ∑j Ãij = deg(i) + 1 (degree plus self-loop). Then the mean aggregation over all nodes simultaneously is:

H(k) = σ( D̃−½ Ã D̃−½ H(k−1) Wk )

This is the GCN equation from the paper. Let's unpack each piece:

à = A + I: Adjacency with self-loops. The self-loop is why we don't need a separate Bk term anymore — node v is its own neighbor.

−½ Ã D̃−½: The symmetric normalized adjacency. Entry (i,j) becomes 1/√(deg(i)·deg(j)) for connected nodes. This is a more symmetric normalization than simple row-normalization.

H(k−1) Wk: Linear transformation applied to all node embeddings. Wk ∈ Rdk−1 × dk is the only learned parameter at layer k.

σ: Nonlinearity (ReLU between layers, softmax at the last layer for classification).

Why Symmetric Normalization?

Simple row normalization D̃−1à would make row i sum to 1. But this is asymmetric — if node A has 10 neighbors and node B has 2 neighbors, information flowing from A to B gets divided by 10, while information flowing from B to A gets divided by 2. The symmetric normalization D̃−½Ã D̃−½ ensures that BOTH the sender and receiver degree are accounted for. Edge (i,j) gets weight 1/√(di·dj). This is more geometrically natural and produces better empirical results.

Connection to spectral graph theory:−½Ã D̃−½ is directly related to the normalized graph Laplacian L = I − D−½AD−½. The paper motivates GCN from a spectral perspective — it's a first-order approximation of spectral graph convolutions using Chebyshev polynomials. But you don't need spectral theory to USE GCN. The message passing view is sufficient. The spectral view just explains WHY this particular normalization is theoretically justified.

Pre-computing the Propagation Matrix

The normalized adjacency à = D̃−½Ã D̃−½ depends only on the graph structure, not on the learned parameters. You can compute it ONCE before training and reuse it throughout. This makes each training step extremely fast: just matrix multiplications à H(k−1) Wk.

python
# Pre-compute normalized adjacency — done ONCE before training
import torch
import scipy.sparse as sp
import numpy as np

def normalize_adjacency(A):
    # A: scipy sparse adjacency matrix [N, N]
    N = A.shape[0]
    A_tilde = A + sp.eye(N)            # Ã = A + I (self-loops)
    D_tilde = np.array(A_tilde.sum(1)).flatten()  # D̃ diagonal
    D_inv_sqrt = sp.diags(D_tilde ** -0.5)       # D̃^(-½)
    A_norm = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # D̃^(-½) Ã D̃^(-½)
    return torch.FloatTensor(A_norm.toarray())    # [N, N] dense tensor

# Pre-compute once:
A_norm = normalize_adjacency(A)  # cached for the entire training run

# Each forward pass is just:
# H^(k) = relu(A_norm @ H^(k-1) @ W_k)
What does the symmetric normalization D̃Ã D̃ achieve compared to simple row-normalization D̃-1Ã?

Chapter 5: Training a GNN

You have a GNN that maps input features X and graph structure A to node embeddings Z. Now how do you train it? The answer depends on how much labeled data you have.

Supervised Node Classification

You have a graph with N nodes. Some nodes have labels (e.g., "spam" or "real user"). The GNN produces an embedding zv for every node. A linear classifier on top predicts the label:

ŷv = softmax(Wout · zv)

The loss is cross-entropy over all LABELED nodes:

L = − ∑v ∈ Vlabeled yv · log(ŷv)

Backprop flows through the classification head, through zv, through all K GNN layers. The weight matrices W1, ..., WK, Wout are all updated.

Semi-Supervised (The GCN Paper's Setting)

The Kipf & Welling paper's key setup: only a TINY fraction of nodes are labeled (e.g., 20 labeled nodes out of 2708 total in the Cora citation graph — less than 1%). The GNN still runs on ALL nodes and uses the ENTIRE graph structure. Even unlabeled nodes influence the embeddings of labeled nodes through the message passing. The model learns to propagate label information through the graph structure.

Why semi-supervised works so well here: The GNN's message passing IS a form of label propagation. If node A is labeled "machine learning paper" and A's neighbor B is unlabeled, then after one GCN layer, B's embedding contains information about A. The classifier can use this to correctly classify B, even though B has no label. The graph structure acts as "free supervision" — it tells the model which nodes are likely to have the same label (connected ones).

Unsupervised Training

No labels at all? Use the graph structure itself as the training signal. The DeepWalk/node2vec objective from Lecture 2 works perfectly:

L = − ∑(u,v) ∈ walks log( σ( zuT zv ) ) + ∑vn ∈ neg log( σ( −zuT zvn ) )

Nodes that co-occur on random walks should have high dot product similarity. The GNN is used as the encoder instead of a lookup table — same objective, better encoder.

Link Prediction and Graph Classification

GNNs aren't limited to node classification. For link prediction, compute zu and zv and predict whether edge (u,v) should exist (dot product or MLP). For graph classification, aggregate all node embeddings into a single graph embedding (mean/sum/max pooling over nodes), then classify the whole graph.

Forward Pass
X, A → GNN layers (K × aggregate + transform) → Z ∈ RN × d
Task Head
zv → softmax → ŷv (node classification) OR zuTzv (link prediction) OR mean(Z) → ŷ (graph classification)
Loss
Cross-entropy / MSE / contrastive — computed only on labeled nodes/edges
Backprop
Gradients flow through the task head and ALL K GNN layers, updating W1, ..., WK
In the semi-supervised GCN setting, how does the model classify unlabeled nodes when computing loss only on labeled nodes?

Chapter 6: GNNs Generalize CNNs

Here's a fact that makes the universe feel more unified: a CNN is just a GNN on a grid graph. If you don't believe me, let's build the connection.

Pixels Are Nodes; Adjacency Is a Grid

Take a 32×32 image — 1024 pixels. Define a graph where each pixel is a node, and two pixels are connected if they are adjacent (up, down, left, right, or diagonally). You get a grid graph. Each node has a feature vector equal to the pixel's RGB values.

Now apply a GNN with mean aggregation on this grid graph. What does one layer do?

For pixel (i,j), the aggregation collects features from the 8 neighboring pixels and averages them, then applies W. This is EXACTLY a 3×3 convolution with uniform weights. The convolutional kernel IS the aggregation weight matrix W.

CNNs are GNNs with a special inductive bias: CNNs assume the graph is a REGULAR GRID (every node has the same number of neighbors, arranged in a fixed spatial pattern). This lets us represent the aggregation weights as a small kernel that tiles across the image. GNNs generalize this to ARBITRARY graphs: irregular topology, variable degrees, no spatial structure required. The kernel weights (W matrix in GNN) are still shared across all nodes — just like CNN filters are shared across all pixel locations.

Why Image Data Doesn't Need a Full GNN

If graphs generalize CNNs, why don't we just run GNNs on images? Practical reasons:

But for data that is ACTUALLY a graph — molecules, social networks, citation graphs, protein interaction networks — there is no grid. The GNN's generality is not optional; it's required.

CNN as GNN on a Grid Graph

A 5x5 pixel grid with one node highlighted (orange). Its 8 connected neighbors are shown in teal. In a standard 3x3 conv, EXACTLY these neighbors' values are aggregated. Toggle "Graph View" to see the grid as an adjacency graph.

Grid view. Orange = selected node. Teal = its neighbors (the conv kernel footprint).
What is the key difference between a CNN and a GNN, and why does this matter for molecular or social network data?

Chapter 7: Transformers as GNNs

We said CNNs are GNNs on grid graphs. Now let's push further: Transformers are GNNs on fully connected graphs.

Self-Attention Is Message Passing

In a Transformer, self-attention computes, for each token i, a weighted sum of all other tokens' values:

hi = ∑j αij Vj
αij = softmax( QiT Kj / √dk )

Compare to GNN message passing at one layer:

hv(k) = AGGREGATE( { wuv · hu(k−1) : u ∈ N(v) } )

The structure is identical: each node aggregates weighted messages from its neighbors. In a Transformer, the "graph" is the fully connected graph — every token is a neighbor of every other token. The edge weights are the attention scores αij, which are computed dynamically from the content of the tokens (unlike fixed GCN weights W).

Three graph structures, three models:
Grid graph (each node has 8 spatial neighbors, fixed weights) = CNN
Sparse graph (each node has a few neighbors, fixed structure) = GNN/GCN
Fully connected graph (each node is a neighbor of all others, dynamic weights) = Transformer

The Efficiency-Expressivity Tradeoff

Why not just use Transformers for all graph data? Two reasons:

Quadratic cost: For a graph with N nodes, a Transformer attention layer requires O(N2) operations — computing a score for every pair. For a citation graph with 2 million papers, that's 4 trillion operations per layer. GNNs on the actual sparse graph (say, average degree 10) require O(10N) = O(N) operations per layer — 400,000x cheaper.

Structural inductive bias: Real graphs are sparse because there's a reason edges exist — chemical bonds, friend connections, citation links. These edges encode domain knowledge. Giving the model ALL possible edges (Transformer) dilutes this signal with noise from pairs that have no meaningful relationship.

The design choice is clear: use GNNs for data with a meaningful sparse graph structure; use Transformers when you want the model to learn which relationships matter (at quadratic cost).

ModelGraphEdge WeightsCost
CNNGrid (fixed)Kernel (static)O(N)
GNNSparse (given)Uniform / learnedO(E)
TransformerCompleteAttention (dynamic)O(N²)
Modern synthesis: Recent work (Graphormer, Graph-Transformers) uses Transformers on graphs but incorporates graph structure as positional encodings or attention biases. Sparse GNNs on subgraphs. This combines the expressivity of Transformers with the efficiency and inductive biases of GNNs.
A standard Transformer self-attention layer operates on which type of graph, and why are GNNs often preferred for social network or molecular data despite Transformers being more expressive?

Chapter 8: Inductive vs Transductive

We've now seen enough to close the loop on Chapter 0's problem: why do shallow embeddings fail for new nodes, and how do GNNs fix it?

Transductive Learning (Shallow Embeddings)

A lookup-table embedding is transductive: it can only make predictions for nodes that were present during training. The embedding for node v is the v-th row of the weight matrix Z. If v wasn't in the training graph, there is no v-th row.

Think of it like a phone book: you look up "Alice" and get her number. But a phone book can't tell you the number for someone who moved to town after it was printed.

Inductive Learning (GNNs)

A GNN is inductive: it can compute embeddings for nodes it has never seen before. Here's why:

The GNN's parameters are the weight matrices W1, ..., WK. These are NOT node-specific — they're shared functions. To embed a new node v, you simply RUN the GNN: collect v's features, collect v's neighbors' features, apply the message-passing layers. The GNN has never seen v before, but it can still compute zv using the same learned aggregation function it applied to training nodes.

Think of it like a recipe: A lookup table is like memorizing the exact height of every person you've ever met. A GNN is like having a formula: "measure leg length + torso length + head height." You can measure a new person you've never seen before using the SAME formula. The formula (GNN weights) generalizes; the memorized heights (embedding vectors) do not.

Generalizing Across Graphs

Inductivity means GNNs can even generalize across DIFFERENT graphs. Train a GNN for drug discovery on one set of molecules. Test it on a completely new molecule. The same aggregation weights apply — they've learned general chemistry patterns, not molecule-specific memories.

This is the critical distinction for deployed systems. In a real social network, thousands of new users sign up every hour. A transductive method is useless — you can't retrain from scratch every hour. An inductive GNN embeds new users in milliseconds by running the forward pass.

PropertyShallow (Transductive)GNN (Inductive)
New nodesFail — no entryWorks — run forward pass
New graphsFail — retrain neededWorks — same weights
Uses featuresNoYes (natural)
ParametersO(|V|·d)O(K·d²) — fixed!
ExampleDeepWalk, node2vecGCN, GraphSAGE, GAT
When would you still use shallow embeddings? If your graph never changes and new nodes never appear, shallow embeddings are simpler to implement and sometimes perform on par with GNNs (especially on homogeneous graphs without useful node features). In practice, GNNs are almost always preferred for production systems because of their inductivity and feature-awareness.
A drug discovery company trains a GNN on 10,000 known molecules. A new molecule (never seen during training) is synthesized. Can the GNN embed it? Why?

Chapter 9: Connections

Let's see where GNNs fit in the broader landscape — what they build on, what they unlock, and where to go next.

What We Built This Lesson

We started from the failure modes of shallow embeddings (Chapter 0) and derived the GNN update rule (Chapter 1) from first principles. We saw how computation graphs (Chapter 2) make the aggregation concrete, instantiated it as GCN (Chapter 3), derived the matrix form (Chapter 4), learned how to train it (Chapter 5), and connected it to CNNs (Chapter 6) and Transformers (Chapter 7). Chapter 8 showed why GNNs can generalize while lookup tables cannot.

The GCN Paper

The foundation is Kipf & Welling's GCN (ICLR 2017, arXiv:1609.02907). The full derivation — from spectral convolutions through Chebyshev approximation to the final layer rule — is covered in the Veanors paper lesson. Reading the paper after this lesson will feel completely natural.

CS224W Series

Lecture 1 covered the graph ML landscape and why graphs matter. Lecture 2 built shallow embeddings (DeepWalk, node2vec). This lecture replaced the lookup-table encoder with a neural network. Next:

Limitations of What We've Covered

LimitationFixed By
Mean aggregation loses information about neighbor distributionGraphSAGE (concat), GAT (attention), GIN (sum — maximally expressive)
Over-smoothing with many layers (embeddings converge)Skip connections, DropEdge, JK-networks
Can't distinguish certain graph structures (1-WL test bound)Higher-order GNNs, subgraph GNNs
Slow on very large graphs (full-batch message passing)Mini-batch sampling (GraphSAGE), cluster-GCN, neighbor sampling
Directed / heterogeneous graphs need special handlingR-GCN (relation-type-specific weights), HAN (heterogeneous attention)

Related Lessons

The unsupervised training signal (random walk co-occurrence) comes from Lecture 2: Node Embeddings. The Transformer connection explored in Chapter 7 ties back to the Attention & Transformers lesson. For molecular GNNs specifically, see how this connects to drug discovery via the GCN paper walkthrough.

"The key insight of deep learning is that you can learn representations. GNNs extend this to graphs: instead of hand-crafting graph features, let the network learn to aggregate neighborhood information in a task-optimal way."
— Jure Leskovec, CS224W