CS224W Lecture 3 — Graph Neural Networks 1

Chapter 0: Why Shallow Embeddings Fail

In Lecture 2 you learned to embed nodes using DeepWalk and node2vec. The idea was beautiful: assign every node a learnable vector, train those vectors so similar nodes end up nearby in embedding space. But there's a catch — one that becomes obvious the moment you try to use these embeddings in the real world.

Suppose you train node2vec on a social network with 1 million users. You get a 128-dimensional vector for each user — 128 million numbers total. Now a new user signs up. What's their embedding? There is none. The lookup table has no row for a node it has never seen. You'd have to retrain the entire model from scratch.

The fundamental problem with lookup tables: Shallow embeddings are transductive — they can only embed nodes that existed during training. They cannot generalize to new nodes, new graphs, or new domains. And they don't use node features at all. Every node's embedding is learned from scratch, independent of what that node actually represents.

The Three Failures

Let's be precise about what goes wrong with shallow (lookup-table) embeddings:

1. O(|V| · d) parameters. Every node gets its own d-dimensional row in the embedding matrix Z ∈ R^{|V| × d}. For a graph with 100 million nodes and 128 dimensions: 12.8 billion parameters. Just for the embeddings. Before any prediction head.

2. No parameter sharing. Node A's embedding is unrelated to node B's embedding, even if they play structurally identical roles. Two users who both have 10 friends in a tight clique should get similar embeddings — but a lookup table learns each independently, requiring enormous amounts of data to discover this pattern.

3. Ignores node features. Graphs in the real world come with rich features: protein graphs have amino acid properties, social graphs have user profiles, citation graphs have paper abstracts. A lookup table throws all of this away. Its embedding for node v is the same whether v represents a biologist named Alice or a retired plumber named Bob.

4. Transductive — can't generalize. The trained embedding only exists for training nodes. New nodes, new graphs, different graphs with the same structure — all require retraining.

What We Actually Need

We need an encoder function that takes a node and its graph context as INPUT and computes an embedding as OUTPUT. Not a lookup table that memorizes. A function that computes.

ENC(v, G) → z_v

This encoder should look at the node's features AND its position in the graph structure — its neighbors, its neighbors' neighbors, and so on. If you train this encoder on one graph, it should be able to compute embeddings for nodes it's never seen before, because it reasons from features and structure, not from memorized lookup indices.

That's exactly what a Graph Neural Network does. The rest of this lesson builds it up piece by piece.

The Shallow Embedding Problem

A lookup table has no embedding for new nodes. Click "Add New Node" to see what happens when a node appears that wasn't in training.

Lookup table: 5 nodes, 5 embedding vectors learned. All good.

What is the key limitation that makes shallow embeddings "transductive"?

They require too much GPU memory They can only embed nodes seen during training — a lookup table has no entry for new nodes, and retraining the entire model from scratch is required to add even one new node They can't capture multi-hop neighborhood information

Chapter 1: Aggregate & Transform

Here is the key idea of GNNs. Say you want to embed a node v. Instead of looking it up in a table, you compute its embedding by looking at who its neighbors are and what those neighbors look like.

But then how do you embed v's neighbors? By looking at THEIR neighbors. And v's neighbors' neighbors? By looking at THEIR neighbors. You recurse K times, where K is the number of "layers" in the GNN.

The core intuition: A node's embedding should capture its role in the local graph structure. A node surrounded by high-degree hubs should embed differently from a node surrounded by leaves. To know the neighborhood, you must aggregate information from neighbors.

One Layer of a GNN

One GNN layer does two things: (1) collect information from all neighbors, (2) combine it with the node's own information using a neural network. Let h_v^(k) denote the embedding of node v after k layers. Initially (k=0), each node's embedding is just its feature vector x_v:

h_v⁽⁰⁾ = x_v

After one layer of aggregation and transformation:

h_v^(k) = σ( W_k · AGGREGATE( {h_u^(k-1) : u ∈ N(v)} ) + B_k · h_v^(k-1) )

Breaking this down symbol by symbol:

N(v): the set of neighbors of v
AGGREGATE(...): some function that collapses a set of vectors into one vector (e.g., mean, sum, max)
W_k: weight matrix applied to the aggregated neighbor information
B_k: weight matrix applied to v's OWN previous embedding (self-loop)
σ: nonlinearity (ReLU, tanh...)

After K layers, h_v^(K) is the final embedding z_v for node v. It has "seen" its K-hop neighborhood.

The critical insight about parameter sharing: The weight matrices W_k and B_k are the same for all nodes at layer k. Node A and node B use the same W₁ to aggregate their respective neighbors. This is like how a CNN uses the same filter kernel at every spatial location — the parameters are shared across the graph, regardless of graph size. This is WHY GNNs scale and WHY they can generalize to new nodes.

How Many Layers?

With K = 1, each node sees its direct neighbors. With K = 2, each node sees its 2-hop neighborhood (neighbors of neighbors). With K = 3, it sees 3 hops. In practice, K = 2 or K = 3 works well for most tasks. Beyond K = 5 or 6, performance often degrades — a phenomenon called over-smoothing, where all nodes' embeddings converge to the same vector because information has spread too far.

Layer 0 (Input)

h_v⁽⁰⁾ = x_v — raw feature vector for each node. Shape: [|V|, d_in]

↓ aggregate neighbors + transform

Layer 1

h_v⁽¹⁾ encodes 1-hop neighborhood. Each node "knows about" its immediate neighbors. Shape: [|V|, d]

↓ aggregate neighbors + transform

Layer 2

h_v⁽²⁾ encodes 2-hop neighborhood. Each node "knows about" its neighbors' neighbors. Shape: [|V|, d]

↓ (K times total)

Output

z_v = h_v^(K) — final embedding, ready for downstream tasks.

In the GNN update equation h_v^(k) = σ(W_k · AGGREGATE(...) + B_k · h_v^(k-1)), what do W_k and B_k represent, and how are they applied across nodes?

W_k and B_k are separate weight matrices for each node, learned independently during training W_k (for neighbor aggregation) and B_k (for self-update) are SHARED across all nodes at layer k — every node uses the same weights, enabling generalization to new nodes W_k transforms the node's own features and B_k transforms neighbors' features

Chapter 2: Computation Graphs

Let's make the aggregation process concrete. When we compute node v's embedding with K GNN layers, we're really building a tree — the computation graph rooted at v. This tree records exactly what information flows into v's embedding.

At the bottom of the tree (layer 0) are the raw input features of every node in v's K-hop neighborhood. Each layer aggregates information one step "upward" toward the root, until we reach the root (layer K), which is v's final embedding.

A computation graph is different for every node. Node A may have 3 neighbors, so its layer-1 computation graph has 3 branches. Node B may have 7 neighbors, so its layer-1 computation graph has 7 branches. Same model — but different computational shapes depending on the neighborhood topology. This is a key difference from standard neural networks, which have fixed, uniform computation graphs for every input.

An Example

Consider a small graph: A–B–C–D, a simple path. Let's compute node B's embedding with K = 2 layers.

Layer 0: Every node starts with its feature vector: x_A, x_B, x_C, x_D.

Layer 1: We need h_B⁽¹⁾. B's neighbors are A and C. So we aggregate {h_A⁽⁰⁾, h_C⁽⁰⁾} = {x_A, x_C}, transform with W₁, combine with h_B⁽⁰⁾ = x_B via B₁, and activate. Similarly we compute h_A⁽¹⁾ (needs x_B) and h_C⁽¹⁾ (needs x_B, x_D).

Layer 2: We need h_B⁽²⁾. B's neighbors are A and C. So we aggregate {h_A⁽¹⁾, h_C⁽¹⁾}, transform, combine with h_B⁽¹⁾. Notice that h_A⁽¹⁾ already contains information about x_B, and h_C⁽¹⁾ contains information about x_B and x_D. So h_B⁽²⁾ contains information about the entire 2-hop neighborhood of B: {A, B, C, D}.

The Tree Structure

Unrolling this gives a tree rooted at B. The tree is NOT the graph — it's the computation graph. A node can appear multiple times in the tree (e.g., B appears as both the root and as a leaf in A's branch). But crucially, these different appearances share the SAME weight matrices W₁, B₁, W₂, B₂ — shared parameters, different computations.

Computation Graph Explorer

Click a node to see its K=2 computation graph (the tree of neighbors that determines its embedding). The same weight matrices are shared at every tree node at the same layer.

Click a node to expand its K=2 computation graph.

Why is node v's computation graph a TREE and not the original graph?

Because GNNs remove cycles from the graph during preprocessing Because each layer only looks at nodes within a fixed radius Because unrolling the K-hop aggregation creates a recursive tree where the same node can appear at multiple branches (since a shared neighbor contributes to multiple paths), but the computation flows one direction — from leaves (features) to root (embedding)

Chapter 3: GCN — The Simplest GNN

Now let's look at the most famous concrete GNN: the Graph Convolutional Network (GCN), introduced by Kipf & Welling at ICLR 2017 (arXiv: 1609.02907). It's the simplest possible instantiation of the aggregate-and-transform idea, and it became the foundational model for nearly all graph deep learning that followed.

The GCN Update Rule

GCN uses mean aggregation — take the average of all neighbors' embeddings, then apply a linear transformation and nonlinearity:

h_v^(k) = σ( W_k · ¹⁄_|N(v)| ∑_{u ∈ N(v)} h_u^(k-1) + B_k · h_v^(k-1) )

Where |N(v)| is the number of neighbors of v. The mean normalizes by degree — a node with 100 neighbors doesn't produce a 100× larger signal than a node with 1 neighbor. This normalization is crucial for numerical stability.

Often simplified by folding the self-loop in: The most common GCN formulation adds a self-loop to the graph (Ã = A + I) and combines W_k and B_k into a single matrix. Then the update is just: h_v^(k) = σ(W_k · mean({h_u^(k-1) : u ∈ Ñ(v)})) where Ñ(v) = N(v) ∪ {v}. Simpler and often slightly better.

Data Flow: What Goes In, What Comes Out

Let's be concrete about shapes. Say we have a graph with 6 nodes, each with 3 input features. We use GCN with K=2 layers and hidden dimension d=4.

Input: X ∈ R^{6 × 3} — feature matrix. Row i = feature vector of node i.

Layer 1: W₁ ∈ R^{3 × 4}. After aggregation: H⁽¹⁾ ∈ R^{6 × 4}.

Layer 2: W₂ ∈ R^{4 × 4}. After aggregation: H⁽²⁾ ∈ R^{6 × 4}.

Total parameters: W₁ (12) + W₂ (16) = 28. Completely independent of |V| = 6.

Compare to shallow embeddings: 6 × 4 = 24 parameters, but scaling to 6 million nodes would need 24 million. GCN still uses 28.

The Three Steps of One GCN Layer

1. SEND — Neighbors Broadcast

Each neighbor u sends its current embedding h_u^(k-1) to node v.

↓

2. AGGREGATE — Mean Pool

Node v averages all received messages: m_v = (1/|N(v)|) ∑_{u ∈ N(v)} h_u^(k-1).

↓

3. UPDATE — Neural Network

h_v^(k) = ReLU(W_k · m_v + B_k · h_v^(k-1)).

GCN Message Passing — Step by Step

A small graph with 6 nodes, each with a scalar feature (shown as size/color). Watch how features propagate through 2 GCN layers. Press Step to advance one layer. Watch embeddings evolve.

Layer 0: Raw input features. Each node shows its scalar feature value.

python
# GCN layer in PyTorch — exact implementation
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # W and B combined into one linear layer (operates on [neighbor_agg || self])
        self.linear = nn.Linear(in_dim, out_dim)
        self.activation = nn.ReLU()

    def forward(self, H, A_norm):
        # H: [N, d_in] — node embeddings
        # A_norm: [N, N] — normalized adjacency with self-loops
        # A_norm[i,j] = 1/sqrt(deg_i * deg_j) for (i,j) edge
        aggregated = A_norm @ H        # [N, d_in] — weighted mean of neighbors
        return self.activation(self.linear(aggregated))  # [N, d_out]

class GCN(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, num_classes):
        super().__init__()
        self.layer1 = GCNLayer(in_dim, hidden_dim)
        self.layer2 = GCNLayer(hidden_dim, out_dim)
        self.classifier = nn.Linear(out_dim, num_classes)

    def forward(self, X, A_norm):
        # X: [N, in_dim] — raw node features
        H1 = self.layer1(X, A_norm)    # [N, hidden_dim]
        H2 = self.layer2(H1, A_norm)   # [N, out_dim]
        return self.classifier(H2)     # [N, num_classes]

In GCN, why does the mean aggregation divide by |N(v)| (the number of neighbors)?

To make the aggregation differentiable with respect to the graph structure Because most graphs are regular (all nodes have the same degree) To normalize for degree — otherwise high-degree nodes would produce arbitrarily large signals, making training unstable and making the model degree-sensitive rather than structure-sensitive

Chapter 4: Matrix Form of GCN

The node-by-node update rule is conceptually clear, but for actual computation we want a matrix formula that lets us update all nodes simultaneously with efficient linear algebra. Kipf & Welling's paper derives this matrix form carefully.

Building Up to the Formula

Start with the adjacency matrix A (N × N). The (i,j) entry is 1 if there's an edge, 0 otherwise. Now add self-loops: Ã = A + I_N. This means each node is its own neighbor, so the self-connection B_k term naturally merges into the aggregation.

Let D̃ be the diagonal degree matrix of Ã: D̃_ii = ∑_j Ã_ij = deg(i) + 1 (degree plus self-loop). Then the mean aggregation over all nodes simultaneously is:

H^(k) = σ( D̃^−½ Ã D̃^−½ H^(k−1) W_k )

This is the GCN equation from the paper. Let's unpack each piece:

Ã = A + I: Adjacency with self-loops. The self-loop is why we don't need a separate B_k term anymore — node v is its own neighbor.

D̃^−½ Ã D̃^−½: The symmetric normalized adjacency. Entry (i,j) becomes 1/√(deg(i)·deg(j)) for connected nodes. This is a more symmetric normalization than simple row-normalization.

H^(k−1) W_k: Linear transformation applied to all node embeddings. W_k ∈ R^{d_k−1 × d_k} is the only learned parameter at layer k.

σ: Nonlinearity (ReLU between layers, softmax at the last layer for classification).

Why Symmetric Normalization?

Simple row normalization D̃⁻¹Ã would make row i sum to 1. But this is asymmetric — if node A has 10 neighbors and node B has 2 neighbors, information flowing from A to B gets divided by 10, while information flowing from B to A gets divided by 2. The symmetric normalization D̃^−½Ã D̃^−½ ensures that BOTH the sender and receiver degree are accounted for. Edge (i,j) gets weight 1/√(d_i·d_j). This is more geometrically natural and produces better empirical results.

Connection to spectral graph theory: D̃^−½Ã D̃^−½ is directly related to the normalized graph Laplacian L = I − D^−½AD^−½. The paper motivates GCN from a spectral perspective — it's a first-order approximation of spectral graph convolutions using Chebyshev polynomials. But you don't need spectral theory to USE GCN. The message passing view is sufficient. The spectral view just explains WHY this particular normalization is theoretically justified.

Pre-computing the Propagation Matrix

The normalized adjacency Ã = D̃^−½Ã D̃^−½ depends only on the graph structure, not on the learned parameters. You can compute it ONCE before training and reuse it throughout. This makes each training step extremely fast: just matrix multiplications Ã H^(k−1) W_k.

python
# Pre-compute normalized adjacency — done ONCE before training
import torch
import scipy.sparse as sp
import numpy as np

def normalize_adjacency(A):
    # A: scipy sparse adjacency matrix [N, N]
    N = A.shape[0]
    A_tilde = A + sp.eye(N)            # Ã = A + I (self-loops)
    D_tilde = np.array(A_tilde.sum(1)).flatten()  # D̃ diagonal
    D_inv_sqrt = sp.diags(D_tilde ** -0.5)       # D̃^(-½)
    A_norm = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # D̃^(-½) Ã D̃^(-½)
    return torch.FloatTensor(A_norm.toarray())    # [N, N] dense tensor

# Pre-compute once:
A_norm = normalize_adjacency(A)  # cached for the entire training run

# Each forward pass is just:
# H^(k) = relu(A_norm @ H^(k-1) @ W_k)

What does the symmetric normalization D̃^-½Ã D̃^-½ achieve compared to simple row-normalization D̃^-1Ã?

It makes the adjacency matrix sparse, reducing computation It prevents gradient vanishing during backpropagation It accounts for BOTH the sender and receiver degree — edge (i,j) gets weight 1/√(d_i · d_j), whereas row-normalization only divides by the receiver degree, creating asymmetric information flow

Chapter 5: Training a GNN

You have a GNN that maps input features X and graph structure A to node embeddings Z. Now how do you train it? The answer depends on how much labeled data you have.

Supervised Node Classification

You have a graph with N nodes. Some nodes have labels (e.g., "spam" or "real user"). The GNN produces an embedding z_v for every node. A linear classifier on top predicts the label:

ŷ_v = softmax(W_out · z_v)

The loss is cross-entropy over all LABELED nodes:

L = − ∑_{v ∈ V_labeled} y_v · log(ŷ_v)

Backprop flows through the classification head, through z_v, through all K GNN layers. The weight matrices W₁, ..., W_K, W_out are all updated.

Semi-Supervised (The GCN Paper's Setting)

The Kipf & Welling paper's key setup: only a TINY fraction of nodes are labeled (e.g., 20 labeled nodes out of 2708 total in the Cora citation graph — less than 1%). The GNN still runs on ALL nodes and uses the ENTIRE graph structure. Even unlabeled nodes influence the embeddings of labeled nodes through the message passing. The model learns to propagate label information through the graph structure.

Why semi-supervised works so well here: The GNN's message passing IS a form of label propagation. If node A is labeled "machine learning paper" and A's neighbor B is unlabeled, then after one GCN layer, B's embedding contains information about A. The classifier can use this to correctly classify B, even though B has no label. The graph structure acts as "free supervision" — it tells the model which nodes are likely to have the same label (connected ones).

Unsupervised Training

No labels at all? Use the graph structure itself as the training signal. The DeepWalk/node2vec objective from Lecture 2 works perfectly:

L = − ∑_{(u,v) ∈ walks} log( σ( z_u^T z_v ) ) + ∑_{v_n ∈ neg} log( σ( −z_u^T z_{v_n} ) )

Nodes that co-occur on random walks should have high dot product similarity. The GNN is used as the encoder instead of a lookup table — same objective, better encoder.

Link Prediction and Graph Classification

GNNs aren't limited to node classification. For link prediction, compute z_u and z_v and predict whether edge (u,v) should exist (dot product or MLP). For graph classification, aggregate all node embeddings into a single graph embedding (mean/sum/max pooling over nodes), then classify the whole graph.

Forward Pass

X, A → GNN layers (K × aggregate + transform) → Z ∈ R^{N × d}

↓

Task Head

z_v → softmax → ŷ_v (node classification) OR z_u^Tz_v (link prediction) OR mean(Z) → ŷ (graph classification)

↓

Loss

Cross-entropy / MSE / contrastive — computed only on labeled nodes/edges

↓

Backprop

Gradients flow through the task head and ALL K GNN layers, updating W₁, ..., W_K

In the semi-supervised GCN setting, how does the model classify unlabeled nodes when computing loss only on labeled nodes?

It ignores unlabeled nodes entirely during training Message passing propagates information from labeled to unlabeled nodes — unlabeled nodes aggregate embeddings from their labeled neighbors, allowing the classifier to transfer label information through the graph structure Unlabeled nodes are assigned pseudo-labels using k-nearest neighbors before training

Chapter 6: GNNs Generalize CNNs

Here's a fact that makes the universe feel more unified: a CNN is just a GNN on a grid graph. If you don't believe me, let's build the connection.

Pixels Are Nodes; Adjacency Is a Grid

Take a 32×32 image — 1024 pixels. Define a graph where each pixel is a node, and two pixels are connected if they are adjacent (up, down, left, right, or diagonally). You get a grid graph. Each node has a feature vector equal to the pixel's RGB values.

Now apply a GNN with mean aggregation on this grid graph. What does one layer do?

For pixel (i,j), the aggregation collects features from the 8 neighboring pixels and averages them, then applies W. This is EXACTLY a 3×3 convolution with uniform weights. The convolutional kernel IS the aggregation weight matrix W.

CNNs are GNNs with a special inductive bias: CNNs assume the graph is a REGULAR GRID (every node has the same number of neighbors, arranged in a fixed spatial pattern). This lets us represent the aggregation weights as a small kernel that tiles across the image. GNNs generalize this to ARBITRARY graphs: irregular topology, variable degrees, no spatial structure required. The kernel weights (W matrix in GNN) are still shared across all nodes — just like CNN filters are shared across all pixel locations.

Why Image Data Doesn't Need a Full GNN

If graphs generalize CNNs, why don't we just run GNNs on images? Practical reasons:

CNNs exploit the grid structure to use highly optimized CUDA kernels. GNNs on arbitrary graphs can't use the same optimizations.
For images, the grid structure is always the same. No need for the generality of arbitrary graphs.
CNNs have decades of engineering behind them (BatchNorm, weight initialization, learning rate schedules). Just use them for image data.

But for data that is ACTUALLY a graph — molecules, social networks, citation graphs, protein interaction networks — there is no grid. The GNN's generality is not optional; it's required.

CNN as GNN on a Grid Graph

A 5x5 pixel grid with one node highlighted (orange). Its 8 connected neighbors are shown in teal. In a standard 3x3 conv, EXACTLY these neighbors' values are aggregated. Toggle "Graph View" to see the grid as an adjacency graph.

Grid view. Orange = selected node. Teal = its neighbors (the conv kernel footprint).

What is the key difference between a CNN and a GNN, and why does this matter for molecular or social network data?

CNNs are faster because they use GPUs, while GNNs must run on CPUs GNNs use more parameters than CNNs for the same task CNNs assume a fixed regular grid (every node has exactly 8 neighbors in the same spatial arrangement). GNNs work on arbitrary graphs with variable and irregular neighbor sets — essential for molecules (atoms have 1-4 bonds), social networks, and citation graphs where no grid structure exists

Chapter 7: Transformers as GNNs

We said CNNs are GNNs on grid graphs. Now let's push further: Transformers are GNNs on fully connected graphs.

Self-Attention Is Message Passing

In a Transformer, self-attention computes, for each token i, a weighted sum of all other tokens' values:

h_i = ∑_j α_ij V_j

α_ij = softmax( Q_i^T K_j / √d_k )

Compare to GNN message passing at one layer:

h_v^(k) = AGGREGATE( { w_uv · h_u^(k−1) : u ∈ N(v) } )

The structure is identical: each node aggregates weighted messages from its neighbors. In a Transformer, the "graph" is the fully connected graph — every token is a neighbor of every other token. The edge weights are the attention scores α_ij, which are computed dynamically from the content of the tokens (unlike fixed GCN weights W).

Three graph structures, three models:
Grid graph (each node has 8 spatial neighbors, fixed weights) = CNN
Sparse graph (each node has a few neighbors, fixed structure) = GNN/GCN
Fully connected graph (each node is a neighbor of all others, dynamic weights) = Transformer

The Efficiency-Expressivity Tradeoff

Why not just use Transformers for all graph data? Two reasons:

Quadratic cost: For a graph with N nodes, a Transformer attention layer requires O(N²) operations — computing a score for every pair. For a citation graph with 2 million papers, that's 4 trillion operations per layer. GNNs on the actual sparse graph (say, average degree 10) require O(10N) = O(N) operations per layer — 400,000x cheaper.

Structural inductive bias: Real graphs are sparse because there's a reason edges exist — chemical bonds, friend connections, citation links. These edges encode domain knowledge. Giving the model ALL possible edges (Transformer) dilutes this signal with noise from pairs that have no meaningful relationship.

The design choice is clear: use GNNs for data with a meaningful sparse graph structure; use Transformers when you want the model to learn which relationships matter (at quadratic cost).

Model	Graph	Edge Weights	Cost
CNN	Grid (fixed)	Kernel (static)	O(N)
GNN	Sparse (given)	Uniform / learned	O(E)
Transformer	Complete	Attention (dynamic)	O(N²)

Modern synthesis: Recent work (Graphormer, Graph-Transformers) uses Transformers on graphs but incorporates graph structure as positional encodings or attention biases. Sparse GNNs on subgraphs. This combines the expressivity of Transformers with the efficiency and inductive biases of GNNs.

A standard Transformer self-attention layer operates on which type of graph, and why are GNNs often preferred for social network or molecular data despite Transformers being more expressive?

Transformers operate on trees (hierarchical attention), but GNNs handle cycles better Transformers operate on grid graphs, while GNNs operate on fully connected graphs Transformers operate on the fully connected graph (every token attends to every other), costing O(N²). GNNs on the actual sparse graph cost O(E) — orders of magnitude cheaper when N is large, and the known edges encode meaningful domain structure that guides the model

Chapter 8: Inductive vs Transductive

We've now seen enough to close the loop on Chapter 0's problem: why do shallow embeddings fail for new nodes, and how do GNNs fix it?

Transductive Learning (Shallow Embeddings)

A lookup-table embedding is transductive: it can only make predictions for nodes that were present during training. The embedding for node v is the v-th row of the weight matrix Z. If v wasn't in the training graph, there is no v-th row.

Think of it like a phone book: you look up "Alice" and get her number. But a phone book can't tell you the number for someone who moved to town after it was printed.

Inductive Learning (GNNs)

A GNN is inductive: it can compute embeddings for nodes it has never seen before. Here's why:

The GNN's parameters are the weight matrices W₁, ..., W_K. These are NOT node-specific — they're shared functions. To embed a new node v, you simply RUN the GNN: collect v's features, collect v's neighbors' features, apply the message-passing layers. The GNN has never seen v before, but it can still compute z_v using the same learned aggregation function it applied to training nodes.

Think of it like a recipe: A lookup table is like memorizing the exact height of every person you've ever met. A GNN is like having a formula: "measure leg length + torso length + head height." You can measure a new person you've never seen before using the SAME formula. The formula (GNN weights) generalizes; the memorized heights (embedding vectors) do not.

Generalizing Across Graphs

Inductivity means GNNs can even generalize across DIFFERENT graphs. Train a GNN for drug discovery on one set of molecules. Test it on a completely new molecule. The same aggregation weights apply — they've learned general chemistry patterns, not molecule-specific memories.

This is the critical distinction for deployed systems. In a real social network, thousands of new users sign up every hour. A transductive method is useless — you can't retrain from scratch every hour. An inductive GNN embeds new users in milliseconds by running the forward pass.

Property	Shallow (Transductive)	GNN (Inductive)
New nodes	Fail — no entry	Works — run forward pass
New graphs	Fail — retrain needed	Works — same weights
Uses features	No	Yes (natural)
Parameters	O(\|V\|·d)	O(K·d²) — fixed!
Example	DeepWalk, node2vec	GCN, GraphSAGE, GAT

When would you still use shallow embeddings? If your graph never changes and new nodes never appear, shallow embeddings are simpler to implement and sometimes perform on par with GNNs (especially on homogeneous graphs without useful node features). In practice, GNNs are almost always preferred for production systems because of their inductivity and feature-awareness.

A drug discovery company trains a GNN on 10,000 known molecules. A new molecule (never seen during training) is synthesized. Can the GNN embed it? Why?

No — the GNN's embedding matrix has no row for the new molecule No — the graph structure of the new molecule hasn't been seen in training Yes — GNNs are inductive. The learned weight matrices W₁,...,W_K are functions of neighborhood structure, not memorized embeddings. Running the forward pass on the new molecule's graph computes its embedding using the same aggregation logic trained on the 10,000 known molecules

Chapter 9: Connections

Let's see where GNNs fit in the broader landscape — what they build on, what they unlock, and where to go next.

What We Built This Lesson

We started from the failure modes of shallow embeddings (Chapter 0) and derived the GNN update rule (Chapter 1) from first principles. We saw how computation graphs (Chapter 2) make the aggregation concrete, instantiated it as GCN (Chapter 3), derived the matrix form (Chapter 4), learned how to train it (Chapter 5), and connected it to CNNs (Chapter 6) and Transformers (Chapter 7). Chapter 8 showed why GNNs can generalize while lookup tables cannot.

The GCN Paper

The foundation is Kipf & Welling's GCN (ICLR 2017, arXiv:1609.02907). The full derivation — from spectral convolutions through Chebyshev approximation to the final layer rule — is covered in the Veanors paper lesson. Reading the paper after this lesson will feel completely natural.

CS224W Series

Lecture 1 covered the graph ML landscape and why graphs matter. Lecture 2 built shallow embeddings (DeepWalk, node2vec). This lecture replaced the lookup-table encoder with a neural network. Next:

Lecture 4 (coming soon): GraphSAGE (sample-based induction), GAT (graph attention — attention-weighted aggregation). Different AGGREGATE functions, dramatically different behavior.
Lecture 5: Over-smoothing, over-squashing, and the pitfalls of deep GNNs.
Lecture 6: Graph classification with pooling (hierarchical, differentiable).

Limitations of What We've Covered

Limitation	Fixed By
Mean aggregation loses information about neighbor distribution	GraphSAGE (concat), GAT (attention), GIN (sum — maximally expressive)
Over-smoothing with many layers (embeddings converge)	Skip connections, DropEdge, JK-networks
Can't distinguish certain graph structures (1-WL test bound)	Higher-order GNNs, subgraph GNNs
Slow on very large graphs (full-batch message passing)	Mini-batch sampling (GraphSAGE), cluster-GCN, neighbor sampling
Directed / heterogeneous graphs need special handling	R-GCN (relation-type-specific weights), HAN (heterogeneous attention)

Related Lessons

The unsupervised training signal (random walk co-occurrence) comes from Lecture 2: Node Embeddings. The Transformer connection explored in Chapter 7 ties back to the Attention & Transformers lesson. For molecular GNNs specifically, see how this connects to drug discovery via the GCN paper walkthrough.

"The key insight of deep learning is that you can learn representations. GNNs extend this to graphs: instead of hand-crafting graph features, let the network learn to aggregate neighborhood information in a task-optimal way."
— Jure Leskovec, CS224W