CS224W Lecture 19 — Conclusion

Chapter 0: The Journey

You started the course with one core observation: most interesting data isn't a grid of pixels or a linear sequence of words. It's a web of relationships. Proteins interact with proteins. Users connect to items. Atoms bond to atoms. Papers cite papers. Genes regulate genes. The graph is the native data structure for this kind of knowledge.

The central challenge: how do you make a neural network that respects graph structure? You can't unroll a graph into a vector without destroying the structure. You can't apply convolution (which assumes a grid) to an irregular graph. The field needed a new paradigm — and that paradigm is message passing.

The Arc of the Course

Node Embeddings (Lec 2)

The first attempt: DeepWalk, Node2Vec. Learn low-dimensional vectors for nodes by random-walking the graph. Works, but doesn't generalize to new graphs or use node features.

↓

GNN: Message Passing (Lec 3-5)

The core invention: nodes send messages along edges, aggregate neighbor information, and update their own representation. GCN, GraphSAGE, GAT. Generalizes across graphs; uses features. The rest of the course builds on this foundation.

↓

Theory (Lec 6-7)

How expressive are GNNs? The WL isomorphism test. GINs that are maximally expressive. Graph Transformers that go beyond WL via positional encodings. Understanding the limits.

↓

Applications (Lec 8-16)

Heterogeneous graphs, knowledge graphs, recommender systems, relational deep learning, large-scale deployment, LLM+GNN integration. Real-world engineering at scale.

↓

Frontier (Lec 17-18)

Agents on graphs. Generative models for graphs. The field moves from prediction to reasoning and creation.

The through-line: Every lecture answers the same question differently — how do we learn from structure? The answer keeps evolving: shallow embeddings → learned aggregation → expressive GNNs → scalable frameworks → hybrid LLM+GNN → generative models. Each generation solves the limitations of the previous one.

The Course Timeline: Expressive Power vs. Scale

Plotting the major methods by their expressive power (how much graph structure they capture) and scale (how large a graph they can handle). Click a point to see the lecture.

What fundamental limitation of shallow node embeddings (DeepWalk, Node2Vec) motivated the development of message-passing GNNs?

Shallow embeddings require too much memory Shallow embeddings can only handle directed graphs Shallow embeddings are transductive — they assign vectors to specific nodes seen during training and cannot generalize to new nodes or use node features. GNNs compute embeddings from features and graph structure, enabling inductive generalization

Chapter 1: The Core Ideas

Everything in graph ML reduces to a small set of ideas. If you understand these deeply, you can derive the rest. Here they are, one by one.

1. Message Passing: The Universal Primitive

At every layer l, each node v aggregates messages from its neighbors N(v), then updates its own representation:

m_v^(l) = AGG({h_u^(l-1) : u ∈ N(v)})

h_v^(l) = UPD(h_v^(l-1), m_v^(l))

The choice of AGG (sum, mean, max, attention-weighted) and UPD (MLP, GRU) defines the GNN variant. Sum = GIN (most expressive). Mean = GCN (normalizes by degree). Attention-weighted = GAT (learnable importance). The framework unifies hundreds of papers.

2. The WL Hierarchy: Expressiveness Has a Ceiling

The Weisfeiler-Lehman graph isomorphism test is the theoretical ceiling for message-passing GNNs. Two graphs that fool the 1-WL test (assign the same color sequence) are indistinguishable by any message-passing GNN, no matter how deep or wide. Regular graphs of the same size fool 1-WL. Going beyond requires higher-order WL (k-WL) or structural encodings.

The practical consequence: GCN, GraphSAGE, GAT — they're all sub-WL (mean aggregation) or WL-equivalent (sum aggregation). If your task requires distinguishing structures that 1-WL cannot tell apart (e.g., whether a graph has a 5-cycle vs. 6-cycle), standard GNNs will fail. You need graph transformers or subgraph GNNs.

3. Positional Encoding: Breaking Symmetry

Without positional encoding, all nodes in a symmetric graph get the same embedding. There's no way to distinguish node 3 from node 7 in a regular graph — they look identical to the GNN. Positional encodings break this symmetry by giving each node a unique structural signature: random walk landing probabilities, Laplacian eigenvectors, or distance to anchor nodes.

4. Graph-Level Pooling: From Nodes to Graphs

Many tasks need a graph-level representation (predict if this molecule is toxic). Graph pooling aggregates node representations into a single graph vector. Simple: sum/mean all node embeddings. Better: hierarchical pooling (DiffPool, MinCutPool) that mimics the hierarchical structure of real graphs.

Message Passing: One Layer Visualized

Watch one layer of GNN message passing. Each node collects colored messages from its neighbors, aggregates them, and produces a new representation. Click "Next Layer" to see the representations after aggregation.

Why does mean aggregation (as used by GCN) make the GNN strictly less expressive than sum aggregation (GIN)?

Mean aggregation is slower to compute Mean aggregation cannot distinguish a node with 2 neighbors of feature [1,0] from a node with 4 neighbors of feature [1,0] — both give mean [1,0]. Sum gives [2,0] vs [4,0], preserving multiset structure Mean aggregation doesn't use edge weights

Chapter 2: The Application Landscape

Graph ML is not one field solving one problem — it's a general framework that has been successfully applied across six major domains. Each domain has different graph types, different tasks, and different engineering challenges.

Domain	Graph type	Task	Key method	Impact
Drug discovery	Molecular graphs (atoms = nodes, bonds = edges)	Property prediction, generation	MPNN, JT-VAE, GCPN	Insilico: Phase II drug in 18 months
Social networks	User-user, user-item bipartite	Link prediction, recommendation	GraphSAGE, LightGCN, PinSage	Pinterest: 150M MAU recommendations
Knowledge graphs	Entity-relation-entity triples	Link prediction, Q&A	TransE, RotatE, GNN-KG agents	Wikidata query answering
Recommender systems	User-item interaction bipartite	Rating prediction, ranking	NGCF, LightGCN, IDCF	Deployed at Netflix, Amazon, Alibaba
Relational databases	Table join graphs	Prediction over multi-table data	RelBench, HeteroGNN	Cross-table ML without feature engineering
3D structure	Point clouds + k-NN graphs	Protein structure, material properties	SE(3)-GNN, AlphaFold, GVP	Protein folding solved; material design

The unifying observation: In every domain, the graph captures relationships that flat feature vectors cannot. A drug's properties depend not just on which atoms are present, but on how they're connected. A user's preferences depend not just on their history, but on who they're connected to. The graph is the missing context.

Domain Coverage: Graph Type vs. Scale

Each domain has a characteristic scale (nodes) and graph type. Click a domain to see its typical graph structure and key challenge.

AlphaFold 2 (protein structure prediction) doesn't explicitly use a standard GNN. But it uses message-passing-like attention between residue pairs. What "graph" is it implicitly operating on?

The graph of known protein structures in the PDB database The bond graph of the protein's amino acid sequence The all-pairs contact graph of residues: attention between residue i and residue j implements message passing over the complete graph where edge weights are learned pairwise attention scores representing spatial proximity

Chapter 3: What Changed (2017 → 2025)

The field of graph ML has changed completely in eight years. Not just incremental improvement — a fundamental shift in what's possible and how we think about the problem.

2017: The Beginning

Kipf and Welling publish GCN (December 2016/published 2017). It's a 2-layer network on the Cora citation graph, 1433 node features, 7 classes, 2708 nodes. Training takes seconds. The key insight: spectral graph convolution can be approximated by simple first-order neighborhood aggregation. This creates the field.

GCN (2017) in one equation: H^(l+1) = σ(D̃^-½ Ã D̃^-½ H^(l) W^(l)) — where Ã = A + I (add self-loops), D̃ is the degree matrix. That's it. The entire model.

The Scaling Inflection (2019-2022)

OGB (Open Graph Benchmark) launches in 2020 — standard large-scale benchmarks. ogbn-products (2.4M nodes), ogbn-papers100M (111M nodes). Suddenly the question isn't "does it work on Cora?" but "does it work at web scale?" Mini-batch training, neighbor sampling (GraphSAGE), cluster-GCN, SIGN emerge to handle the scale challenge.

The Foundation Model Era (2023-2025)

The most dramatic shift: graph foundation models. Instead of training a separate GNN for each task and graph, can we train one massive model that transfers across graphs? This mirrors what GPT-3 did for text: one model, many tasks, zero-shot generalization.

The core challenge: Text has a fixed vocabulary (50k tokens). Graphs don't — every domain has different node/edge types (atoms vs. papers vs. users). How do you build a universal tokenizer for graphs?

Current approaches: (1) Use text descriptions as features (LLM-encoded), unifying all domains via language. (2) Learn graph-structural tokens via VQ-VAE on subgraphs. (3) Condition on graph schema/ontology. None has fully solved the problem yet — it's the open frontier.

Year	Milestone	Key innovation	Scale
2016/17	GCN	Spectral → spatial approximation	2.7K nodes
2017	GraphSAGE	Inductive learning, neighbor sampling	100K nodes
2018	GAT, GIN	Attention aggregation; expressiveness theory	—
2019	PyG 1.0	Standard implementation framework	—
2020	OGB, DGL-KE	Large-scale benchmarks; KG embeddings	111M nodes
2022	Graph Transformers	Global attention + positional encoding	—
2023	LLM+GNN, RelBench	Text-rich graphs, relational deep learning	—
2024	Graph Foundation Models	Cross-graph transfer learning	Emerging

What is the core challenge in building a "graph foundation model" analogous to GPT-3 for text?

Graphs are too large to train on with transformer attention Unlike text (which has a shared 50k-token vocabulary), graphs across domains have incompatible node/edge type systems — there's no universal "graph vocabulary" that applies to atoms, users, and papers simultaneously Graph foundation models already exist and are not a current research challenge

Chapter 4: Open Problems

The most important thing a good course can give you is not just what we know, but what we don't know. Here are the four hardest open problems in graph ML — problems where a PhD thesis could make a real contribution.

Open Problem 1: Scalability to Billion-Node Graphs

The web-scale graph (Facebook social graph: 3B nodes, 300B edges) is too large for any current full-batch GNN. Even mini-batch methods with neighbor sampling struggle: sampling 2-hop neighborhoods for batch size 1000 might touch millions of nodes. Current workarounds (GraphSAGE, cluster-GCN, SIGN) all introduce approximation errors. A principled approach to exact inference on billion-node graphs doesn't exist yet.

Open Problem 2: Expressiveness Beyond 1-WL Without Cubic Cost

We know 1-WL GNNs can't distinguish certain graph pairs (regular graphs, triangles vs. 4-cycles). Higher-order WL (2-WL, 3-WL) is strictly more expressive but costs O(n²) or O(n³) per layer — unusable at scale. Is there a way to get k-WL expressiveness with O(n) cost? Some recent work (subgraph GNNs, DS-WL) takes steps toward this, but there's no clean answer yet.

Open Problem 3: Dynamic Graphs

Real graphs change over time. Twitter's follow graph changes every second. A protein interaction network changes during cell differentiation. Current GNNs are static: they're trained on a snapshot and can't efficiently update when edges/nodes appear or disappear. Temporal GNNs (TGAT, TGN) exist but are expensive. Efficient streaming graph learning remains open.

Open Problem 4: Interpretability

When a GNN predicts that a molecule is toxic, which structural patterns caused that prediction? GNNExplainer, PGExplainer, and similar methods attempt to find important subgraphs, but their fidelity is limited: the "explanation" doesn't always faithfully reflect the model's actual computation. For high-stakes domains (drug safety, credit scoring), this is a blocker for deployment.

Open Problem Landscape

Click an open problem to see its current state, best partial solutions, and what a breakthrough would unlock.

A GNN trained on a protein interaction graph snapshot from 2022 needs to predict interactions for newly discovered proteins in 2024. What open problem does this expose?

Scalability — the 2024 protein graph is too large Expressiveness — the GNN can't distinguish protein substructures Dynamic graphs — the model was trained on a static snapshot and cannot efficiently update as new nodes (proteins) and edges (interactions) are discovered, requiring expensive retraining

Chapter 5: The Research Frontier

Three convergences are defining the next five years of graph ML research. Each combines graph learning with a currently dominant paradigm from the broader AI field.

Frontier 1: Graph + LLM

As covered in Lecture 16: LLMs give rich semantic understanding of node text; GNNs give structural reasoning. The question for 2025-2027: can we build a unified architecture that processes both modalities without treating them as separate? Graph Tokens approaches (mapping graph substructures to special tokens in the LLM's vocabulary) are the leading direction. If successful, this could enable LLMs to "think in graphs" natively.

The dream: A single model that, given a question about a knowledge graph, can seamlessly alternate between language reasoning ("Imatinib is a kinase inhibitor...") and graph traversal ("...connected to BCR-ABL1 by targets relation → hop to Dasatinib") without explicit tool calls. End-to-end differentiable graph-language reasoning.

Frontier 2: Graph + Diffusion

As covered in Lecture 18: diffusion models for graph generation are achieving state-of-the-art on molecular benchmarks. The open question: can these scale to protein-level graphs (thousands of nodes), and can the diffusion process be conditioned on arbitrary molecular properties rather than just property scores? Structure-based drug design (conditioning on 3D protein pocket geometry) is the target application.

Frontier 3: Graph + RL

As covered in Lecture 17: RL for agent optimization on knowledge graphs. Beyond this: using graphs as the world model in RL. The environment's causal structure (which variables affect which other variables) is a graph. An agent that can learn and exploit this causal graph will be dramatically more sample-efficient than one that doesn't.

The Grand Unified Vision: A foundation model that knows the graph of knowledge (KG), can reason over it in language (LLM), can generate new entities and relations (diffusion), and can learn optimal strategies for navigating and extending it (RL). This is not science fiction — it's the direction every major lab is moving toward.

Convergence	Current state	Key open question	Horizon
Graph + LLM	Cascade architectures (LLM then GNN)	Unified end-to-end architecture	2025-26
Graph + Diffusion	DiGress/GDSS for small molecules	Protein/material scale; 3D conditioning	2025-27
Graph + RL	GCPN; agents on KG	Causal graph as world model	2026-28
Graph Foundation Model	Specialized per-domain	Universal graph vocabulary	2026-28

The "Graph + LLM" research frontier aims for end-to-end differentiable graph-language reasoning. What does "end-to-end differentiable" mean in this context?

The model can process graphs and text without any separate preprocessing steps The model trains faster because gradients flow directly from output to input Gradients can flow through both the graph traversal and the language generation jointly during training — meaning the model can learn which graph hops are useful for which language outputs, rather than treating graph access as a fixed tool call external to the neural computation

Chapter 6: Practical Advice

You've learned the theory. Now, how do you actually use it? When you sit down with a new graph ML problem, which method do you reach for? Here's the decision procedure that experienced practitioners use.

Step 1: What is your task?

Node-level task (predict a property of each node): Use GCN or GraphSAGE for homophilic graphs (neighbors tend to have same label). Use GAT if you want the model to learn which neighbors are important. Use GIN if you need maximum expressiveness. Large graph? Use ClusterGCN or GraphSAGE with neighbor sampling.

Graph-level task (predict a property of the whole graph): Add a global pooling layer after your GNN. Sum pooling + GIN = maximum expressiveness. Mean pooling + GCN = good default. Hierarchical pooling (DiffPool) if the graph has natural hierarchical structure (e.g., molecular scaffolds).

Link prediction task (predict whether an edge exists): Use node2vec or GraphSAGE embeddings + dot product decoder for simple cases. Use R-GCN or CompGCN for knowledge graphs with typed relations. Use LightGCN for recommender systems (clean, fast, surprisingly strong).

Step 2: What is your graph?

Graph type	Recommended method	Why
Homogeneous, small (<100K nodes)	GCN or GIN	Fast, well-understood, good baselines
Homogeneous, large (>1M nodes)	GraphSAGE + mini-batch	Neighbor sampling scales linearly
Heterogeneous (typed nodes/edges)	R-GCN or HGT	Separate weight matrices per relation type
Knowledge graph (triples)	TransE/RotatE + GNN	Geometric relation embedding + propagation
Molecular graph	MPNN or DMPNN	Designed for chemical graph features
Text-rich node attributes	LLM encoder + GNN	LLM encodes text; GNN aggregates structure
No node features	Add positional encodings first	PE breaks symmetry; any GNN can then learn

Step 3: The PyG Ecosystem

PyTorch Geometric (PyG) is the standard library. It implements 100+ GNN layers, all major datasets, and handles the data loading/batching complexity of graphs. Start with torch_geometric.nn.GCNConv or SAGEConv, compose layers, add a linear head, train with standard PyTorch. For large-scale graphs: DGL (Deep Graph Library) has better distributed training support.

PyG: minimal GNN classifier
import torch
from torch_geometric.nn import GCNConv, global_mean_pool
import torch.nn.functional as F

class SimpleGNN(torch.nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head  = torch.nn.Linear(hidden, out_dim)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))   # Layer 1: aggregate
        x = F.relu(self.conv2(x, edge_index))   # Layer 2: aggregate
        x = global_mean_pool(x, batch)          # Graph pooling
        return self.head(x)                      # Classification head

# That's a complete graph classifier. ~15 lines of real PyG code.
# Swap GCNConv → GINConv for max expressiveness.
# Swap GCNConv → GATConv for attention aggregation.

You have a graph with 50M nodes, no node features, and you need node-level classification. Which combination best addresses both scale and the missing features?

Full-batch GIN with sum pooling GCN with a single layer to reduce computation Laplacian eigenvector positional encodings (to give nodes structural identity) + GraphSAGE with neighbor sampling (to handle 50M node scale via mini-batching). These two tools address the two distinct challenges independently

Chapter 7: Connections — All 19 Lectures

Here is the complete map of CS224W. Every lecture, in order, with its core contribution and the one key paper to read if you want to go deeper.

Lec	Topic	Core contribution	Key paper
01	Introduction	Graphs are the right abstraction for relational data	Leskovec et al., "Snap"
02	Node Embeddings	DeepWalk, Node2Vec: random walk → embedding	Perozzi et al. 2014; Grover & Leskovec 2016
03	GNN Basics	GCN: spectral → spatial aggregation	Kipf & Welling 2017
04	GNN Design	GraphSAGE: inductive sampling; GAT: attention	Hamilton et al. 2017; Velickovic et al. 2018
05	GNN Augmentation	Virtual nodes, edge features, graph pooling	Hu et al. 2020 (OGB)
06	GNN Theory I	WL test as expressiveness ceiling; GIN maximizes it	Xu et al. 2019 (GIN)
07	GNN Theory II	Graph Transformers; positional encodings	Rampasek et al. 2022 (GPS)
08	Transformers on Graphs	Attention over all pairs; long-range dependencies	Kreuzer et al. 2021
09	Heterogeneous Graphs	Typed nodes/edges; R-GCN; HGT	Schlichtkrull et al. 2018 (R-GCN)
10	Knowledge Graphs	TransE, RotatE: geometric KG embeddings	Bordes et al. 2013; Sun et al. 2019
11	Recommender Systems	LightGCN: stripped GCN for bipartite user-item	He et al. 2020 (LightGCN)
12	Relational DL	RelBench: multi-table ML as GNN on join graphs	Hu et al. 2024 (RelBench)
13	RDL Advanced	Temporal join graphs; feature engineering automation	—
14	Advanced Topics	Equivariance, symmetry, SE(3)-GNNs	Thomas et al. 2018 (TFN)
15	KG Foundation Models	ULTRA: universal KG reasoning via relation graphs	Galkin et al. 2023 (ULTRA)
16	LLM + GNN	Cascade and joint integration of language + structure	Chen et al. 2024 (LLaGA)
17	Agents + Graphs	ReAct/Reflexion on KGs; RL for agent optimization	Yao et al. 2023; Sun et al. 2024
18	Graph Generation	GraphRNN, GCPN, JT-VAE, DiGress	You et al. 2018; Vignac et al. 2023
19	Conclusion	The unified arc from embeddings to foundation models	You are here.

Conclusion