CS224W Lecture 19

Conclusion

We started with a simple question: how do you learn from data that has structure? Nineteen lectures later, you have the full answer — and the tools to push the frontier further.

Prerequisites: All prior CS224W lectures. This is the synthesis.
8
Chapters
4+
Simulations
19
Lectures Covered

Chapter 0: The Journey

You started the course with one core observation: most interesting data isn't a grid of pixels or a linear sequence of words. It's a web of relationships. Proteins interact with proteins. Users connect to items. Atoms bond to atoms. Papers cite papers. Genes regulate genes. The graph is the native data structure for this kind of knowledge.

The central challenge: how do you make a neural network that respects graph structure? You can't unroll a graph into a vector without destroying the structure. You can't apply convolution (which assumes a grid) to an irregular graph. The field needed a new paradigm — and that paradigm is message passing.

The Arc of the Course

Node Embeddings (Lec 2)
The first attempt: DeepWalk, Node2Vec. Learn low-dimensional vectors for nodes by random-walking the graph. Works, but doesn't generalize to new graphs or use node features.
GNN: Message Passing (Lec 3-5)
The core invention: nodes send messages along edges, aggregate neighbor information, and update their own representation. GCN, GraphSAGE, GAT. Generalizes across graphs; uses features. The rest of the course builds on this foundation.
Theory (Lec 6-7)
How expressive are GNNs? The WL isomorphism test. GINs that are maximally expressive. Graph Transformers that go beyond WL via positional encodings. Understanding the limits.
Applications (Lec 8-16)
Heterogeneous graphs, knowledge graphs, recommender systems, relational deep learning, large-scale deployment, LLM+GNN integration. Real-world engineering at scale.
Frontier (Lec 17-18)
Agents on graphs. Generative models for graphs. The field moves from prediction to reasoning and creation.
The through-line: Every lecture answers the same question differently — how do we learn from structure? The answer keeps evolving: shallow embeddings → learned aggregation → expressive GNNs → scalable frameworks → hybrid LLM+GNN → generative models. Each generation solves the limitations of the previous one.
The Course Timeline: Expressive Power vs. Scale

Plotting the major methods by their expressive power (how much graph structure they capture) and scale (how large a graph they can handle). Click a point to see the lecture.

What fundamental limitation of shallow node embeddings (DeepWalk, Node2Vec) motivated the development of message-passing GNNs?

Chapter 1: The Core Ideas

Everything in graph ML reduces to a small set of ideas. If you understand these deeply, you can derive the rest. Here they are, one by one.

1. Message Passing: The Universal Primitive

At every layer l, each node v aggregates messages from its neighbors N(v), then updates its own representation:

mv(l) = AGG({hu(l-1) : u ∈ N(v)})
hv(l) = UPD(hv(l-1), mv(l))

The choice of AGG (sum, mean, max, attention-weighted) and UPD (MLP, GRU) defines the GNN variant. Sum = GIN (most expressive). Mean = GCN (normalizes by degree). Attention-weighted = GAT (learnable importance). The framework unifies hundreds of papers.

2. The WL Hierarchy: Expressiveness Has a Ceiling

The Weisfeiler-Lehman graph isomorphism test is the theoretical ceiling for message-passing GNNs. Two graphs that fool the 1-WL test (assign the same color sequence) are indistinguishable by any message-passing GNN, no matter how deep or wide. Regular graphs of the same size fool 1-WL. Going beyond requires higher-order WL (k-WL) or structural encodings.

The practical consequence: GCN, GraphSAGE, GAT — they're all sub-WL (mean aggregation) or WL-equivalent (sum aggregation). If your task requires distinguishing structures that 1-WL cannot tell apart (e.g., whether a graph has a 5-cycle vs. 6-cycle), standard GNNs will fail. You need graph transformers or subgraph GNNs.

3. Positional Encoding: Breaking Symmetry

Without positional encoding, all nodes in a symmetric graph get the same embedding. There's no way to distinguish node 3 from node 7 in a regular graph — they look identical to the GNN. Positional encodings break this symmetry by giving each node a unique structural signature: random walk landing probabilities, Laplacian eigenvectors, or distance to anchor nodes.

4. Graph-Level Pooling: From Nodes to Graphs

Many tasks need a graph-level representation (predict if this molecule is toxic). Graph pooling aggregates node representations into a single graph vector. Simple: sum/mean all node embeddings. Better: hierarchical pooling (DiffPool, MinCutPool) that mimics the hierarchical structure of real graphs.

Message Passing: One Layer Visualized

Watch one layer of GNN message passing. Each node collects colored messages from its neighbors, aggregates them, and produces a new representation. Click "Next Layer" to see the representations after aggregation.

Why does mean aggregation (as used by GCN) make the GNN strictly less expressive than sum aggregation (GIN)?

Chapter 2: The Application Landscape

Graph ML is not one field solving one problem — it's a general framework that has been successfully applied across six major domains. Each domain has different graph types, different tasks, and different engineering challenges.

DomainGraph typeTaskKey methodImpact
Drug discoveryMolecular graphs (atoms = nodes, bonds = edges)Property prediction, generationMPNN, JT-VAE, GCPNInsilico: Phase II drug in 18 months
Social networksUser-user, user-item bipartiteLink prediction, recommendationGraphSAGE, LightGCN, PinSagePinterest: 150M MAU recommendations
Knowledge graphsEntity-relation-entity triplesLink prediction, Q&ATransE, RotatE, GNN-KG agentsWikidata query answering
Recommender systemsUser-item interaction bipartiteRating prediction, rankingNGCF, LightGCN, IDCFDeployed at Netflix, Amazon, Alibaba
Relational databasesTable join graphsPrediction over multi-table dataRelBench, HeteroGNNCross-table ML without feature engineering
3D structurePoint clouds + k-NN graphsProtein structure, material propertiesSE(3)-GNN, AlphaFold, GVPProtein folding solved; material design
The unifying observation: In every domain, the graph captures relationships that flat feature vectors cannot. A drug's properties depend not just on which atoms are present, but on how they're connected. A user's preferences depend not just on their history, but on who they're connected to. The graph is the missing context.
Domain Coverage: Graph Type vs. Scale

Each domain has a characteristic scale (nodes) and graph type. Click a domain to see its typical graph structure and key challenge.

AlphaFold 2 (protein structure prediction) doesn't explicitly use a standard GNN. But it uses message-passing-like attention between residue pairs. What "graph" is it implicitly operating on?

Chapter 3: What Changed (2017 → 2025)

The field of graph ML has changed completely in eight years. Not just incremental improvement — a fundamental shift in what's possible and how we think about the problem.

2017: The Beginning

Kipf and Welling publish GCN (December 2016/published 2017). It's a 2-layer network on the Cora citation graph, 1433 node features, 7 classes, 2708 nodes. Training takes seconds. The key insight: spectral graph convolution can be approximated by simple first-order neighborhood aggregation. This creates the field.

GCN (2017) in one equation: H(l+1) = σ(D̃ à D̃ H(l) W(l)) — where à = A + I (add self-loops), D̃ is the degree matrix. That's it. The entire model.

The Scaling Inflection (2019-2022)

OGB (Open Graph Benchmark) launches in 2020 — standard large-scale benchmarks. ogbn-products (2.4M nodes), ogbn-papers100M (111M nodes). Suddenly the question isn't "does it work on Cora?" but "does it work at web scale?" Mini-batch training, neighbor sampling (GraphSAGE), cluster-GCN, SIGN emerge to handle the scale challenge.

The Foundation Model Era (2023-2025)

The most dramatic shift: graph foundation models. Instead of training a separate GNN for each task and graph, can we train one massive model that transfers across graphs? This mirrors what GPT-3 did for text: one model, many tasks, zero-shot generalization.

The core challenge: Text has a fixed vocabulary (50k tokens). Graphs don't — every domain has different node/edge types (atoms vs. papers vs. users). How do you build a universal tokenizer for graphs?
Current approaches: (1) Use text descriptions as features (LLM-encoded), unifying all domains via language. (2) Learn graph-structural tokens via VQ-VAE on subgraphs. (3) Condition on graph schema/ontology. None has fully solved the problem yet — it's the open frontier.
YearMilestoneKey innovationScale
2016/17GCNSpectral → spatial approximation2.7K nodes
2017GraphSAGEInductive learning, neighbor sampling100K nodes
2018GAT, GINAttention aggregation; expressiveness theory
2019PyG 1.0Standard implementation framework
2020OGB, DGL-KELarge-scale benchmarks; KG embeddings111M nodes
2022Graph TransformersGlobal attention + positional encoding
2023LLM+GNN, RelBenchText-rich graphs, relational deep learning
2024Graph Foundation ModelsCross-graph transfer learningEmerging
What is the core challenge in building a "graph foundation model" analogous to GPT-3 for text?

Chapter 4: Open Problems

The most important thing a good course can give you is not just what we know, but what we don't know. Here are the four hardest open problems in graph ML — problems where a PhD thesis could make a real contribution.

Open Problem 1: Scalability to Billion-Node Graphs

The web-scale graph (Facebook social graph: 3B nodes, 300B edges) is too large for any current full-batch GNN. Even mini-batch methods with neighbor sampling struggle: sampling 2-hop neighborhoods for batch size 1000 might touch millions of nodes. Current workarounds (GraphSAGE, cluster-GCN, SIGN) all introduce approximation errors. A principled approach to exact inference on billion-node graphs doesn't exist yet.

Open Problem 2: Expressiveness Beyond 1-WL Without Cubic Cost

We know 1-WL GNNs can't distinguish certain graph pairs (regular graphs, triangles vs. 4-cycles). Higher-order WL (2-WL, 3-WL) is strictly more expressive but costs O(n²) or O(n³) per layer — unusable at scale. Is there a way to get k-WL expressiveness with O(n) cost? Some recent work (subgraph GNNs, DS-WL) takes steps toward this, but there's no clean answer yet.

Open Problem 3: Dynamic Graphs

Real graphs change over time. Twitter's follow graph changes every second. A protein interaction network changes during cell differentiation. Current GNNs are static: they're trained on a snapshot and can't efficiently update when edges/nodes appear or disappear. Temporal GNNs (TGAT, TGN) exist but are expensive. Efficient streaming graph learning remains open.

Open Problem 4: Interpretability

When a GNN predicts that a molecule is toxic, which structural patterns caused that prediction? GNNExplainer, PGExplainer, and similar methods attempt to find important subgraphs, but their fidelity is limited: the "explanation" doesn't always faithfully reflect the model's actual computation. For high-stakes domains (drug safety, credit scoring), this is a blocker for deployment.

Open Problem Landscape

Click an open problem to see its current state, best partial solutions, and what a breakthrough would unlock.

A GNN trained on a protein interaction graph snapshot from 2022 needs to predict interactions for newly discovered proteins in 2024. What open problem does this expose?

Chapter 5: The Research Frontier

Three convergences are defining the next five years of graph ML research. Each combines graph learning with a currently dominant paradigm from the broader AI field.

Frontier 1: Graph + LLM

As covered in Lecture 16: LLMs give rich semantic understanding of node text; GNNs give structural reasoning. The question for 2025-2027: can we build a unified architecture that processes both modalities without treating them as separate? Graph Tokens approaches (mapping graph substructures to special tokens in the LLM's vocabulary) are the leading direction. If successful, this could enable LLMs to "think in graphs" natively.

The dream: A single model that, given a question about a knowledge graph, can seamlessly alternate between language reasoning ("Imatinib is a kinase inhibitor...") and graph traversal ("...connected to BCR-ABL1 by targets relation → hop to Dasatinib") without explicit tool calls. End-to-end differentiable graph-language reasoning.

Frontier 2: Graph + Diffusion

As covered in Lecture 18: diffusion models for graph generation are achieving state-of-the-art on molecular benchmarks. The open question: can these scale to protein-level graphs (thousands of nodes), and can the diffusion process be conditioned on arbitrary molecular properties rather than just property scores? Structure-based drug design (conditioning on 3D protein pocket geometry) is the target application.

Frontier 3: Graph + RL

As covered in Lecture 17: RL for agent optimization on knowledge graphs. Beyond this: using graphs as the world model in RL. The environment's causal structure (which variables affect which other variables) is a graph. An agent that can learn and exploit this causal graph will be dramatically more sample-efficient than one that doesn't.

The Grand Unified Vision: A foundation model that knows the graph of knowledge (KG), can reason over it in language (LLM), can generate new entities and relations (diffusion), and can learn optimal strategies for navigating and extending it (RL). This is not science fiction — it's the direction every major lab is moving toward.
ConvergenceCurrent stateKey open questionHorizon
Graph + LLMCascade architectures (LLM then GNN)Unified end-to-end architecture2025-26
Graph + DiffusionDiGress/GDSS for small moleculesProtein/material scale; 3D conditioning2025-27
Graph + RLGCPN; agents on KGCausal graph as world model2026-28
Graph Foundation ModelSpecialized per-domainUniversal graph vocabulary2026-28
The "Graph + LLM" research frontier aims for end-to-end differentiable graph-language reasoning. What does "end-to-end differentiable" mean in this context?

Chapter 6: Practical Advice

You've learned the theory. Now, how do you actually use it? When you sit down with a new graph ML problem, which method do you reach for? Here's the decision procedure that experienced practitioners use.

Step 1: What is your task?

Node-level task (predict a property of each node): Use GCN or GraphSAGE for homophilic graphs (neighbors tend to have same label). Use GAT if you want the model to learn which neighbors are important. Use GIN if you need maximum expressiveness. Large graph? Use ClusterGCN or GraphSAGE with neighbor sampling.
Graph-level task (predict a property of the whole graph): Add a global pooling layer after your GNN. Sum pooling + GIN = maximum expressiveness. Mean pooling + GCN = good default. Hierarchical pooling (DiffPool) if the graph has natural hierarchical structure (e.g., molecular scaffolds).
Link prediction task (predict whether an edge exists): Use node2vec or GraphSAGE embeddings + dot product decoder for simple cases. Use R-GCN or CompGCN for knowledge graphs with typed relations. Use LightGCN for recommender systems (clean, fast, surprisingly strong).

Step 2: What is your graph?

Graph typeRecommended methodWhy
Homogeneous, small (<100K nodes)GCN or GINFast, well-understood, good baselines
Homogeneous, large (>1M nodes)GraphSAGE + mini-batchNeighbor sampling scales linearly
Heterogeneous (typed nodes/edges)R-GCN or HGTSeparate weight matrices per relation type
Knowledge graph (triples)TransE/RotatE + GNNGeometric relation embedding + propagation
Molecular graphMPNN or DMPNNDesigned for chemical graph features
Text-rich node attributesLLM encoder + GNNLLM encodes text; GNN aggregates structure
No node featuresAdd positional encodings firstPE breaks symmetry; any GNN can then learn

Step 3: The PyG Ecosystem

PyTorch Geometric (PyG) is the standard library. It implements 100+ GNN layers, all major datasets, and handles the data loading/batching complexity of graphs. Start with torch_geometric.nn.GCNConv or SAGEConv, compose layers, add a linear head, train with standard PyTorch. For large-scale graphs: DGL (Deep Graph Library) has better distributed training support.

PyG: minimal GNN classifier
import torch
from torch_geometric.nn import GCNConv, global_mean_pool
import torch.nn.functional as F

class SimpleGNN(torch.nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head  = torch.nn.Linear(hidden, out_dim)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))   # Layer 1: aggregate
        x = F.relu(self.conv2(x, edge_index))   # Layer 2: aggregate
        x = global_mean_pool(x, batch)          # Graph pooling
        return self.head(x)                      # Classification head

# That's a complete graph classifier. ~15 lines of real PyG code.
# Swap GCNConv → GINConv for max expressiveness.
# Swap GCNConv → GATConv for attention aggregation.
You have a graph with 50M nodes, no node features, and you need node-level classification. Which combination best addresses both scale and the missing features?

Chapter 7: Connections — All 19 Lectures

Here is the complete map of CS224W. Every lecture, in order, with its core contribution and the one key paper to read if you want to go deeper.

LecTopicCore contributionKey paper
01IntroductionGraphs are the right abstraction for relational dataLeskovec et al., "Snap"
02Node EmbeddingsDeepWalk, Node2Vec: random walk → embeddingPerozzi et al. 2014; Grover & Leskovec 2016
03GNN BasicsGCN: spectral → spatial aggregationKipf & Welling 2017
04GNN DesignGraphSAGE: inductive sampling; GAT: attentionHamilton et al. 2017; Velickovic et al. 2018
05GNN AugmentationVirtual nodes, edge features, graph poolingHu et al. 2020 (OGB)
06GNN Theory IWL test as expressiveness ceiling; GIN maximizes itXu et al. 2019 (GIN)
07GNN Theory IIGraph Transformers; positional encodingsRampasek et al. 2022 (GPS)
08Transformers on GraphsAttention over all pairs; long-range dependenciesKreuzer et al. 2021
09Heterogeneous GraphsTyped nodes/edges; R-GCN; HGTSchlichtkrull et al. 2018 (R-GCN)
10Knowledge GraphsTransE, RotatE: geometric KG embeddingsBordes et al. 2013; Sun et al. 2019
11Recommender SystemsLightGCN: stripped GCN for bipartite user-itemHe et al. 2020 (LightGCN)
12Relational DLRelBench: multi-table ML as GNN on join graphsHu et al. 2024 (RelBench)
13RDL AdvancedTemporal join graphs; feature engineering automation
14Advanced TopicsEquivariance, symmetry, SE(3)-GNNsThomas et al. 2018 (TFN)
15KG Foundation ModelsULTRA: universal KG reasoning via relation graphsGalkin et al. 2023 (ULTRA)
16LLM + GNNCascade and joint integration of language + structureChen et al. 2024 (LLaGA)
17Agents + GraphsReAct/Reflexion on KGs; RL for agent optimizationYao et al. 2023; Sun et al. 2024
18Graph GenerationGraphRNN, GCPN, JT-VAE, DiGressYou et al. 2018; Vignac et al. 2023
19ConclusionThe unified arc from embeddings to foundation modelsYou are here.

Recommended Reading (Beyond the Course)

The Graph ML Knowledge Graph

A graph of the course topics and their connections. Click a node to see what it depends on and what it enables. (The course itself is a graph.)

"The structure of a graph encodes what matters about its domain. To ignore structure is to throw away the most important information. To reason over structure is to understand the domain itself."

— Jure Leskovec
Course Complete
You've covered 19 lectures, 100+ years of research in graph ML, and the tools to contribute to the frontier. The field is yours to advance.
← Back to Gleams