We started with a simple question: how do you learn from data that has structure? Nineteen lectures later, you have the full answer — and the tools to push the frontier further.
You started the course with one core observation: most interesting data isn't a grid of pixels or a linear sequence of words. It's a web of relationships. Proteins interact with proteins. Users connect to items. Atoms bond to atoms. Papers cite papers. Genes regulate genes. The graph is the native data structure for this kind of knowledge.
The central challenge: how do you make a neural network that respects graph structure? You can't unroll a graph into a vector without destroying the structure. You can't apply convolution (which assumes a grid) to an irregular graph. The field needed a new paradigm — and that paradigm is message passing.
Plotting the major methods by their expressive power (how much graph structure they capture) and scale (how large a graph they can handle). Click a point to see the lecture.
Everything in graph ML reduces to a small set of ideas. If you understand these deeply, you can derive the rest. Here they are, one by one.
At every layer l, each node v aggregates messages from its neighbors N(v), then updates its own representation:
The choice of AGG (sum, mean, max, attention-weighted) and UPD (MLP, GRU) defines the GNN variant. Sum = GIN (most expressive). Mean = GCN (normalizes by degree). Attention-weighted = GAT (learnable importance). The framework unifies hundreds of papers.
The Weisfeiler-Lehman graph isomorphism test is the theoretical ceiling for message-passing GNNs. Two graphs that fool the 1-WL test (assign the same color sequence) are indistinguishable by any message-passing GNN, no matter how deep or wide. Regular graphs of the same size fool 1-WL. Going beyond requires higher-order WL (k-WL) or structural encodings.
Without positional encoding, all nodes in a symmetric graph get the same embedding. There's no way to distinguish node 3 from node 7 in a regular graph — they look identical to the GNN. Positional encodings break this symmetry by giving each node a unique structural signature: random walk landing probabilities, Laplacian eigenvectors, or distance to anchor nodes.
Many tasks need a graph-level representation (predict if this molecule is toxic). Graph pooling aggregates node representations into a single graph vector. Simple: sum/mean all node embeddings. Better: hierarchical pooling (DiffPool, MinCutPool) that mimics the hierarchical structure of real graphs.
Watch one layer of GNN message passing. Each node collects colored messages from its neighbors, aggregates them, and produces a new representation. Click "Next Layer" to see the representations after aggregation.
Graph ML is not one field solving one problem — it's a general framework that has been successfully applied across six major domains. Each domain has different graph types, different tasks, and different engineering challenges.
| Domain | Graph type | Task | Key method | Impact |
|---|---|---|---|---|
| Drug discovery | Molecular graphs (atoms = nodes, bonds = edges) | Property prediction, generation | MPNN, JT-VAE, GCPN | Insilico: Phase II drug in 18 months |
| Social networks | User-user, user-item bipartite | Link prediction, recommendation | GraphSAGE, LightGCN, PinSage | Pinterest: 150M MAU recommendations |
| Knowledge graphs | Entity-relation-entity triples | Link prediction, Q&A | TransE, RotatE, GNN-KG agents | Wikidata query answering |
| Recommender systems | User-item interaction bipartite | Rating prediction, ranking | NGCF, LightGCN, IDCF | Deployed at Netflix, Amazon, Alibaba |
| Relational databases | Table join graphs | Prediction over multi-table data | RelBench, HeteroGNN | Cross-table ML without feature engineering |
| 3D structure | Point clouds + k-NN graphs | Protein structure, material properties | SE(3)-GNN, AlphaFold, GVP | Protein folding solved; material design |
Each domain has a characteristic scale (nodes) and graph type. Click a domain to see its typical graph structure and key challenge.
The field of graph ML has changed completely in eight years. Not just incremental improvement — a fundamental shift in what's possible and how we think about the problem.
Kipf and Welling publish GCN (December 2016/published 2017). It's a 2-layer network on the Cora citation graph, 1433 node features, 7 classes, 2708 nodes. Training takes seconds. The key insight: spectral graph convolution can be approximated by simple first-order neighborhood aggregation. This creates the field.
OGB (Open Graph Benchmark) launches in 2020 — standard large-scale benchmarks. ogbn-products (2.4M nodes), ogbn-papers100M (111M nodes). Suddenly the question isn't "does it work on Cora?" but "does it work at web scale?" Mini-batch training, neighbor sampling (GraphSAGE), cluster-GCN, SIGN emerge to handle the scale challenge.
The most dramatic shift: graph foundation models. Instead of training a separate GNN for each task and graph, can we train one massive model that transfers across graphs? This mirrors what GPT-3 did for text: one model, many tasks, zero-shot generalization.
| Year | Milestone | Key innovation | Scale |
|---|---|---|---|
| 2016/17 | GCN | Spectral → spatial approximation | 2.7K nodes |
| 2017 | GraphSAGE | Inductive learning, neighbor sampling | 100K nodes |
| 2018 | GAT, GIN | Attention aggregation; expressiveness theory | — |
| 2019 | PyG 1.0 | Standard implementation framework | — |
| 2020 | OGB, DGL-KE | Large-scale benchmarks; KG embeddings | 111M nodes |
| 2022 | Graph Transformers | Global attention + positional encoding | — |
| 2023 | LLM+GNN, RelBench | Text-rich graphs, relational deep learning | — |
| 2024 | Graph Foundation Models | Cross-graph transfer learning | Emerging |
The most important thing a good course can give you is not just what we know, but what we don't know. Here are the four hardest open problems in graph ML — problems where a PhD thesis could make a real contribution.
The web-scale graph (Facebook social graph: 3B nodes, 300B edges) is too large for any current full-batch GNN. Even mini-batch methods with neighbor sampling struggle: sampling 2-hop neighborhoods for batch size 1000 might touch millions of nodes. Current workarounds (GraphSAGE, cluster-GCN, SIGN) all introduce approximation errors. A principled approach to exact inference on billion-node graphs doesn't exist yet.
We know 1-WL GNNs can't distinguish certain graph pairs (regular graphs, triangles vs. 4-cycles). Higher-order WL (2-WL, 3-WL) is strictly more expressive but costs O(n²) or O(n³) per layer — unusable at scale. Is there a way to get k-WL expressiveness with O(n) cost? Some recent work (subgraph GNNs, DS-WL) takes steps toward this, but there's no clean answer yet.
Real graphs change over time. Twitter's follow graph changes every second. A protein interaction network changes during cell differentiation. Current GNNs are static: they're trained on a snapshot and can't efficiently update when edges/nodes appear or disappear. Temporal GNNs (TGAT, TGN) exist but are expensive. Efficient streaming graph learning remains open.
When a GNN predicts that a molecule is toxic, which structural patterns caused that prediction? GNNExplainer, PGExplainer, and similar methods attempt to find important subgraphs, but their fidelity is limited: the "explanation" doesn't always faithfully reflect the model's actual computation. For high-stakes domains (drug safety, credit scoring), this is a blocker for deployment.
Click an open problem to see its current state, best partial solutions, and what a breakthrough would unlock.
Three convergences are defining the next five years of graph ML research. Each combines graph learning with a currently dominant paradigm from the broader AI field.
As covered in Lecture 16: LLMs give rich semantic understanding of node text; GNNs give structural reasoning. The question for 2025-2027: can we build a unified architecture that processes both modalities without treating them as separate? Graph Tokens approaches (mapping graph substructures to special tokens in the LLM's vocabulary) are the leading direction. If successful, this could enable LLMs to "think in graphs" natively.
As covered in Lecture 18: diffusion models for graph generation are achieving state-of-the-art on molecular benchmarks. The open question: can these scale to protein-level graphs (thousands of nodes), and can the diffusion process be conditioned on arbitrary molecular properties rather than just property scores? Structure-based drug design (conditioning on 3D protein pocket geometry) is the target application.
As covered in Lecture 17: RL for agent optimization on knowledge graphs. Beyond this: using graphs as the world model in RL. The environment's causal structure (which variables affect which other variables) is a graph. An agent that can learn and exploit this causal graph will be dramatically more sample-efficient than one that doesn't.
| Convergence | Current state | Key open question | Horizon |
|---|---|---|---|
| Graph + LLM | Cascade architectures (LLM then GNN) | Unified end-to-end architecture | 2025-26 |
| Graph + Diffusion | DiGress/GDSS for small molecules | Protein/material scale; 3D conditioning | 2025-27 |
| Graph + RL | GCPN; agents on KG | Causal graph as world model | 2026-28 |
| Graph Foundation Model | Specialized per-domain | Universal graph vocabulary | 2026-28 |
You've learned the theory. Now, how do you actually use it? When you sit down with a new graph ML problem, which method do you reach for? Here's the decision procedure that experienced practitioners use.
| Graph type | Recommended method | Why |
|---|---|---|
| Homogeneous, small (<100K nodes) | GCN or GIN | Fast, well-understood, good baselines |
| Homogeneous, large (>1M nodes) | GraphSAGE + mini-batch | Neighbor sampling scales linearly |
| Heterogeneous (typed nodes/edges) | R-GCN or HGT | Separate weight matrices per relation type |
| Knowledge graph (triples) | TransE/RotatE + GNN | Geometric relation embedding + propagation |
| Molecular graph | MPNN or DMPNN | Designed for chemical graph features |
| Text-rich node attributes | LLM encoder + GNN | LLM encodes text; GNN aggregates structure |
| No node features | Add positional encodings first | PE breaks symmetry; any GNN can then learn |
PyTorch Geometric (PyG) is the standard library. It implements 100+ GNN layers, all major datasets, and handles the data loading/batching complexity of graphs. Start with torch_geometric.nn.GCNConv or SAGEConv, compose layers, add a linear head, train with standard PyTorch. For large-scale graphs: DGL (Deep Graph Library) has better distributed training support.
PyG: minimal GNN classifier import torch from torch_geometric.nn import GCNConv, global_mean_pool import torch.nn.functional as F class SimpleGNN(torch.nn.Module): def __init__(self, in_dim, hidden, out_dim): super().__init__() self.conv1 = GCNConv(in_dim, hidden) self.conv2 = GCNConv(hidden, hidden) self.head = torch.nn.Linear(hidden, out_dim) def forward(self, x, edge_index, batch): x = F.relu(self.conv1(x, edge_index)) # Layer 1: aggregate x = F.relu(self.conv2(x, edge_index)) # Layer 2: aggregate x = global_mean_pool(x, batch) # Graph pooling return self.head(x) # Classification head # That's a complete graph classifier. ~15 lines of real PyG code. # Swap GCNConv → GINConv for max expressiveness. # Swap GCNConv → GATConv for attention aggregation.
Here is the complete map of CS224W. Every lecture, in order, with its core contribution and the one key paper to read if you want to go deeper.
| Lec | Topic | Core contribution | Key paper |
|---|---|---|---|
| 01 | Introduction | Graphs are the right abstraction for relational data | Leskovec et al., "Snap" |
| 02 | Node Embeddings | DeepWalk, Node2Vec: random walk → embedding | Perozzi et al. 2014; Grover & Leskovec 2016 |
| 03 | GNN Basics | GCN: spectral → spatial aggregation | Kipf & Welling 2017 |
| 04 | GNN Design | GraphSAGE: inductive sampling; GAT: attention | Hamilton et al. 2017; Velickovic et al. 2018 |
| 05 | GNN Augmentation | Virtual nodes, edge features, graph pooling | Hu et al. 2020 (OGB) |
| 06 | GNN Theory I | WL test as expressiveness ceiling; GIN maximizes it | Xu et al. 2019 (GIN) |
| 07 | GNN Theory II | Graph Transformers; positional encodings | Rampasek et al. 2022 (GPS) |
| 08 | Transformers on Graphs | Attention over all pairs; long-range dependencies | Kreuzer et al. 2021 |
| 09 | Heterogeneous Graphs | Typed nodes/edges; R-GCN; HGT | Schlichtkrull et al. 2018 (R-GCN) |
| 10 | Knowledge Graphs | TransE, RotatE: geometric KG embeddings | Bordes et al. 2013; Sun et al. 2019 |
| 11 | Recommender Systems | LightGCN: stripped GCN for bipartite user-item | He et al. 2020 (LightGCN) |
| 12 | Relational DL | RelBench: multi-table ML as GNN on join graphs | Hu et al. 2024 (RelBench) |
| 13 | RDL Advanced | Temporal join graphs; feature engineering automation | — |
| 14 | Advanced Topics | Equivariance, symmetry, SE(3)-GNNs | Thomas et al. 2018 (TFN) |
| 15 | KG Foundation Models | ULTRA: universal KG reasoning via relation graphs | Galkin et al. 2023 (ULTRA) |
| 16 | LLM + GNN | Cascade and joint integration of language + structure | Chen et al. 2024 (LLaGA) |
| 17 | Agents + Graphs | ReAct/Reflexion on KGs; RL for agent optimization | Yao et al. 2023; Sun et al. 2024 |
| 18 | Graph Generation | GraphRNN, GCPN, JT-VAE, DiGress | You et al. 2018; Vignac et al. 2023 |
| 19 | Conclusion | The unified arc from embeddings to foundation models | You are here. |
A graph of the course topics and their connections. Click a node to see what it depends on and what it enables. (The course itself is a graph.)
"The structure of a graph encodes what matters about its domain. To ignore structure is to throw away the most important information. To reason over structure is to understand the domain itself."
— Jure Leskovec