CS224W Lecture 1 — Introduction to Graph ML

Chapter 0: Why Graphs?

Consider a molecule. You could describe it as a list of atoms: carbon, carbon, hydrogen, oxygen. But that list tells you almost nothing useful. What makes a molecule a drug, or a poison, or an antibiotic, is not which atoms are present — it's how they're connected. Change one bond, and you get a completely different compound with completely different properties.

This is the central insight behind graph machine learning: relational structure — who connects to whom — carries information that can't be captured by treating entities as an independent list. Images have pixel grids. Text has sequences. But molecules, social networks, proteins, and knowledge bases are fundamentally graphs: collections of nodes (entities) connected by edges (relationships).

The key insight: A molecule isn't just a list of atoms. A social network isn't just a list of users. A protein isn't just a sequence of amino acids. The structure — the pattern of connections — is itself the data. Graph ML reads that structure.

Graphs are everywhere. Here's a sampling of the domains that CS224W covers:

Domain	Nodes	Edges	Example Task
Social networks	Users	Friendships	Detect bots
Biology	Proteins	Interactions	Predict function
Chemistry	Atoms	Bonds	Predict drug activity
Knowledge bases	Entities	Relations	Answer questions
Road networks	Intersections	Roads	Estimate travel time
Citation networks	Papers	Citations	Classify research area
Recommenders	Users + Items	Interactions	Suggest next item

Standard ML tools — CNNs, RNNs, transformers — are designed for regular structure: grids, sequences, fixed-size vectors. Graphs are irregular: each node has a different number of neighbors, and there's no natural ordering of nodes. We need new tools built from the ground up for this structure.

What this course covers: Node embeddings (DeepWalk, node2vec) → GNNs (GCN, GraphSAGE, GAT) → Graph Transformers → Knowledge Graphs → Generative Models for graphs → Applications at scale. Lectures 2–19 build on this foundation.

Graph Explorer

Click the canvas to add nodes. Click two nodes in sequence to add an edge. Notice that the graph carries structural information you can't see in a plain list of nodes.

Nodes: 0 | Edges: 0 | Avg degree: —

Why can't we just describe a molecule as a list of atoms and run standard ML on it?

Molecules have too many atoms for ML to handle The list loses relational structure — how atoms connect determines properties Standard ML can handle molecules fine; graphs are just a niche alternative

Chapter 1: Graphs as Data Structures

Before we can do machine learning on graphs, we need a precise language for describing them. A graph G = (V, E) has two ingredients: a set V of vertices (also called nodes) and a set E of edges (also called links). An edge (i, j) says that node i and node j are related.

Graphs come in several flavors. An undirected graph treats edges as symmetric: if i is connected to j, then j is connected to i (friendship). A directed graph treats edges as one-way arrows: i follows j on Twitter, but j may not follow back. Edges can also carry weights — road distance, interaction strength, confidence score.

Analogy: Think of an undirected graph as a mutual agreement ("we're friends") and a directed graph as a one-sided claim ("I follow you"). The adjacency matrix encodes exactly this asymmetry.

The Adjacency Matrix

The most compact way to represent a graph is an adjacency matrix A. For a graph with n nodes, A is an n×n matrix where A[i,j] = 1 if edge (i,j) exists, 0 otherwise. For undirected graphs, A is symmetric: A[i,j] = A[j,i]. For weighted graphs, A[i,j] holds the weight instead of 1.

A[i,j] = 1 if (i,j) ∈ E, else 0 (unweighted, undirected: A = A^T)

The degree k_v of a node v is simply the number of edges connected to it. For an undirected graph, k_v = ∑_j A[v,j] — the sum of row v in the adjacency matrix. A node with many edges is a hub; one with few is a peripheral node.

Real Networks Have Skewed Degree Distributions

In random graphs, most nodes have similar degrees. But in real networks — the Web, Twitter, protein interactions — the degree distribution P(k) follows a power law: P(k) ∝ k^−α. Most nodes have very few connections, and a tiny number of hubs have enormous connectivity. This "scale-free" property has profound implications for how information spreads, how robust a network is, and which nodes to prioritize in learning.

Power law intuition: Most Twitter users have dozens of followers. A handful — @BarackObama, @elonmusk — have hundreds of millions. That's a power law. Standard ML on i.i.d. data assumes all datapoints are similar; graphs violate that assumption completely.

Bipartite Graphs

A bipartite graph has two types of nodes, with edges only allowed between types (not within). Users and movies (Netflix), drugs and proteins (pharmacology), papers and authors (citations). Bipartite graphs are natural for recommendation problems and for representing typed knowledge.

Adjacency Matrix Explorer

Toggle directed/undirected to see how the adjacency matrix changes. In undirected mode, the matrix is symmetric. In directed mode, A[i,j] ≠ A[j,i] is possible.

For an undirected graph, what must be true about the adjacency matrix A?

A must be all zeros on the diagonal A must be symmetric: A[i,j] = A[j,i] A must contain only 0s and 1s

Chapter 2: Three Levels of Graph ML Tasks

Machine learning on graphs isn't one problem — it's three, operating at different scales. You can make predictions about individual nodes, about individual edges, or about whole graphs. Each level has its own applications, its own loss functions, and its own challenges.

The three levels also require different amounts of information. Node-level tasks only need local structure (who are my neighbors?). Edge-level tasks need pairs of nodes. Graph-level tasks need a global view of the entire graph. GNNs are designed to efficiently compute all three by passing messages through the graph.

Level 1: Node Classification

Given a graph, label each node. Examples: classify each user in a social network as bot or human; classify each paper in a citation network by research area; predict which amino acids in a protein sequence have which structural role. The graph structure provides context — a user's neighbors reveal a lot about the user.

Level 2: Link Prediction

Given a graph, predict which edges should exist but are missing (or which will exist in the future). Examples: recommend items to users (the "link" between user and item); predict drug-drug interactions; complete a knowledge graph. Formulated as: given nodes i and j, score the likelihood that edge (i,j) exists.

Level 3: Graph Classification

Given a graph (or a set of graphs), classify the whole graph as one entity. Examples: classify a molecule as a drug candidate or not; classify a program's execution graph as buggy or correct; classify a network as a brain scan from a healthy or diseased patient. The graph itself is one datapoint.

Which level is hardest? Graph classification requires aggregating information across all nodes into a single fixed-size vector — called graph pooling. This is the hardest problem: graphs have variable sizes and no natural ordering, so pooling must be permutation-invariant. Node classification is the most studied; edge-level tasks drive the biggest industry applications.

Three Levels: Interactive View

Click a task type to highlight what the ML model must predict in this graph.

Select a task above.

AlphaFold predicts the 3D coordinates of every amino acid in a protein. Which task level is this?

Node-level (predict a property of each node) Edge-level (predict connections between nodes) Graph-level (predict a property of the whole graph)

Chapter 3: Node-Level — AlphaFold

The most dramatic example of node-level graph ML is AlphaFold 2. Proteins are chains of amino acids, and for 50 years the central challenge in biology was: given the sequence, what is the 3D shape? Shape determines function. A misfolded protein causes diseases from Alzheimer's to Parkinson's.

AlphaFold solves this as a node-level prediction problem. Nodes are amino acids. Edges connect amino acids that are spatially close in 3D (or nearby in the sequence). The task: predict the (x, y, z) coordinate of each node. That's node regression — each node gets a continuous output instead of a discrete label.

The graph construction trick: AlphaFold builds a "spatial graph" where edges connect amino acids within a distance threshold. But we don't know the 3D structure yet — that's what we're predicting! So AlphaFold alternates between updating the graph (based on predicted positions) and updating the predictions (based on the graph). This iterative refinement is called the "Evoformer."

What Goes In, What Comes Out

Input: a sequence of amino acids (encoded as integer IDs) + a multiple sequence alignment (how this protein compares to evolutionary relatives). Output: (x, y, z) for every amino acid, plus a per-residue confidence score (pLDDT).

Component	Type	Shape
Amino acid sequence	Node features	[L] integers (L = sequence length)
MSA profile	Node features	[L, 22] (22 amino acid types)
Proximity edges	Edges	Sparse [L, L] adjacency
Predicted coordinates	Node output	[L, 3] floats (x, y, z)
pLDDT confidence	Node output	[L] floats in [0, 100]

AlphaFold 2 (2021) achieved median GDT score > 90 on CASP14 — matching experimental accuracy. It has since predicted structures for 200 million proteins, covering essentially all of known biology. This is what node-level graph ML looks like at its most impactful.

Why graph ML, not an RNN? A protein sequence looks like a string, so why not use an RNN or transformer on the sequence directly? Because sequence distance and spatial distance are different. Amino acids far apart in sequence can be neighbors in 3D space. The graph encodes spatial proximity, not sequential proximity.

Protein Spatial Graph

A simplified protein fragment. Nodes are amino acids. Edges connect amino acids within a distance threshold. Drag the slider to change the distance cutoff and see how the graph changes.

Distance cutoff 2.5

In AlphaFold's spatial graph, what do the edges represent?

Chemical bonds between amino acids Spatial proximity — amino acids that are close in 3D space Sequential adjacency in the amino acid chain

Chapter 4: Edge-Level — Recommenders & Drug Interactions

You're on Pinterest looking at a wedding cake. The system immediately suggests 12 more wedding cakes, 3 florists, and a calligraphy invite designer. How? Not because those items look similar in pixel space, but because millions of users pinned them to the same boards. The connections reveal similarity.

This is PinSage (Ying et al., KDD 2018), a graph-based recommender system running at Pinterest scale: 3 billion nodes, 18 billion edges. The graph connects pins (images) to boards, and boards to users. Two pins on the same board are implicitly linked. The task: given a query pin, find pins with similar embeddings z_i such that:

d(z_cake1, z_cake2) < d(z_cake1, z_sweater)

The embeddings z_i are learned by a GNN that aggregates information from each pin's neighborhood in the graph. No hand-designed features. The graph structure alone teaches the model what "similar" means.

Drug-Drug Interaction Prediction

A second edge-level problem: given two drugs, predict what adverse side effects occur when they're taken together (Zitnik et al., 2018). The graph has two types of nodes — drugs and proteins — and two types of edges: drug-protein interactions (from pharmacology databases) and drug-drug edges for known interactions. The task is to predict new drug-drug edges and label them with specific side effects.

Why is this hard without graphs? Two drugs might never directly interact, but they both interact with the same protein, which triggers a cascade. You need to reason through the protein graph to find indirect interactions. A flat feature vector for each drug can't represent this multi-hop reasoning.

Mini Recommender System

A bipartite graph: users (circles, left) and items (squares, right) connected by interactions. Select a user and a hop depth to see recommended items (those reachable within that many hops through shared connections).

Rec. depth 2-hop

Select a user to see recommendations.

In PinSage, two pins are considered similar if they appear on the same boards (even if they look nothing alike). What does this reveal about edge-level graph ML?

Graph ML is just a fancy name for nearest-neighbor search Similarity must be defined by visual features Relational context (shared connections) defines similarity better than raw features alone

Chapter 5: Graph-Level — Molecular Properties

In 2020, researchers at MIT and the Broad Institute trained a graph neural network on ~2,500 known antibiotics and asked: of the 6,000 compounds in a library of FDA-approved drugs and natural products, which ones might work against drug-resistant bacteria? The model predicted halicin. Lab tests confirmed it was effective against organisms that resist every known antibiotic, including M. tuberculosis. (Stokes et al., Cell, 2020.)

This is graph-level prediction. Each molecule is one graph. Nodes are atoms; edges are chemical bonds. The model reads the whole graph and outputs a single number: antibiotic activity. The challenge is converting a variable-size graph into a fixed-size vector — a problem called graph pooling.

How Graph Pooling Works

The simplest approach: mean pooling. Compute a feature vector h_v for each node (via GNN message passing), then average them across all nodes. The resulting vector h_G = mean({h_v}) is a fixed-size representation of the whole graph, regardless of how many nodes it has.

h_G = 1⁄ |V| ∑_v∈V h_v then ŷ = σ(W · h_G + b)

Why not just count atoms? Two molecules can have identical atom counts (same molecular formula) but completely different properties. Isomers: benzene and Dewar benzene are both C₆H₆ but have radically different stability and reactivity. The graph structure (who bonds to whom) is what matters.

Molecule Graph Explorer

Toggle between two molecules. Despite sharing atom counts, their graphs differ — leading to different predicted properties. Click a node to see its features.

Graph-level features are aggregated from all nodes.

python
# Graph-level prediction with PyTorch Geometric
import torch
from torch_geometric.nn import GCNConv, global_mean_pool
from torch.nn import Linear

class MoleculeClassifier(torch.nn.Module):
    def __init__(self, num_features, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.linear = Linear(hidden_dim, num_classes)

    def forward(self, x, edge_index, batch):
        # x: [num_nodes, num_features] — atom features
        # edge_index: [2, num_edges] — bond connections
        # batch: [num_nodes] — which graph each node belongs to

        h = self.conv1(x, edge_index).relu()    # [num_nodes, hidden]
        h = self.conv2(h, edge_index).relu()    # [num_nodes, hidden]

        # Global mean pool: average over all nodes in each graph
        h_G = global_mean_pool(h, batch)          # [num_graphs, hidden]

        return self.linear(h_G)                   # [num_graphs, num_classes]

What problem does graph pooling solve in graph-level prediction?

It makes training faster by reducing the number of edges It converts variable-size graphs into fixed-size vectors for classification It removes noisy nodes before making predictions

Chapter 6: Node Features — Degree, Centrality, Clustering

Before GNNs learned features automatically, researchers hand-engineered them. Understanding these hand-crafted features builds intuition for what GNNs later learn to compute automatically. Each feature captures a different aspect of a node's role in the graph.

Node Degree

The simplest feature: degree k_v = number of edges incident to v. High degree = hub. But degree alone is limited: a node with 5 connections in a tight clique plays a very different role than a node with 5 connections bridging separate communities, even though their degrees are identical.

Clustering Coefficient

The clustering coefficient C_v measures how interconnected v's neighbors are. Specifically: what fraction of all possible edges among v's neighbors actually exist?

C_v = 2 · |{edges among N(v)}| / (k_v · (k_v − 1)) C_v ∈ [0, 1]

C_v = 1: all neighbors are connected to each other (tight clique). C_v = 0: no two neighbors are connected (v is a bridge). In social networks, clustering coefficient identifies community members (high C) vs. connectors between communities (low C).

Centrality Measures

Centrality answers: "how important is this node?" Three definitions, each capturing something different:

Measure	Definition	High value means
Degree centrality	k_v / (n−1)	Many direct connections
Betweenness	Fraction of shortest paths passing through v	Bridge between groups
Closeness	1 / avg. distance to all other nodes	Close to everyone
Eigenvector	Importance from important neighbors	Connected to hubs

Which centrality to use? Betweenness finds "brokers" — nodes that control information flow between groups. Closeness finds "influencers" who spread information fastest. Eigenvector (the basis for PageRank) finds nodes embedded in important neighborhoods. The right choice depends on the task.

Node Feature Inspector

Hover over or tap a node to see its degree, clustering coefficient, and betweenness centrality. High-centrality nodes are highlighted.

Hover over a node to inspect its features.

A node has 5 neighbors, and none of them are connected to each other. What is its clustering coefficient?

1.0 (perfectly clustered) 0.5 (half clustered) 0.0 (no edges among neighbors)

Chapter 7: Graph Features — Kernels and Motifs

Node features describe individual nodes. But to compare entire graphs — "is this molecular graph similar to that one?" — we need features that describe the whole structure. The key challenge: two graphs with the same nodes and edges but with nodes permuted are the same graph. Graph features must be permutation-invariant.

Graphlets: Counting Subgraph Patterns

A graphlet is a small connected subgraph. For 3 nodes, there are 4 graphlets: a path, a triangle, a star with 2 edges, and a complete triangle. For 4 nodes there are 15. The graphlet degree vector (GDV) for a node counts how many times each graphlet appears touching that node — a richer fingerprint than simple degree.

At the whole-graph level, the graphlet kernel counts how many of each graphlet type appear in the entire graph, producing a histogram. Two molecules with similar graphlet histograms have similar local structural patterns, and thus likely similar properties.

Weisfeiler-Lehman Graph Kernel

The WL kernel is an iterative algorithm that hashes each node's neighborhood label into a new label, propagates these labels for K rounds, and then compares histograms of final labels between graphs. It's fast, and — crucially — the message-passing computation of modern GNNs is mathematically equivalent to WL. Understanding WL = understanding what GNNs can and cannot distinguish.

The WL connection: GNNs are at most as powerful as the WL graph isomorphism test. Two graphs that WL cannot distinguish, no GNN can distinguish either. This fundamental limitation (proved in Xu et al., 2019) motivated Graph Isomorphism Networks (GIN) — the GNN that exactly matches WL's expressive power. Graph features and GNN theory are deeply intertwined.

Degree Histogram

The simplest graph feature: count how many nodes have each degree, and store this as a histogram vector. Two random Erdös-Rényi graphs look similar; a social network has a power-law tail. The degree histogram captures this basic structural signature cheaply.

Graph Comparison via Structure

Two graphs shown side by side. Their degree histograms appear as bar charts below. Similar histogram distributions indicate similar structural patterns.

Why must graph-level features be permutation-invariant?

To make computation faster on large graphs Because graphs are always stored in sorted order The same graph with nodes labeled differently must produce the same feature

Chapter 8: The ML Pipeline for Graphs

Now that we understand the three task levels and the kinds of features graphs have, let's look at the two paradigms for machine learning on graphs. They represent the "before" and "after" of the graph ML revolution.

Traditional Pipeline: Hand-Designed Features

Before GNNs, the workflow was:

Domain Expertise

Decide which features matter (degree? clustering? motifs?)

↓

Feature Extraction

Compute those features for every node/edge/graph

↓

Standard ML

Train SVM, random forest, or logistic regression on feature vectors

↓

Prediction

Output label for node / edge / graph

The problem: feature engineering is manual, domain-specific, and expensive. Features that work for social networks fail for molecules. Features that work for node classification fail for link prediction. You need a human expert for every new domain.

The GNN Revolution: Learned Features

The key insight of modern graph ML: learn the features automatically from the graph structure. A GNN replaces the hand-design step entirely.

Raw Graph

Nodes, edges, and initial features (atom type, user age, etc.)

↓

GNN (Message Passing)

Each node aggregates information from its neighbors, iteratively

↓

Learned Embeddings

Each node/graph now has a rich vector z encoding its structural role

↓

Prediction Head

Linear layer maps embeddings to node labels / edge scores / graph classes

Why message passing? A node can only see its immediate neighbors in one GNN layer. After K layers, it has seen its K-hop neighborhood. This is the graph equivalent of a receptive field in a CNN. More layers = wider view, but too many layers cause "over-smoothing" — all nodes start looking the same.

The GNN Formula (Preview)

At each layer, a GNN updates each node's representation by:

Aggregate — collect representations from all neighbors
Combine — merge the neighborhood summary with the node's own representation
Transform — apply a learnable linear layer + nonlinearity

h_v^(k) = σ (W_k · AGGREGATE ({h_u^(k-1) : u ∈ N(v)}) + B_k · h_v^(k-1))

This single formula, with different choices of AGGREGATE (mean, max, sum, attention), gives rise to GCN, GraphSAGE, GAT, and GIN — the four foundational GNN architectures that Lectures 6–8 of CS224W cover in depth.

Traditional ML on graphs:
+ Interpretable features
+ No training data needed
− Manual feature engineering
− Not end-to-end
− Domain-specific

GNNs on graphs:
+ End-to-end learning
+ Domain-agnostic
+ Captures structural patterns automatically
− Needs labeled training data
− Less interpretable

What does "over-smoothing" mean in GNNs, and when does it happen?

The training loss is too smooth; the model doesn't learn The graph has too many edges and gets computationally expensive With too many layers, all node representations converge and become indistinguishable

Chapter 9: Connections — Where to Go From Here

Lecture 1 of CS224W is an invitation. You've seen the three task levels, the canonical applications, and the conceptual shift from hand-designed features to learned GNN representations. Every subsequent lecture in CS224W builds directly on these foundations.

CS224W Roadmap

Lectures	Topic	Key Methods
2–3	Node Embeddings	DeepWalk, node2vec, LINE
4–5	Link Analysis	PageRank, HITS, SimRank
6–8	GNN Foundations	GCN, GraphSAGE, GAT, GIN
9–10	GNN Theory	Expressive power, WL test, over-smoothing
11–12	Knowledge Graphs	TransE, RotatE, KGNN
13–14	Scalable GNNs	GraphSAINT, cluster-GCN, neighbor sampling
15–16	Graph Transformers	GPS, Graphormer
17–19	Generative Models	GraphRNN, GDSS, graph diffusion

Key Papers from This Lecture

Ying et al. "Graph Convolutional Neural Networks for Web-Scale Recommender Systems." KDD 2018 — PinSage interactive lesson
Jumper et al. "Highly accurate protein structure prediction with AlphaFold." Nature 2021.
Stokes et al. "A Deep Learning Approach to Antibiotic Discovery." Cell 2020.
Zitnik et al. "Modeling polypharmacy side effects with graph convolutional networks." Bioinformatics 2018.
Xu et al. "How Powerful are Graph Neural Networks?" ICLR 2019.

Implementation Resources

PyTorch Geometric (PyG) is the standard library for GNN research. It provides efficient sparse graph operations, pre-built GNN layers (GCNConv, SAGEConv, GATConv, GINConv), and datasets (Cora, CiteSeer, OGB). Install with:

bash
# Install PyTorch Geometric (after installing PyTorch)
pip install torch_geometric

# The Open Graph Benchmark — standard evaluation datasets
pip install ogb

# Load the Cora citation network (classic node classification benchmark)
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root='/tmp/Cora', name='Cora')
data = dataset[0]
# data.x: [2708, 1433] — node features (bag-of-words)
# data.edge_index: [2, 10556] — graph connectivity (COO format)
# data.y: [2708] — node labels (7 classes, one per research area)

Textbook: Hamilton, "Graph Representation Learning" (2020). Free PDF at cs.mcgill.ca/~wlh/grl_book/. Dense but authoritative. Chapters 1–2 cover everything from this lecture in greater depth.

What You Can Now Do

After Lecture 1, you can:

Represent any relational dataset as a graph G = (V, E)
Identify whether a task is node-level, edge-level, or graph-level
Compute degree, clustering coefficient, and centrality for any node
Describe the key applications: PinSage, AlphaFold, halicin discovery
Explain why hand-designed features are limiting and what GNNs replace
Load a graph dataset with PyTorch Geometric and inspect its structure

"The history of graph theory is the history of people discovering that everything is connected to everything else — and then trying to figure out what to do about it."
— paraphrased from Jure Leskovec, CS224W Lecture 1