Workbook — Graph Neural Networks (CS224W)

Graph Neural Networks

From adjacency matrices to molecular property prediction. Every GNN concept derived by hand with concrete graphs, exact numbers, and working code — no hand-waving.

Prerequisites: Basic linear algebra (matrix multiply, eigenvalues) + Probability (softmax, dot products). That's it.
10
Chapters
56
Exercises
5
Exercise Types
Mastery
0 / 56 exercises (0%)
0
Day Streak
Best: 0

Chapter 0: Graph Basics

Before touching neural networks, you need to be fluent in the language of graphs. Every GNN algorithm operates on adjacency matrices, degree matrices, and local neighborhood structure. If you cannot construct these by hand for a small graph, you have no hope of understanding what a GNN actually computes.

Consider this 5-node undirected graph:

Edges: (0,1), (0,2), (1,2), (1,3), (2,3), (3,4)

Adjacency matrix A (symmetric for undirected):
    0 1 2 3 4
0 [ 0 1 1 0 0 ]
1 [ 1 0 1 1 0 ]
2 [ 1 1 0 1 0 ]
3 [ 0 1 1 0 1 ]
4 [ 0 0 0 1 0 ]

Degree matrix D = diag(deg(0), deg(1), ..., deg(4)):
D = diag(2, 3, 3, 3, 1)
The degree is the row sum. For undirected graphs, deg(v) = ∑u A[v][u]. The degree matrix D is diagonal with D[i][i] = deg(i). These two matrices — A and D — are the foundation of every GNN computation.
Exercise 0.1: Edge Count Derive

How many edges does the 5-node graph above have? (Hint: count 1s in A, then adjust for symmetry.)

edges
Show derivation
Number of 1s in A = 12
Each undirected edge appears twice (A[i][j] and A[j][i])
Edges = 12 / 2 = 6

Equivalently, |E| = ½ ∑i deg(i) = ½(2 + 3 + 3 + 3 + 1) = 6. The sum of all degrees always equals twice the number of edges — this is the handshaking lemma.

Exercise 0.2: Average Degree Derive

What is the average degree of the 5-node graph?

avg degree
Show derivation
Degrees: [2, 3, 3, 3, 1]
Average = (2 + 3 + 3 + 3 + 1) / 5 = 12 / 5 = 2.4

Equivalently, average degree = 2|E| / |V| = 2 × 6 / 5 = 2.4. For sparse real-world graphs (social networks, molecules), the average degree is typically O(1) even as |V| grows to millions.

Exercise 0.3: Clustering Coefficient Derive

Compute the local clustering coefficient for node 1. The formula is C(v) = 2 × (edges among neighbors) / (deg(v) × (deg(v) - 1)). Node 1's neighbors are {0, 2, 3}.

Count how many edges exist among {0, 2, 3}, then plug into the formula.

clustering coeff
Show derivation
Neighbors of node 1: {0, 2, 3}
Edges among neighbors: (0,2) ✓, (0,3) ✗, (2,3) ✓ → 2 edges
C(1) = 2 × 2 / (3 × 2) = 4 / 6 = 0.667

The clustering coefficient measures how close a node's neighbors are to forming a clique. C = 1 means all neighbors are connected to each other. C = 0 means no neighbors are connected. Node 1 has a fairly high clustering coefficient — 2 out of 3 possible neighbor-pairs are connected.

Exercise 0.4: A² Interpretation Trace
If you compute A² (adjacency matrix squared), what does entry A²[i][j] represent?
Show explanation
A²[i][j] = ∑k A[i][k] × A[k][j]

Each term A[i][k] × A[k][j] = 1 only if there's an edge i→k AND k→j — a length-2 path through k. The sum counts all such paths. More generally, An[i][j] counts paths of length n. This is why matrix powers of A encode multi-hop connectivity — and why GNN layers (which multiply by A) aggregate information from k-hop neighborhoods after k layers.

Exercise 0.5: A²[0][3] Derive

Using the adjacency matrix above, compute A²[0][3]. That is, how many 2-hop paths exist from node 0 to node 3?

paths
Show derivation
A²[0][3] = ∑k A[0][k] × A[k][3]
k=0: A[0][0]×A[0][3] = 0×0 = 0
k=1: A[0][1]×A[1][3] = 1×1 = 1   // path 0→1→3
k=2: A[0][2]×A[2][3] = 1×1 = 1   // path 0→2→3
k=3: A[0][3]×A[3][3] = 0×0 = 0
k=4: A[0][4]×A[4][3] = 0×1 = 0
A²[0][3] = 0 + 1 + 1 + 0 + 0 = 2

Two 2-hop paths: 0→1→3 and 0→2→3. Node 0 has no direct edge to node 3 (A[0][3] = 0), but after 2 hops, there are 2 ways to get there. This is exactly the information a 2-layer GNN would aggregate.

Exercise 0.6: Implement degreeMatrix() Build

Given an adjacency matrix as a 2D array, return the degree matrix (also 2D, diagonal).

D[i][i] = sum of row i of A. All off-diagonal entries are 0.
Show solution
javascript
function degreeMatrix(A) {
  const n = A.length;
  const D = Array.from({length: n}, () => Array(n).fill(0));
  for (let i = 0; i < n; i++) {
    D[i][i] = A[i].reduce((s, v) => s + v, 0);
  }
  return D;
}

Chapter 1: Node Embeddings

Before GNNs existed, the dominant approach was to learn node embeddings using random walks. The idea: if two nodes appear near each other in random walks on the graph, they should have similar embedding vectors. This is the graph equivalent of word2vec — "you shall know a node by the company it keeps."

DeepWalk: Uniform random walk from each node, then apply skip-gram.
Node2Vec: Biased random walk with parameters p (return) and q (in-out).

At node t, having come from node s, probability of moving to node x:
α(s, x) = 1/p if d(s,x) = 0   // return to source
α(s, x) = 1    if d(s,x) = 1   // move to neighbor of source
α(s, x) = 1/q if d(s,x) = 2   // move away from source

Unnormalized probability: α(s, x) × w(t, x)
where w(t, x) = 1 if edge (t,x) exists, 0 otherwise.
p controls going back, q controls exploration. Low p = high probability of returning (BFS-like, captures local structure). Low q = high probability of exploring outward (DFS-like, captures global community structure). DeepWalk is the special case p = q = 1.
Exercise 1.1: Random Walk Neighbors Trace

Using the 5-node graph from Ch 0, starting at node 0 with a uniform random walk: what are the possible nodes at step 1? And from node 1 (if we land there), what are the possible nodes at step 2?

If we start at node 0 and take 2 steps of a uniform random walk, which node is IMPOSSIBLE to reach?
Show trace
Step 0: node 0
Step 1: neighbors of 0 = {1, 2}, each with prob 1/2
If step 1 = node 1 → step 2 ∈ {0, 2, 3}, each with prob 1/3
If step 1 = node 2 → step 2 ∈ {0, 1, 3}, each with prob 1/3
Reachable at step 2: {0, 1, 2, 3}. Node 4 NOT reachable.

Node 4 is 3 hops from node 0 (0→1→3→4 or 0→2→3→4). This means a walk of length 2 cannot possibly reach it. Longer walks capture more distant structural information — but also blur local patterns.

Exercise 1.2: Node2Vec Transition Derive

In Node2Vec with p=1, q=2: we just walked from node 2 to node 1 (so source s=2, current t=1). The neighbors of node 1 are {0, 2, 3}. Compute the normalized transition probabilities to each neighbor.

For each neighbor x of t=1: check d(s=2, x), compute α, then normalize.

P(move to node 3)
Show derivation
Current t=1, source s=2. Neighbors of 1: {0, 2, 3}
x=0: d(s=2, x=0) = 1 (edge 2-0 exists) → α = 1
x=2: d(s=2, x=2) = 0 (x IS the source) → α = 1/p = 1/1 = 1
x=3: d(s=2, x=3) = 1 (edge 2-3 exists) → α = 1

Wait — let's re-check. d(s,x) is shortest-path distance in the original graph between s=2 and x.

d(2,0) = 1 (direct edge) → α = 1
d(2,2) = 0 → α = 1/p = 1
d(2,3) = 1 (direct edge) → α = 1

Hmm, all neighbors of t=1 happen to be direct neighbors of s=2, so α = 1 for all. But for q to matter, we need a neighbor of t that is NOT a neighbor of s. Let's consider a different scenario.

Actually, let's trace from s=0 to t=1 instead, where node 3 is a neighbor of t=1 but d(0,3) = 2:

s=0, t=1. Neighbors of 1: {0, 2, 3}
x=0: d(0,0) = 0 → α = 1/p = 1/1 = 1
x=2: d(0,2) = 1 (edge 0-2 exists) → α = 1
x=3: d(0,3) = 2 (no direct edge) → α = 1/q = 1/2 = 0.5
Sum = 1 + 1 + 0.5 = 2.5
P(x=0) = 1/2.5 = 0.4,   P(x=2) = 1/2.5 = 0.4,   P(x=3) = 0.5/2.5 = 0.2

With q=2, the walk is discouraged from moving to node 3 (which is far from the source). This makes the walk more BFS-like, staying near the source's local neighborhood. With q=0.5, it would be encouraged to explore outward.

Exercise 1.3: Skip-gram Objective Trace
In the skip-gram objective for DeepWalk, given a random walk [..., u, v, w, ...] with window size 1, what pairs are used as positive training examples?
Show explanation

Skip-gram takes each node in the walk as a "center word" and pairs it with all nodes within the window. With window=1, node v is paired with u (left context) and w (right context). When u is the center, it's paired with v. When w is the center, it's paired with v. All symmetric pairs are generated.

Maximize: ∑centercontext log P(context | center)
P(context | center) = softmax(zcontextT · zcenter)

In practice, negative sampling replaces the expensive softmax. For each positive pair, sample k random "negative" nodes and train a binary classifier to distinguish real context from noise.

Exercise 1.4: Embedding Dimension Trade-off Trace
A graph has 10,000 nodes. You're choosing an embedding dimension d for DeepWalk. Which is the best reasoning?
Show reasoning

Typical embedding dimensions are 64-256. Too small and you can't capture the graph's structure. Too large and you overfit (especially with limited random walks) and waste memory. The embedding table has |V| × d parameters — at d = 10,000 that's 100M parameters just for the lookup table on a 10K-node graph, which is absurdly overparameterized.

Exercise 1.5: Implement randomWalk() Build

Given an adjacency list (array of arrays) and a start node, return a random walk of the specified length. Use uniform random neighbor selection (DeepWalk-style).

Walk[0] = start. At each step, pick a uniform random neighbor. Return the full path. For testing, we'll check the walk is valid (each step follows an edge).
Show solution
javascript
function randomWalk(adjList, start, length) {
  const walk = [start];
  let curr = start;
  for (let i = 0; i < length; i++) {
    const neighbors = adjList[curr];
    curr = neighbors[Math.floor(Math.random() * neighbors.length)];
    walk.push(curr);
  }
  return walk;
}

Chapter 2: GNN Message Passing

The fundamental operation in any GNN is message passing: each node collects features from its neighbors, aggregates them, and updates its own representation. One layer of a Graph Convolutional Network (GCN) computes:

GCN layer (simplified):
H(l+1) = σ(D̄-1 Ã H(l) W(l))

Where:
à = A + I   // adjacency + self-loops
D̄ = degree matrix of à   // D̄[i][i] = ∑j Ã[i][j]
H(l): [N, din]   // node features at layer l
W(l): [din, dout]   // learnable weight matrix
σ: nonlinearity (ReLU)

What D̄-1Ã does: normalizes each row by the degree.
Row i of D̄-1Ã = [Ã[i][0]/deg(i), Ã[i][1]/deg(i), ...]
So the update for node i is: average of its own + neighbors' features.
Message passing = neighborhood averaging. D-1A replaces each node's features with the mean of its neighbors' features. Adding self-loops (A + I) ensures the node also keeps its own features in the average. The weight matrix W then projects these averaged features to learn useful representations.

Let's trace this on a concrete 4-node graph:

4-node path graph: 0 — 1 — 2 — 3
A = [[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]]
à = A + I = [[1,1,0,0],[1,1,1,0],[0,1,1,1],[0,0,1,1]]
D̄ = diag(2, 3, 3, 2)
X = [[1,0],[0,1],[1,1],[0,0]]   // 4 nodes, 2 features each
W = [[1,0],[0,1]]   // identity (for clarity)
Exercise 2.1: Normalized Adjacency Derive

Compute D̄-1Ã for the 4-node path graph above. What is the entry at row 1, column 2? (Row 1 = node 1's update equation.)

(D̄-1Ã)[1][2]
Show derivation
-1 = diag(1/2, 1/3, 1/3, 1/2)
Row 1 of à = [1, 1, 1, 0]
Row 1 of D̄-1Ã = [1/3, 1/3, 1/3, 0]
(D̄-1Ã)[1][2] = 1/3 ≈ 0.333

Node 1 has degree 3 in à (neighbors 0, 2, plus self-loop). So each neighbor's feature contributes 1/3 to the average. This is the "mean aggregation" that defines GCN.

Exercise 2.2: One GCN Layer on Node 1 Derive

Using D̄-1Ã from above and X = [[1,0],[0,1],[1,1],[0,0]], W = I (identity), compute the output feature for node 1 after one GCN layer (before ReLU). What is the first component?

h'1 = (D̄-1Ã)[1,:] · X · W. With W = I, this is just the weighted average of neighbor features.

h'1[0]
Show derivation
h'1 = 1/3 × x0 + 1/3 × x1 + 1/3 × x2 + 0 × x3
= 1/3 × [1,0] + 1/3 × [0,1] + 1/3 × [1,1]
= [1/3, 0] + [0, 1/3] + [1/3, 1/3]
= [2/3, 2/3] = [0.667, 0.667]

Node 1 averages its own features [0,1] with neighbors 0's [1,0] and 2's [1,1]. The output [0.667, 0.667] is the centroid of those three feature vectors. After ReLU (which is a no-op since both values are positive), this becomes node 1's new representation.

Exercise 2.3: Full Layer Output Derive

Compute the full output H' = D̄-1ÃX (with W=I, before ReLU) for all 4 nodes. What is h'3[0] (node 3, first feature)?

h'3[0]
Show derivation
h'0 = 1/2 × [1,0] + 1/2 × [0,1] = [0.5, 0.5]
h'1 = 1/3 × [1,0] + 1/3 × [0,1] + 1/3 × [1,1] = [0.667, 0.667]
h'2 = 1/3 × [0,1] + 1/3 × [1,1] + 1/3 × [0,0] = [0.333, 0.667]
h'3 = 1/2 × [1,1] + 1/2 × [0,0] = [0.5, 0.5]

Notice: nodes 0 and 3 have the same output [0.5, 0.5] even though they started with different features! Both are degree-1 nodes at the ends of a path, so they average their own features with their one neighbor's. After one layer, symmetrical positions in the graph get similar representations. This is a preview of the over-smoothing problem we'll study in Chapter 4.

Exercise 2.4: Why Self-Loops? Trace
What goes wrong if we use D-1A instead of D̄-1Ã (no self-loops)?
Show explanation

Without self-loops, D-1A averages ONLY the neighbors' features, throwing away the node's own features entirely. After one layer, node 0 with features [1,0] would get h' = x1 = [0,1] (its only neighbor). Its own identity is completely lost. Adding I to A ensures each node includes itself in the aggregation.

Exercise 2.5: GCN Parameter Count Derive

A 3-layer GCN with dimensions 32 → 64 → 64 → 16 (input features → hidden → hidden → output classes). Each layer has W(l) (no bias). How many learnable parameters total?

parameters
Show derivation
W(0): [32, 64] = 2,048 params
W(1): [64, 64] = 4,096 params
W(2): [64, 16] = 1,024 params
Total = 2,048 + 4,096 + 1,024 = 7,168

Notice: the parameter count does NOT depend on the number of nodes or edges! The same W is applied to every node. This is why GNNs generalize across graphs of different sizes — unlike MLPs, which would need N × din parameters in the first layer alone.

Exercise 2.6: Find the Bug Debug

This GCN forward pass has a bug. Node features are exploding after a few layers. Click the buggy line.

function gcnLayer(A, X, W) {
  const n = A.length;
  // Add self-loops
  const At = A.map((row,i) => row.map((v,j) => i===j ? v+1 : v));
  // Compute degree
  const deg = At.map(row => row.reduce((s,v) => s+v, 0));
  // Multiply At * X (no normalization!)
  const AX = matmul(At, X);
  // Project and apply ReLU
  return relu(matmul(AX, W));
}
Show explanation

Line 7 is the bug. The code computes ÃX but never normalizes by D̄-1. It computes the degree on line 6 but never uses it! Without D̄-1 normalization, each node sums its neighbors' features instead of averaging. High-degree nodes accumulate huge values, causing features to explode exponentially with each layer.

The fix: divide each row of AX by the corresponding degree. AX[i] = AX[i].map(v => v / deg[i]).

Chapter 3: GCN vs GraphSAGE vs GAT

Three GNN architectures dominate practice. They all follow the message passing framework but differ in HOW they aggregate neighbor information. Let's trace one layer of each on the same graph to see the difference.

Same 4-node graph: 0 — 1 — 2 — 3
X = [[1,0],[0,1],[1,1],[0,0]], focusing on node 1 (neighbors: {0, 2})

GCN: h'v = σ( W · MEAN({xu : u ∈ N(v) ∪ {v}}) )
GraphSAGE: h'v = σ( W · CONCAT(xv, AGG({xu : u ∈ N(v)})) )
GAT: h'v = σ( ∑u ∈ N(v) ∪ {v} αvu · W · xu )

GAT attention: αvu = softmaxu( LeakyReLU( aT [Wxv || Wxu] ) )
Three flavors of aggregation. GCN: fixed mean (all neighbors equal). GraphSAGE: separates self from neighbors, supports sampling. GAT: learned attention weights (some neighbors matter more). The choice depends on your graph — homogeneous graphs favor GCN, heterogeneous favor GAT.
Exercise 3.1: GCN for Node 2 Derive

Compute GCN output for node 2 (with self-loop). N(2) ∪ {2} = {1, 2, 3}. Features: x1=[0,1], x2=[1,1], x3=[0,0]. W = I. What is h'2[1] (second component, before ReLU)?

h'2[1]
Show derivation
h'2 = MEAN(x1, x2, x3) = ([0,1] + [1,1] + [0,0]) / 3
= [1/3, 2/3] = [0.333, 0.667]

With W = I, GCN simply averages the features. h'2[1] = 0.667. The mean operation treats all neighbors equally — node 3 with its "empty" features [0,0] dilutes the average just as much as informative node 1.

Exercise 3.2: GraphSAGE for Node 2 Derive

GraphSAGE concatenates the node's own features with the aggregated neighbor features. For node 2, neighbors = {1, 3}. Using mean aggregation and W = I (applied after concatenation to [dself + dneighbors] → dout). What is the concatenated vector before W is applied?

Show derivation
Self features: x2 = [1, 1]
Neighbor aggregate: MEAN(x1, x3) = MEAN([0,1], [0,0]) = [0, 0.5]
CONCAT(x2, AGG) = [1, 1, 0, 0.5]

The key difference from GCN: the node's OWN features are kept separate from the neighbors' aggregate. This means the weight matrix W can learn different transformations for self-information vs. neighborhood information. The concatenated vector has dimension 2d (4 in this case), and W maps it back to dout.

Exercise 3.3: GAT Attention Weights Derive

In GAT, attention for node 2 over neighbors {1, 2(self), 3} with W = I: attention score evu = aT[Wxv || Wxu]. Let a = [0.2, 0.3, -0.1, 0.4] (length 2d=4). Compute e2,1 (attention score from node 2 to node 1, before softmax).

e2,1
Show derivation
Wx2 = x2 = [1, 1] (since W = I)
Wx1 = x1 = [0, 1]
[Wx2 || Wx1] = [1, 1, 0, 1]
e2,1 = aT · [1, 1, 0, 1] = 0.2×1 + 0.3×1 + (-0.1)×0 + 0.4×1 = 0.9

Wait — let me recompute carefully:

0.2 × 1 + 0.3 × 1 + (-0.1) × 0 + 0.4 × 1 = 0.2 + 0.3 + 0 + 0.4 = 0.9

Hmm, I stated 0.4 as the answer. Let me re-examine the setup. With a = [0.2, 0.3, -0.1, 0.4]:

e2,1 = 0.2(1) + 0.3(1) + (-0.1)(0) + 0.4(1) = 0.9

The raw attention score is 0.9. After LeakyReLU (positive value, so unchanged) and softmax over all neighbors, this becomes a normalized weight. The key insight: GAT learns a to assign different importance to different neighbors based on their features.

Exercise 3.4: Compare Aggregations Trace
A social network has a celebrity node with 10,000 followers and a regular user with 5 friends. Which aggregation handles this degree imbalance best?
Show reasoning

GraphSAGE's neighbor sampling is the key innovation for scalability. Instead of aggregating all 10,000 neighbors, it samples a fixed k (typically 10-25). This makes computation cost per node O(k) regardless of degree. GCN's mean aggregation still requires touching all neighbors, and GAT needs to compute attention scores for all neighbors. In practice, mini-batch training with sampling (as in GraphSAGE and later DGL/PyG implementations) is essential for large graphs.

Exercise 3.5: GAT Multi-Head Derive

A GAT layer uses K=8 attention heads, each with dout=8. Intermediate heads are concatenated, final head is averaged. How many output features after: (a) an intermediate layer? (b) the final layer?

intermediate output dim
Show derivation
Intermediate: CONCAT(head1, ..., head8) → 8 × 8 = 64 features
Final: AVG(head1, ..., head8) → 8 features

The GAT paper uses concatenation for hidden layers (to preserve expressiveness) but averaging for the output layer (to produce a fixed-size output for classification). This is analogous to multi-head attention in transformers, where heads are concatenated then projected.

Chapter 4: Over-Smoothing

You just built a 10-layer GCN and... it performs worse than a 2-layer one. What happened? Each GCN layer averages a node's features with its neighbors. After k layers, each node's representation is influenced by all nodes within k hops. On small-diameter graphs, a few layers are enough to mix ALL nodes' features together, making every representation converge to the same vector. This is over-smoothing.

After k layers of GCN (simplified, no nonlinearity):
H(k) = (D̄-1Ã)k X Weff

The matrix (D̄-1Ã)k is a k-step random walk matrix.
As k → ∞, each row converges to the stationary distribution:
πi = deg(i) / 2|E| for undirected graphs.

Result: all rows become identical → all node features identical → useless.
More layers ≠ better. Unlike transformers where 96 layers is common, GNNs typically use 2-4 layers. The graph diameter determines when over-smoothing kicks in. For a graph with diameter d, a d-layer GNN already connects every node to every other node.
Exercise 4.1: Path Graph Diameter Derive

For a path graph P5 (5 nodes in a line: 0-1-2-3-4), what is the diameter? (Diameter = longest shortest path between any two nodes.)

diameter
Show derivation
Shortest path from 0 to 4 = 4 edges (0→1→2→3→4)
This is the maximum over all pairs → diameter = 4

On this graph, a 4-layer GCN lets every node see every other node. After that, additional layers only smooth further. For real-world social networks with diameter ~6 (six degrees of separation), even a 3-layer GNN covers most of the graph.

Exercise 4.2: Laplacian of P3 Derive

The graph Laplacian is L = D - A. For the path graph P3 (nodes 0-1-2): compute L and find its eigenvalues. (Hint: L is a 3×3 symmetric matrix.)

The smallest eigenvalue of any graph Laplacian is always 0 (eigenvector = all-ones). The second-smallest eigenvalue λ2 is the algebraic connectivity (Fiedler value).

λ2 (second eigenvalue)
Show derivation
P3: A = [[0,1,0],[1,0,1],[0,1,0]], D = diag(1,2,1)
L = D - A = [[1,-1,0],[-1,2,-1],[0,-1,1]]
det(L - λI) = 0
(1-λ)[(2-λ)(1-λ) - 1] - (-1)[(-1)(1-λ)] = 0
(1-λ)(2 - 3λ + λ² - 1) + (-(1-λ)) = 0
(1-λ)(λ² - 3λ + 1) - (1-λ) = 0
(1-λ)(λ² - 3λ) = 0
λ(1-λ)(λ - 3) = 0
Eigenvalues: λ = 0, 1, 3

λ2 = 1. This measures how well-connected the graph is. For the complete graph K3, λ2 = 3 (maximally connected). For a disconnected graph, λ2 = 0. Higher λ2 means faster mixing, which means faster over-smoothing in GNNs.

Exercise 4.3: Smoothing Rate Trace
You run a GCN on two graphs: (A) a star graph (one center connected to all others) with diameter 2, and (B) a chain graph with diameter 20. Which over-smooths faster?
Show reasoning

The star graph has diameter 2, so after 2 GCN layers every node's receptive field covers the entire graph. The chain has diameter 20, so you'd need 20 layers to fully connect it. Over-smoothing is directly tied to the spectral gap (λ2) of the normalized Laplacian — larger gap = faster convergence to the stationary distribution = faster over-smoothing. The star graph has a large spectral gap.

Exercise 4.4: Residual Connections Trace
Which technique does NOT help mitigate over-smoothing?
Show reasoning

Increasing width does NOT help with over-smoothing. The problem is that (D̄-1Ã)k converges to rank-1 regardless of feature dimension. Residual connections preserve the original signal. JK-Net accesses early-layer representations that haven't been smoothed yet. DropEdge slows the mixing by removing random edges each epoch. All three are proven strategies; wider layers are not.

Exercise 4.5: MADGap Metric Derive

The MAD (Mean Average Distance) between node representations measures over-smoothing. After 1 layer on our 4-node path graph, H' = [[0.5,0.5],[0.667,0.667],[0.333,0.667],[0.5,0.5]]. Compute the MAD: average pairwise L2 distance between all node representations.

There are C(4,2) = 6 pairs. Compute L2 distance for each, then average.

MAD
Show derivation
d(0,1) = √((0.667-0.5)² + (0.667-0.5)²) = √(0.028 + 0.028) = 0.236
d(0,2) = √((0.333-0.5)² + (0.667-0.5)²) = √(0.028 + 0.028) = 0.236
d(0,3) = √((0.5-0.5)² + (0.5-0.5)²) = 0
d(1,2) = √((0.333-0.667)² + (0.667-0.667)²) = 0.333
d(1,3) = √((0.5-0.667)² + (0.5-0.667)²) = 0.236
d(2,3) = √((0.5-0.333)² + (0.5-0.667)²) = 0.236
MAD = (0.236 + 0.236 + 0 + 0.333 + 0.236 + 0.236) / 6 = 0.213

The MAD after the initial features X is much higher. As you add layers, MAD decreases toward 0 — that's over-smoothing quantified. When MAD ≈ 0, all node representations are essentially identical and classification becomes impossible.

Chapter 5: Spectral Graph Theory

GCNs are actually an approximation of spectral graph convolution. To understand why, you need the graph Laplacian and its eigenvectors — the "Fourier basis" of the graph.

Graph Laplacian: L = D - A
Normalized Laplacian: Lnorm = I - D-1/2 A D-1/2
Eigendecomposition: L = U Λ UT

Spectral convolution:
gθ * x = U gθ(Λ) UT x

Chebyshev approximation (ChebNet):
gθ * x ≈ ∑k=0K θk Tk(L̃) x
where L̃ = 2L/λmax - I, and Tk are Chebyshev polynomials.

GCN simplification (K=1):
gθ * x ≈ θ(I + D-1/2 A D-1/2) x
GCN IS a first-order Chebyshev filter. When Kipf & Welling set K=1 and approximate λmax ≈ 2, the spectral convolution simplifies to the familiar D̄-1/2ÃD̄-1/2 formula. Every GCN layer is literally a low-pass filter on the graph spectrum — it smooths out high-frequency signals (differences between neighbors), which explains over-smoothing.
Exercise 5.1: Laplacian of K3 Derive

Compute the graph Laplacian L = D - A for the complete graph K3 (3 nodes, every pair connected). Then find all three eigenvalues.

λmax
Show derivation
K3: A = [[0,1,1],[1,0,1],[1,1,0]], D = diag(2,2,2)
L = D - A = [[2,-1,-1],[-1,2,-1],[-1,-1,2]]
det(L - λI) = (2-λ)[(2-λ)² - 1] + 1[-(2-λ) - 1] + (-1)[1 + (2-λ)]

By symmetry of Kn, the Laplacian eigenvalues are known: λ = 0 (multiplicity 1) and λ = n (multiplicity n-1).

For K3: λ = 0, 3, 3

The eigenvector for λ=0 is always [1,1,...,1]/√n (constant signal). The eigenvectors for λ=3 span the space orthogonal to this — they represent "high frequency" signals where adjacent nodes differ. A GCN (low-pass filter) suppresses these λ=3 components, smoothing towards the constant eigenvector.

Exercise 5.2: Normalized Laplacian Derive

For K3, compute Lnorm = I - D-1/2AD-1/2. What is Lnorm[0][1]?

D-1/2 = diag(1/√2, 1/√2, 1/√2) for K3.

Lnorm[0][1]
Show derivation
D-1/2 A D-1/2[0][1] = (1/√2) × A[0][1] × (1/√2) = 1 × 1/2 = 0.5
Lnorm[0][1] = I[0][1] - 0.5 = 0 - 0.5 = -0.5

The normalized Laplacian has eigenvalues in [0, 2]. For K3: λ = 0, 3/2, 3/2. The normalization rescales the spectrum into a fixed range, which is why Kipf & Welling approximate λmax ≈ 2 for the Chebyshev polynomial.

Exercise 5.3: Chebyshev Polynomials Derive

Chebyshev polynomials: T0(x) = 1, T1(x) = x, Tk(x) = 2xTk-1(x) - Tk-2(x). Compute T2(x) and T3(x).

What is T3(x)?
Show derivation
T0(x) = 1
T1(x) = x
T2(x) = 2x · T1(x) - T0(x) = 2x² - 1
T3(x) = 2x · T2(x) - T1(x) = 2x(2x² - 1) - x = 4x³ - 2x - x = 4x³ - 3x

In ChebNet, K=3 means the filter uses T0(L̃), T1(L̃), T2(L̃), T3(L̃) with learnable coefficients θ0,...,θ3. Each Tk(L̃)x involves exactly k hops of neighborhood aggregation, so K controls the receptive field size. GCN (K=1) only uses T0 and T1.

Exercise 5.4: Spectral Complexity Trace
Full spectral convolution requires eigendecomposing L, which costs O(N³). How does ChebNet avoid this?
Show reasoning

The Chebyshev recurrence Tk(L)x = 2L · Tk-1(L)x - Tk-2(L)x only needs matrix-vector multiplies with L. Since L is sparse (|E| nonzeros), each multiply is O(|E|). K iterations cost O(K|E|), which is linear in graph size. No eigendecomposition needed. This is the fundamental insight that made spectral GNNs practical.

Exercise 5.5: GCN as Low-Pass Filter Derive

The GCN filter in the spectral domain is g(λ) = 1 - λ (for the normalized Laplacian). For eigenvalues λ = 0 and λ = 1.5 of K3, what are the filter responses g(0) and g(1.5)? Which component is amplified?

g(1.5)
Show derivation
g(0) = 1 - 0 = 1   // low-frequency: fully preserved
g(1.5) = 1 - 1.5 = -0.5   // high-frequency: attenuated (and flipped)

The low-frequency component (λ = 0, the "constant" signal) has gain 1 — perfectly preserved. The high-frequency component (λ = 1.5) has gain -0.5 — attenuated by half. After k layers, this becomes (-0.5)k, which rapidly goes to 0. This IS over-smoothing, viewed through the spectral lens: high-frequency information (node differences) is exponentially suppressed.

Chapter 6: Knowledge Graphs

Knowledge graphs encode facts as (head, relation, tail) triples — (Einstein, born_in, Germany). The challenge: embed entities and relations into continuous vector spaces so that the geometric structure reflects semantic relationships. Two classic approaches:

TransE: h + r ≈ t
Score: f(h, r, t) = -||h + r - t||
Higher score (less negative) = more plausible triple.

DistMult: Score = ∑i hi · ri · ti
Symmetric: score(h, r, t) = score(t, r, h)

ComplEx: Re(∑i hi · ri · t̄i)
Complex-valued embeddings. Handles asymmetric relations.
TransE = translation in embedding space. "born_in" is a vector r. Adding r to Einstein's embedding should land near Germany's embedding. If it does, the fact is plausible. Training pushes h+r close to t for real triples and away for fake ones. The beauty: similar relations get similar vectors, so the model can generalize to unseen triples.
Exercise 6.1: TransE Score Derive

Given embeddings: h = [1, 2, 0], r = [1, -1, 1], t = [2, 1, 1]. Compute the TransE score f(h, r, t) = -||h + r - t||2.

score
Show derivation
h + r = [1+1, 2+(-1), 0+1] = [2, 1, 1]
h + r - t = [2-2, 1-1, 1-1] = [0, 0, 0]
||h + r - t||2 = 0
f(h, r, t) = -0 = 0

A perfect score of 0 means the translation h + r lands exactly on t. In practice, scores are slightly negative due to imperfect embeddings. The loss function (margin-based) pushes correct triples toward 0 and incorrect ones below -γ (the margin).

Exercise 6.2: DistMult Score Derive

Same embeddings: h = [1, 2, 0], r = [1, -1, 1], t = [2, 1, 1]. Compute the DistMult score = ∑i hi · ri · ti.

score
Show derivation
i=0: 1 × 1 × 2 = 2
i=1: 2 × (-1) × 1 = -2
i=2: 0 × 1 × 1 = 0
Score = 2 + (-2) + 0 = 0

Both TransE and DistMult give 0 for this triple, but for different geometric reasons. TransE measures translational distance; DistMult measures multiplicative interaction. Note that DistMult(h, r, t) = DistMult(t, r, h) — it's symmetric. This means it can't distinguish "born_in" from "birthplace_of." TransE can, because h + r = t but t + r ≠ h.

Exercise 6.3: TransE Limitation Trace
TransE models h + r ≈ t. What type of relation can it NOT model well?
Show reasoning

If (Einstein, born_in, Germany) and (Hilbert, born_in, Germany) are both true, then hEinstein + r ≈ tGermany and hHilbert + r ≈ tGermany. This forces hEinstein ≈ hHilbert. For 1-to-N relations, all head entities sharing the same (relation, tail) get collapsed to the same point. Symmetric relations are also problematic: h + r = t and t + r = h forces r = 0. RotatE and ComplEx address these issues using rotations/complex numbers.

Exercise 6.4: Negative Sampling Derive

In TransE training with margin γ = 1, you have positive triple (h, r, t) with score -0.3 and negative triple (h, r, t') with score -1.5. Compute the margin loss: max(0, γ + fpos - fneg). Should the model update?

loss
Show derivation
Loss = max(0, 1 + (-0.3) - (-1.5))
= max(0, 1 - 0.3 + 1.5) = max(0, 2.2)

Wait — the TransE score for the positive should be HIGHER (closer to 0) than the negative. Let me reconsider the formula. The standard margin loss is:

L = max(0, γ + dpos - dneg)

where d = ||h + r - t|| (distance, not score). So dpos = 0.3, dneg = 1.5:

L = max(0, 1 + 0.3 - 1.5) = max(0, -0.2) = 0

Actually, the answer depends on whether we use score or distance. Using the negative score formulation where we want scorepos > scoreneg + γ:

L = max(0, γ - scorepos + scoreneg)
= max(0, 1 - (-0.3) + (-1.5)) = max(0, 1 + 0.3 - 1.5) = max(0, -0.2) = 0

Loss = 0. No update needed. The positive triple already scores higher than the negative by more than the margin. The model has successfully separated this pair.

Exercise 6.5: KG Embedding Parameters Derive

A knowledge graph has 15,000 entities, 237 relations, and you use TransE with embedding dimension d = 200. How many parameters total?

million params
Show derivation
Entity embeddings: 15,000 × 200 = 3,000,000
Relation embeddings: 237 × 200 = 47,400
Total = 3,047,400 ≈ 3.05M

The entity embeddings dominate. This is the FB15k-237 benchmark scale. Larger KGs like Wikidata (100M+ entities) at d=200 would need 20B+ parameters just for entity embeddings — this is why KG embedding at web scale requires careful engineering (e.g., PBG by Facebook uses distributed training).

Chapter 7: Link Prediction

Given a graph with some edges, predict which edges are missing. This is the bread-and-butter task of graph ML — friend recommendation, drug-target interaction, paper citation prediction. We'll cover both classical heuristics and GNN-based approaches.

Classical heuristics for nodes u, v:
Common Neighbors: CN(u,v) = |N(u) ∩ N(v)|
Jaccard: J(u,v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|
Adamic-Adar: AA(u,v) = ∑w ∈ N(u) ∩ N(v) 1 / log(deg(w))

GNN-based:
score(u,v) = zuT · zv   // dot product of learned embeddings
Adamic-Adar penalizes popular common neighbors. A common neighbor with degree 1000 (a celebrity) is less informative than one with degree 5 (a niche interest). The 1/log(deg) weighting captures this — rare shared neighbors matter more. This simple heuristic is surprisingly hard for GNNs to beat on many benchmarks.

Using our 5-node graph: edges (0,1), (0,2), (1,2), (1,3), (2,3), (3,4).

Exercise 7.1: Common Neighbors Derive

Compute CN(0, 3) — the number of common neighbors of node 0 and node 3.

common neighbors
Show derivation
N(0) = {1, 2}
N(3) = {1, 2, 4}
N(0) ∩ N(3) = {1, 2}
CN(0, 3) = 2

Two common neighbors suggest a likely edge between 0 and 3. In social networks, this is "friends of friends" — the more mutual friends two people have, the more likely they should be connected.

Exercise 7.2: Jaccard Coefficient Derive

Compute J(0, 3) = |N(0) ∩ N(3)| / |N(0) ∪ N(3)|.

Jaccard
Show derivation
N(0) ∪ N(3) = {1, 2, 4}
J(0, 3) = |{1, 2}| / |{1, 2, 4}| = 2/3 ≈ 0.667

Jaccard normalizes by the union. This prevents high-degree nodes from dominating: CN(celebrity, anyone) is large simply because the celebrity has many neighbors. Jaccard says "what fraction of their combined connections do they share?" — a more meaningful similarity measure.

Exercise 7.3: Adamic-Adar Derive

Compute AA(0, 3). Common neighbors are {1, 2} with degrees deg(1) = 3, deg(2) = 3.

Adamic-Adar
Show derivation
AA(0, 3) = 1/log(deg(1)) + 1/log(deg(2))
= 1/log(3) + 1/log(3)
= 1/1.099 + 1/1.099 = 0.910 + 0.910 = 1.82

If the common neighbors had degree 100 instead: AA = 2/log(100) = 2/4.605 = 0.434 — much lower. High-degree common neighbors contribute less. Adamic-Adar consistently outperforms CN and Jaccard on real-world link prediction benchmarks.

Exercise 7.4: GNN Link Score Derive

After running a GNN, node 0 has embedding z0 = [0.5, 0.5, 0.3] and node 3 has z3 = [0.4, 0.6, 0.2]. Compute the link prediction score using dot product.

score
Show derivation
score = z0 · z3 = 0.5×0.4 + 0.5×0.6 + 0.3×0.2
= 0.20 + 0.30 + 0.06 = 0.56

The dot product measures embedding similarity. Higher score = more likely edge. To convert to a probability, pass through sigmoid: σ(0.56) = 0.636. Training uses binary cross-entropy: existing edges are positive examples, sampled non-edges are negatives.

Exercise 7.5: AUC Computation Derive

You have 3 positive pairs (existing edges) with scores [0.8, 0.6, 0.5] and 3 negative pairs (non-edges) with scores [0.4, 0.3, 0.7]. AUC = fraction of (positive, negative) pairs where the positive has a higher score. Compute AUC.

AUC
Show derivation
Total pairs: 3 × 3 = 9
Pos 0.8 vs Neg [0.4, 0.3, 0.7]: wins 3/3
Pos 0.6 vs Neg [0.4, 0.3, 0.7]: wins 2/3 (loses to 0.7)
Pos 0.5 vs Neg [0.4, 0.3, 0.7]: wins 2/3 (loses to 0.7)
AUC = (3 + 2 + 2) / 9 = 7/9 ≈ 0.778

The negative pair with score 0.7 is causing problems — it's ranked higher than two positive pairs. AUC = 0.5 is random; AUC = 1.0 is perfect separation. 0.778 is decent but shows the model is confusing some non-edges for edges. State-of-the-art link prediction AUCs on citation networks are ~0.95+.

Exercise 7.6: Find the Bug Debug

This link prediction evaluation code has a data leakage bug. Click the problematic line.

function evaluate(graph, model) {
  // Split edges into train/test
  const [trainEdges, testEdges] = splitEdges(graph, 0.8);
  // Train GNN on full graph (all edges)
  model.train(graph);
  // Get embeddings
  const embeddings = model.embed(graph);
  // Evaluate on test edges
  return computeAUC(embeddings, testEdges);
}
Show explanation

Lines 4-5 (specifically line 4) are the bug. The model is trained on the FULL graph including test edges, then evaluated on those same test edges. This is data leakage — the model has already seen the "test" edges during training via the message passing (they're in the adjacency matrix). The fix: train the GNN only on the subgraph with train edges, removing test edges from the adjacency matrix during training.

Chapter 8: Graph Generation

Generating new graphs that look like examples from a training set — molecular graphs, circuit layouts, social network structures. The key challenge: graphs have no canonical ordering of nodes. The same graph can be represented by N! different adjacency matrices (one per node permutation).

GraphRNN: Autoregressive generation.
1. Choose a node ordering (e.g., BFS from random start)
2. Generate one ROW of the adjacency matrix at a time
3. Each row: use an RNN to predict edges to all previous nodes

For node π(i), predict: A[i][0], A[i][1], ..., A[i][i-1]
// only lower triangle — upper is symmetric

Node ordering matters!
BFS ordering groups nearby nodes together → short-range dependencies.
Random ordering creates long-range dependencies → harder to learn.
N! is the enemy. For n=4 nodes, there are 4! = 24 possible orderings of the same graph. For n=20, it's 20! ≈ 2.4 × 1018. GraphRNN uses BFS ordering to reduce the effective space, but the permutation invariance problem is fundamental. This is why graph generation is much harder than image or text generation.
Exercise 8.1: Node Orderings Derive

How many distinct node orderings (permutations) exist for a graph with n = 4 nodes?

orderings
Show derivation
4! = 4 × 3 × 2 × 1 = 24

Each permutation gives a different adjacency matrix representation of the same graph. If a graph has automorphisms (symmetries), some permutations give identical adjacency matrices, so the number of distinct matrices is 4! / |Aut(G)|. For a graph with no symmetry, all 24 matrices are different.

Exercise 8.2: GraphRNN Sequence Length Derive

In GraphRNN, node i needs to predict edges to nodes 0, 1, ..., i-1. For a graph with n=5 nodes, how many total binary decisions (edge/no-edge) must be made to generate the full adjacency matrix?

decisions
Show derivation
Node 0: 0 decisions (no previous nodes)
Node 1: 1 decision (edge to node 0?)
Node 2: 2 decisions (edges to 0, 1?)
Node 3: 3 decisions
Node 4: 4 decisions
Total = 0 + 1 + 2 + 3 + 4 = 10 = n(n-1)/2 = 5×4/2 = 10

This is exactly the number of entries in the lower triangle of the adjacency matrix — one per possible edge. The sequence length grows quadratically. For n = 100, that's 4,950 decisions. GraphRNN uses a "BFS edge ordering" trick to reduce the effective sequence length by only predicting edges to recent BFS neighbors.

Exercise 8.3: Maximum Edges Derive

What is the maximum number of edges in a simple undirected graph with n = 10 nodes? (No self-loops, no multi-edges.)

max edges
Show derivation
C(n, 2) = n(n-1)/2 = 10 × 9 / 2 = 45

This is the complete graph K10. Real molecular graphs are much sparser (each atom bonds to 1-4 others), so the edge density is typically |E| / C(n,2) < 0.1. Graph generation models exploit this sparsity — most edge predictions should be "no edge."

Exercise 8.4: BFS Ordering Benefit Trace
Why does BFS ordering help GraphRNN compared to random ordering?
Show reasoning

BFS visits nodes layer by layer from a starting node. Node i in BFS order is typically connected only to nodes within the current and previous BFS layer — a narrow window of size M. Instead of predicting i-1 edge decisions, node i only predicts M decisions (M is often 3-5 for sparse graphs). This reduces the effective sequence length from O(n²) to O(nM), making generation tractable for larger graphs.

Exercise 8.5: Graph Isomorphism Trace
Two generated graphs G and G' have identical structure but different node labels. When evaluating a graph generative model, should they count as the same or different?
Show reasoning

Graph generation quality is measured by comparing DISTRIBUTIONS of graph statistics (degree distribution, clustering coefficient distribution, orbit counts) between generated and real graphs, using metrics like MMD (Maximum Mean Discrepancy). These statistics are permutation-invariant — they don't depend on which node is labeled 0, 1, 2, etc. Two isomorphic graphs contribute identically to all these metrics.

Chapter 9: Capstone

You're designing a GNN for molecular property prediction — the quintessential GNN application. Given a molecule (atoms = nodes, bonds = edges, atom types and bond types as features), predict a property like solubility or toxicity. Every design choice matters: aggregation, depth, readout, expressiveness.

Everything comes together. This chapter tests whether you can synthesize all 9 previous chapters into a working design. You'll choose architectures, count parameters, analyze expressiveness, and debug a pipeline. This is what a GNN practitioner does daily.
Exercise 9.1: Molecule as Graph Trace
Ethanol (C2H5OH) has 9 atoms. As a graph with only heavy atoms (C, C, O) and their bonds: how many nodes and edges?
Show reasoning

Ethanol's heavy-atom skeleton: C—C—O, a simple path of 3 nodes with 2 edges. In practice, hydrogen atoms are often omitted from the graph (they can be inferred from valence rules), reducing graph size significantly. Node features encode atom type (C=6, O=8), formal charge, aromaticity, etc. Edge features encode bond type (single, double, triple, aromatic).

Exercise 9.2: GNN Design Choices Design

Order these steps to build a molecular property prediction pipeline:

?
?
?
?
?
Readout (sum/mean pool all nodes)
Encode atom features to d-dim
MLP for final prediction
K layers of message passing
Parse SMILES to graph
Show solution

Correct order: SMILES → Encode → GNN Layers → Readout → Predict

1. Parse the SMILES string into a molecular graph (atoms, bonds, features)

2. Encode atom features (atomic number, degree, charge, etc.) into d-dimensional vectors

3. Run K layers of GNN message passing to propagate information

4. Readout: pool all node embeddings into a single graph-level vector

5. MLP head maps graph vector to the predicted property (scalar)

Exercise 9.3: Parameter Count Derive

Your molecular GNN: atom embedding (119 atom types, d=64), 3 GCN layers (64→128→128→64), readout (mean pool), MLP head (64→32→1). All layers have biases. Total parameters?

total params
Show derivation
Atom embedding: 119 × 64 = 7,616
GCN layer 1: W[64,128] + b[128] = 8,192 + 128 = 8,320
GCN layer 2: W[128,128] + b[128] = 16,384 + 128 = 16,512
GCN layer 3: W[128,64] + b[64] = 8,192 + 64 = 8,256
Readout: 0 params (just mean pooling)
MLP layer 1: W[64,32] + b[32] = 2,048 + 32 = 2,080
MLP layer 2: W[32,1] + b[1] = 32 + 1 = 33
Total = 7,616 + 8,320 + 16,512 + 8,256 + 2,080 + 33 = 42,817

~43K parameters — tiny by transformer standards! Most GNNs for molecular property prediction are in the 10K-500K parameter range. The inductive bias from graph structure (message passing on molecular topology) compensates for the small parameter budget. Larger is not always better — GNNs are often data-limited (thousands of molecules, not billions of tokens).

Exercise 9.4: WL Test Expressiveness Trace
The Weisfeiler-Leman (WL) graph isomorphism test is the upper bound on GNN expressiveness. A GNN can distinguish two graphs if and only if the WL test can. Which GNN architecture matches WL test power?
Show reasoning

The paper "How Powerful are Graph Neural Networks?" (Xu et al., 2019) proved that GIN with sum aggregation and injective update functions (MLPs) is as powerful as the 1-WL test. Mean and max aggregations lose information: mean can't distinguish {1,1,1} from {1}, and max can't distinguish {1,1,1} from {1,2,1} (if max is the same). Sum is injective on multisets: sum({1,1,1}) = 3 ≠ sum({1}) = 1. GIN exploits this with the update: h(k)v = MLP((1+ε) · h(k-1)v + ∑ h(k-1)u).

Exercise 9.5: Readout Matters Trace
For graph-level prediction (e.g., molecular solubility), you need to aggregate all node embeddings into one graph vector. Which readout is most expressive?
Show reasoning

Sum pooling preserves the most information. Mean pooling loses size: a molecule with 10 carbon atoms and one with 20 carbons (with the same average embedding) would be indistinguishable. Max pooling loses multiplicity. In practice, concatenating sum + mean + max often works best — each captures different information. Set2Set (an LSTM-based readout) is even more expressive but slower.

Exercise 9.6: Implement graphReadout() Build

Given a 2D array of node embeddings (N nodes, d features), implement sum, mean, and max readout. Return an object with all three.

For max: compute element-wise max across all nodes.
Show solution
javascript
function graphReadout(embeddings) {
  const N = embeddings.length, d = embeddings[0].length;
  const sum = Array(d).fill(0);
  const max = Array(d).fill(-Infinity);
  for (let i = 0; i < N; i++) {
    for (let j = 0; j < d; j++) {
      sum[j] += embeddings[i][j];
      if (embeddings[i][j] > max[j]) max[j] = embeddings[i][j];
    }
  }
  const mean = sum.map(v => v / N);
  return { sum, mean, max };
}
Exercise 9.7: Depth vs Width Trace
For molecular graphs (diameter typically 5-10), how many GNN layers should you use?
Show reasoning

3-5 layers is the sweet spot for molecular graphs. With diameter ~5-10, 3-5 layers let every atom "see" every other atom. More layers cause over-smoothing (Chapter 4) without adding information. Empirically, GNN performance on molecular benchmarks peaks at 3-5 layers and degrades beyond that. Compare to transformers: they can use 96+ layers because they have residual connections, layer normalization, and self-attention (not neighborhood averaging).

Exercise 9.8: Find the Bug Debug

This molecular GNN training loop has a subtle bug that causes the model to learn shortcuts instead of molecular structure. Click the buggy line.

function trainStep(molecules, labels, model) {
  const graphs = molecules.map(mol => molToGraph(mol));
  // Sort graphs by size for efficient batching
  graphs.sort((a, b) => a.numNodes - b.numNodes);
  const embeddings = model.forward(graphs);
  const predictions = mlpHead(sumPool(embeddings));
  return mseLoss(predictions, labels);
}
Show explanation

Line 4 is the bug. The code sorts graphs by size but does NOT sort the labels to match. After sorting, graph[i] no longer corresponds to labels[i]. The model learns to correlate molecule size with the target property (a spurious shortcut) instead of learning actual chemical structure. The fix: sort (graph, label) pairs together, or remove the sorting.

The proof of work. If you completed every exercise in this workbook — traced message passing by hand, computed eigenvalues, designed architectures, analyzed expressiveness — you can hold your own in any graph ML discussion. These are the exact computations that GNN researchers and practitioners do daily. "What I cannot create, I do not understand."

Related Lessons

TopicLesson
Transformer fundamentalsTransformer — From Absolute Zero
Transformer math practiceTransformer Math Workbook
Contrastive learningContrastive CLIP — From Absolute Zero
RL on graphsRL Algorithms — From Absolute Zero