Graph Neural Networks Workbook

Chapter 0: Graph Basics

Before touching neural networks, you need to be fluent in the language of graphs. Every GNN algorithm operates on adjacency matrices, degree matrices, and local neighborhood structure. If you cannot construct these by hand for a small graph, you have no hope of understanding what a GNN actually computes.

Consider this 5-node undirected graph:

Edges: (0,1), (0,2), (1,2), (1,3), (2,3), (3,4)

Adjacency matrix A (symmetric for undirected):
0 1 2 3 4
0 [ 0 1 1 0 0 ]
1 [ 1 0 1 1 0 ]
2 [ 1 1 0 1 0 ]
3 [ 0 1 1 0 1 ]
4 [ 0 0 0 1 0 ]

Degree matrix D = diag(deg(0), deg(1), ..., deg(4)):
D = diag(2, 3, 3, 3, 1)

The degree is the row sum. For undirected graphs, deg(v) = ∑_u A[v][u]. The degree matrix D is diagonal with D[i][i] = deg(i). These two matrices — A and D — are the foundation of every GNN computation.

Exercise 0.1: Edge Count Derive

How many edges does the 5-node graph above have? (Hint: count 1s in A, then adjust for symmetry.)

edges

Show derivation

Number of 1s in A = 12

Each undirected edge appears twice (A[i][j] and A[j][i])

Edges = 12 / 2 = 6

Equivalently, |E| = ½ ∑_i deg(i) = ½(2 + 3 + 3 + 3 + 1) = 6. The sum of all degrees always equals twice the number of edges — this is the handshaking lemma.

Exercise 0.2: Average Degree Derive

What is the average degree of the 5-node graph?

avg degree

Show derivation

Degrees: [2, 3, 3, 3, 1]

Average = (2 + 3 + 3 + 3 + 1) / 5 = 12 / 5 = 2.4

Equivalently, average degree = 2|E| / |V| = 2 × 6 / 5 = 2.4. For sparse real-world graphs (social networks, molecules), the average degree is typically O(1) even as |V| grows to millions.

Exercise 0.3: Clustering Coefficient Derive

Compute the local clustering coefficient for node 1. The formula is C(v) = 2 × (edges among neighbors) / (deg(v) × (deg(v) - 1)). Node 1's neighbors are {0, 2, 3}.

Count how many edges exist among {0, 2, 3}, then plug into the formula.

clustering coeff

Show derivation

Neighbors of node 1: {0, 2, 3}

Edges among neighbors: (0,2) ✓, (0,3) ✗, (2,3) ✓ → 2 edges

C(1) = 2 × 2 / (3 × 2) = 4 / 6 = 0.667

The clustering coefficient measures how close a node's neighbors are to forming a clique. C = 1 means all neighbors are connected to each other. C = 0 means no neighbors are connected. Node 1 has a fairly high clustering coefficient — 2 out of 3 possible neighbor-pairs are connected.

Exercise 0.4: A² Interpretation Trace

If you compute A² (adjacency matrix squared), what does entry A²[i][j] represent?

Whether there's an edge between i and j The number of paths of length 2 from node i to node j The shortest distance from i to j The degree of the common neighbor of i and j

Show explanation

A²[i][j] = ∑_k A[i][k] × A[k][j]

Each term A[i][k] × A[k][j] = 1 only if there's an edge i→k AND k→j — a length-2 path through k. The sum counts all such paths. More generally, Aⁿ[i][j] counts paths of length n. This is why matrix powers of A encode multi-hop connectivity — and why GNN layers (which multiply by A) aggregate information from k-hop neighborhoods after k layers.

Exercise 0.5: A²[0][3] Derive

Using the adjacency matrix above, compute A²[0][3]. That is, how many 2-hop paths exist from node 0 to node 3?

paths

Show derivation

A²[0][3] = ∑_k A[0][k] × A[k][3]

k=0: A[0][0]×A[0][3] = 0×0 = 0

k=1: A[0][1]×A[1][3] = 1×1 = 1 // path 0→1→3

k=2: A[0][2]×A[2][3] = 1×1 = 1 // path 0→2→3

k=3: A[0][3]×A[3][3] = 0×0 = 0

k=4: A[0][4]×A[4][3] = 0×1 = 0

A²[0][3] = 0 + 1 + 1 + 0 + 0 = 2

Two 2-hop paths: 0→1→3 and 0→2→3. Node 0 has no direct edge to node 3 (A[0][3] = 0), but after 2 hops, there are 2 ways to get there. This is exactly the information a 2-layer GNN would aggregate.

Exercise 0.6: Implement degreeMatrix() Build

Given an adjacency matrix as a 2D array, return the degree matrix (also 2D, diagonal).

D[i][i] = sum of row i of A. All off-diagonal entries are 0.

Show solution

javascript
function degreeMatrix(A) {
  const n = A.length;
  const D = Array.from({length: n}, () => Array(n).fill(0));
  for (let i = 0; i < n; i++) {
    D[i][i] = A[i].reduce((s, v) => s + v, 0);
  }
  return D;
}

Chapter 2: GNN Message Passing

The fundamental operation in any GNN is message passing: each node collects features from its neighbors, aggregates them, and updates its own representation. One layer of a Graph Convolutional Network (GCN) computes:

GCN layer (simplified):
H^(l+1) = σ(D̄^-1 Ã H^(l) W^(l))

Where:
Ã = A + I // adjacency + self-loops
D̄ = degree matrix of Ã // D̄[i][i] = ∑_j Ã[i][j]
H^(l): [N, d_in] // node features at layer l
W^(l): [d_in, d_out] // learnable weight matrix
σ: nonlinearity (ReLU)

What D̄^-1Ã does: normalizes each row by the degree.
Row i of D̄^-1Ã = [Ã[i][0]/deg(i), Ã[i][1]/deg(i), ...]
So the update for node i is: average of its own + neighbors' features.

Message passing = neighborhood averaging. D^-1A replaces each node's features with the mean of its neighbors' features. Adding self-loops (A + I) ensures the node also keeps its own features in the average. The weight matrix W then projects these averaged features to learn useful representations.

Let's trace this on a concrete 4-node graph:

4-node path graph: 0 — 1 — 2 — 3
A = [[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]]
Ã = A + I = [[1,1,0,0],[1,1,1,0],[0,1,1,1],[0,0,1,1]]
D̄ = diag(2, 3, 3, 2)
X = [[1,0],[0,1],[1,1],[0,0]] // 4 nodes, 2 features each
W = [[1,0],[0,1]] // identity (for clarity)

Exercise 2.1: Normalized Adjacency Derive

Compute D̄^-1Ã for the 4-node path graph above. What is the entry at row 1, column 2? (Row 1 = node 1's update equation.)

(D̄^-1Ã)[1][2]

Show derivation

D̄^-1 = diag(1/2, 1/3, 1/3, 1/2)

Row 1 of Ã = [1, 1, 1, 0]

Row 1 of D̄^-1Ã = [1/3, 1/3, 1/3, 0]

(D̄^-1Ã)[1][2] = 1/3 ≈ 0.333

Node 1 has degree 3 in Ã (neighbors 0, 2, plus self-loop). So each neighbor's feature contributes 1/3 to the average. This is the "mean aggregation" that defines GCN.

Exercise 2.2: One GCN Layer on Node 1 Derive

Using D̄^-1Ã from above and X = [[1,0],[0,1],[1,1],[0,0]], W = I (identity), compute the output feature for node 1 after one GCN layer (before ReLU). What is the first component?

h'₁ = (D̄^-1Ã)[1,:] · X · W. With W = I, this is just the weighted average of neighbor features.

h'₁[0]

Show derivation

h'₁ = 1/3 × x₀ + 1/3 × x₁ + 1/3 × x₂ + 0 × x₃

= 1/3 × [1,0] + 1/3 × [0,1] + 1/3 × [1,1]

= [1/3, 0] + [0, 1/3] + [1/3, 1/3]

= [2/3, 2/3] = [0.667, 0.667]

Node 1 averages its own features [0,1] with neighbors 0's [1,0] and 2's [1,1]. The output [0.667, 0.667] is the centroid of those three feature vectors. After ReLU (which is a no-op since both values are positive), this becomes node 1's new representation.

Exercise 2.3: Full Layer Output Derive

Compute the full output H' = D̄^-1ÃX (with W=I, before ReLU) for all 4 nodes. What is h'₃[0] (node 3, first feature)?

h'₃[0]

Show derivation

h'₀ = 1/2 × [1,0] + 1/2 × [0,1] = [0.5, 0.5]

h'₁ = 1/3 × [1,0] + 1/3 × [0,1] + 1/3 × [1,1] = [0.667, 0.667]

h'₂ = 1/3 × [0,1] + 1/3 × [1,1] + 1/3 × [0,0] = [0.333, 0.667]

h'₃ = 1/2 × [1,1] + 1/2 × [0,0] = [0.5, 0.5]

Notice: nodes 0 and 3 have the same output [0.5, 0.5] even though they started with different features! Both are degree-1 nodes at the ends of a path, so they average their own features with their one neighbor's. After one layer, symmetrical positions in the graph get similar representations. This is a preview of the over-smoothing problem we'll study in Chapter 4.

Exercise 2.4: Why Self-Loops? Trace

What goes wrong if we use D^-1A instead of D̄^-1Ã (no self-loops)?

The computation becomes more expensive The output has the wrong dimension A node's own features are excluded from its update — it only sees neighbors, losing its own information The adjacency matrix is no longer symmetric

Show explanation

Without self-loops, D^-1A averages ONLY the neighbors' features, throwing away the node's own features entirely. After one layer, node 0 with features [1,0] would get h' = x₁ = [0,1] (its only neighbor). Its own identity is completely lost. Adding I to A ensures each node includes itself in the aggregation.

Exercise 2.5: GCN Parameter Count Derive

A 3-layer GCN with dimensions 32 → 64 → 64 → 16 (input features → hidden → hidden → output classes). Each layer has W^(l) (no bias). How many learnable parameters total?

parameters

Show derivation

W⁽⁰⁾: [32, 64] = 2,048 params

W⁽¹⁾: [64, 64] = 4,096 params

W⁽²⁾: [64, 16] = 1,024 params

Total = 2,048 + 4,096 + 1,024 = 7,168

Notice: the parameter count does NOT depend on the number of nodes or edges! The same W is applied to every node. This is why GNNs generalize across graphs of different sizes — unlike MLPs, which would need N × d_in parameters in the first layer alone.

Exercise 2.6: Find the Bug Debug

This GCN forward pass has a bug. Node features are exploding after a few layers. Click the buggy line.

function gcnLayer(A, X, W) {
  const n = A.length;
  // Add self-loops
  const At = A.map((row,i) => row.map((v,j) => i===j ? v+1 : v));
  // Compute degree
  const deg = At.map(row => row.reduce((s,v) => s+v, 0));
  // Multiply At * X (no normalization!)
  const AX = matmul(At, X);
  // Project and apply ReLU
  return relu(matmul(AX, W));
}

Show explanation

Line 7 is the bug. The code computes ÃX but never normalizes by D̄^-1. It computes the degree on line 6 but never uses it! Without D̄^-1 normalization, each node sums its neighbors' features instead of averaging. High-degree nodes accumulate huge values, causing features to explode exponentially with each layer.

The fix: divide each row of AX by the corresponding degree. AX[i] = AX[i].map(v => v / deg[i]).

Chapter 3: GCN vs GraphSAGE vs GAT

Three GNN architectures dominate practice. They all follow the message passing framework but differ in HOW they aggregate neighbor information. Let's trace one layer of each on the same graph to see the difference.

Same 4-node graph: 0 — 1 — 2 — 3
X = [[1,0],[0,1],[1,1],[0,0]], focusing on node 1 (neighbors: {0, 2})

GCN: h'_v = σ( W · MEAN({x_u : u ∈ N(v) ∪ {v}}) )
GraphSAGE: h'_v = σ( W · CONCAT(x_v, AGG({x_u : u ∈ N(v)})) )
GAT: h'_v = σ( ∑_{u ∈ N(v) ∪ {v}} α_vu · W · x_u )

GAT attention: α_vu = softmax_u( LeakyReLU( a^T [Wx_v || Wx_u] ) )

Three flavors of aggregation. GCN: fixed mean (all neighbors equal). GraphSAGE: separates self from neighbors, supports sampling. GAT: learned attention weights (some neighbors matter more). The choice depends on your graph — homogeneous graphs favor GCN, heterogeneous favor GAT.

Exercise 3.1: GCN for Node 2 Derive

Compute GCN output for node 2 (with self-loop). N(2) ∪ {2} = {1, 2, 3}. Features: x₁=[0,1], x₂=[1,1], x₃=[0,0]. W = I. What is h'₂[1] (second component, before ReLU)?

h'₂[1]

Show derivation

h'₂ = MEAN(x₁, x₂, x₃) = ([0,1] + [1,1] + [0,0]) / 3

= [1/3, 2/3] = [0.333, 0.667]

With W = I, GCN simply averages the features. h'₂[1] = 0.667. The mean operation treats all neighbors equally — node 3 with its "empty" features [0,0] dilutes the average just as much as informative node 1.

Exercise 3.2: GraphSAGE for Node 2 Derive

GraphSAGE concatenates the node's own features with the aggregated neighbor features. For node 2, neighbors = {1, 3}. Using mean aggregation and W = I (applied after concatenation to [d_self + d_neighbors] → d_out). What is the concatenated vector before W is applied?

[0.333, 0.667] — mean of all including self [1, 1, 0, 0.5] — concat(x₂, mean(x₁, x₃)) [0, 0.5, 1, 1] — concat(mean(x₁, x₃), x₂) [1, 1] — just the node's own features

Show derivation

Self features: x₂ = [1, 1]

Neighbor aggregate: MEAN(x₁, x₃) = MEAN([0,1], [0,0]) = [0, 0.5]

CONCAT(x₂, AGG) = [1, 1, 0, 0.5]

The key difference from GCN: the node's OWN features are kept separate from the neighbors' aggregate. This means the weight matrix W can learn different transformations for self-information vs. neighborhood information. The concatenated vector has dimension 2d (4 in this case), and W maps it back to d_out.

Exercise 3.3: GAT Attention Weights Derive

In GAT, attention for node 2 over neighbors {1, 2(self), 3} with W = I: attention score e_vu = a^T[Wx_v || Wx_u]. Let a = [0.2, 0.3, -0.1, 0.4] (length 2d=4). Compute e_2,1 (attention score from node 2 to node 1, before softmax).

e_2,1

Show derivation

Wx₂ = x₂ = [1, 1] (since W = I)

Wx₁ = x₁ = [0, 1]

[Wx₂ || Wx₁] = [1, 1, 0, 1]

e_2,1 = a^T · [1, 1, 0, 1] = 0.2×1 + 0.3×1 + (-0.1)×0 + 0.4×1 = 0.9

Wait — let me recompute carefully:

0.2 × 1 + 0.3 × 1 + (-0.1) × 0 + 0.4 × 1 = 0.2 + 0.3 + 0 + 0.4 = 0.9

Hmm, I stated 0.4 as the answer. Let me re-examine the setup. With a = [0.2, 0.3, -0.1, 0.4]:

e_2,1 = 0.2(1) + 0.3(1) + (-0.1)(0) + 0.4(1) = 0.9

The raw attention score is 0.9. After LeakyReLU (positive value, so unchanged) and softmax over all neighbors, this becomes a normalized weight. The key insight: GAT learns a to assign different importance to different neighbors based on their features.

Exercise 3.4: Compare Aggregations Trace

A social network has a celebrity node with 10,000 followers and a regular user with 5 friends. Which aggregation handles this degree imbalance best?

GCN (mean) — all methods handle it the same way GCN (sum) — high-degree nodes should accumulate more information GraphSAGE with sampling — samples a fixed number of neighbors (e.g., 25), so the celebrity's computation is bounded GAT — attention will automatically zero out most neighbors

Show reasoning

GraphSAGE's neighbor sampling is the key innovation for scalability. Instead of aggregating all 10,000 neighbors, it samples a fixed k (typically 10-25). This makes computation cost per node O(k) regardless of degree. GCN's mean aggregation still requires touching all neighbors, and GAT needs to compute attention scores for all neighbors. In practice, mini-batch training with sampling (as in GraphSAGE and later DGL/PyG implementations) is essential for large graphs.

Exercise 3.5: GAT Multi-Head Derive

A GAT layer uses K=8 attention heads, each with d_out=8. Intermediate heads are concatenated, final head is averaged. How many output features after: (a) an intermediate layer? (b) the final layer?

intermediate output dim

Show derivation

Intermediate: CONCAT(head₁, ..., head₈) → 8 × 8 = 64 features

Final: AVG(head₁, ..., head₈) → 8 features

The GAT paper uses concatenation for hidden layers (to preserve expressiveness) but averaging for the output layer (to produce a fixed-size output for classification). This is analogous to multi-head attention in transformers, where heads are concatenated then projected.

Chapter 4: Over-Smoothing

You just built a 10-layer GCN and... it performs worse than a 2-layer one. What happened? Each GCN layer averages a node's features with its neighbors. After k layers, each node's representation is influenced by all nodes within k hops. On small-diameter graphs, a few layers are enough to mix ALL nodes' features together, making every representation converge to the same vector. This is over-smoothing.

After k layers of GCN (simplified, no nonlinearity):
H^(k) = (D̄^-1Ã)^k X W_eff

The matrix (D̄^-1Ã)^k is a k-step random walk matrix.
As k → ∞, each row converges to the stationary distribution:
π_i = deg(i) / 2|E| for undirected graphs.

Result: all rows become identical → all node features identical → useless.

More layers ≠ better. Unlike transformers where 96 layers is common, GNNs typically use 2-4 layers. The graph diameter determines when over-smoothing kicks in. For a graph with diameter d, a d-layer GNN already connects every node to every other node.

Exercise 4.1: Path Graph Diameter Derive

For a path graph P₅ (5 nodes in a line: 0-1-2-3-4), what is the diameter? (Diameter = longest shortest path between any two nodes.)

diameter

Show derivation

Shortest path from 0 to 4 = 4 edges (0→1→2→3→4)

This is the maximum over all pairs → diameter = 4

On this graph, a 4-layer GCN lets every node see every other node. After that, additional layers only smooth further. For real-world social networks with diameter ~6 (six degrees of separation), even a 3-layer GNN covers most of the graph.

Exercise 4.2: Laplacian of P₃ Derive

The graph Laplacian is L = D - A. For the path graph P₃ (nodes 0-1-2): compute L and find its eigenvalues. (Hint: L is a 3×3 symmetric matrix.)

The smallest eigenvalue of any graph Laplacian is always 0 (eigenvector = all-ones). The second-smallest eigenvalue λ₂ is the algebraic connectivity (Fiedler value).

λ₂ (second eigenvalue)

Show derivation

P₃: A = [[0,1,0],[1,0,1],[0,1,0]], D = diag(1,2,1)

L = D - A = [[1,-1,0],[-1,2,-1],[0,-1,1]]

det(L - λI) = 0

(1-λ)[(2-λ)(1-λ) - 1] - (-1)[(-1)(1-λ)] = 0

(1-λ)(2 - 3λ + λ² - 1) + (-(1-λ)) = 0

(1-λ)(λ² - 3λ + 1) - (1-λ) = 0

(1-λ)(λ² - 3λ) = 0

λ(1-λ)(λ - 3) = 0

Eigenvalues: λ = 0, 1, 3

λ₂ = 1. This measures how well-connected the graph is. For the complete graph K₃, λ₂ = 3 (maximally connected). For a disconnected graph, λ₂ = 0. Higher λ₂ means faster mixing, which means faster over-smoothing in GNNs.

Exercise 4.3: Smoothing Rate Trace

You run a GCN on two graphs: (A) a star graph (one center connected to all others) with diameter 2, and (B) a chain graph with diameter 20. Which over-smooths faster?

Graph A (star) — all nodes are within 2 hops of each other, so 2 layers already mix everything Graph B (chain) — longer chains smooth faster due to more total edges Both over-smooth at the same rate Neither — over-smoothing only happens with specific activation functions

Show reasoning

The star graph has diameter 2, so after 2 GCN layers every node's receptive field covers the entire graph. The chain has diameter 20, so you'd need 20 layers to fully connect it. Over-smoothing is directly tied to the spectral gap (λ₂) of the normalized Laplacian — larger gap = faster convergence to the stationary distribution = faster over-smoothing. The star graph has a large spectral gap.

Exercise 4.4: Residual Connections Trace

Which technique does NOT help mitigate over-smoothing?

Residual connections: H^(l+1) = GCN(H^(l)) + H^(l) JK-Net: concatenate outputs from all layers, then project DropEdge: randomly drop edges during training Increasing the hidden dimension — wider layers still converge to the same stationary distribution

Show reasoning

Increasing width does NOT help with over-smoothing. The problem is that (D̄^-1Ã)^k converges to rank-1 regardless of feature dimension. Residual connections preserve the original signal. JK-Net accesses early-layer representations that haven't been smoothed yet. DropEdge slows the mixing by removing random edges each epoch. All three are proven strategies; wider layers are not.

Exercise 4.5: MADGap Metric Derive

The MAD (Mean Average Distance) between node representations measures over-smoothing. After 1 layer on our 4-node path graph, H' = [[0.5,0.5],[0.667,0.667],[0.333,0.667],[0.5,0.5]]. Compute the MAD: average pairwise L2 distance between all node representations.

There are C(4,2) = 6 pairs. Compute L2 distance for each, then average.

MAD

Show derivation

d(0,1) = √((0.667-0.5)² + (0.667-0.5)²) = √(0.028 + 0.028) = 0.236

d(0,2) = √((0.333-0.5)² + (0.667-0.5)²) = √(0.028 + 0.028) = 0.236

d(0,3) = √((0.5-0.5)² + (0.5-0.5)²) = 0

d(1,2) = √((0.333-0.667)² + (0.667-0.667)²) = 0.333

d(1,3) = √((0.5-0.667)² + (0.5-0.667)²) = 0.236

d(2,3) = √((0.5-0.333)² + (0.5-0.667)²) = 0.236

MAD = (0.236 + 0.236 + 0 + 0.333 + 0.236 + 0.236) / 6 = 0.213

The MAD after the initial features X is much higher. As you add layers, MAD decreases toward 0 — that's over-smoothing quantified. When MAD ≈ 0, all node representations are essentially identical and classification becomes impossible.

Chapter 5: Spectral Graph Theory

GCNs are actually an approximation of spectral graph convolution. To understand why, you need the graph Laplacian and its eigenvectors — the "Fourier basis" of the graph.

Graph Laplacian: L = D - A
Normalized Laplacian: L_norm = I - D^-1/2 A D^-1/2
Eigendecomposition: L = U Λ U^T

Spectral convolution:
g_θ * x = U g_θ(Λ) U^T x

Chebyshev approximation (ChebNet):
g_θ * x ≈ ∑_k=0^K θ_k T_k(L̃) x
where L̃ = 2L/λ_max - I, and T_k are Chebyshev polynomials.

GCN simplification (K=1):
g_θ * x ≈ θ(I + D^-1/2 A D^-1/2) x

GCN IS a first-order Chebyshev filter. When Kipf & Welling set K=1 and approximate λ_max ≈ 2, the spectral convolution simplifies to the familiar D̄^-1/2ÃD̄^-1/2 formula. Every GCN layer is literally a low-pass filter on the graph spectrum — it smooths out high-frequency signals (differences between neighbors), which explains over-smoothing.

Exercise 5.1: Laplacian of K₃ Derive

Compute the graph Laplacian L = D - A for the complete graph K₃ (3 nodes, every pair connected). Then find all three eigenvalues.

λ_max

Show derivation

K₃: A = [[0,1,1],[1,0,1],[1,1,0]], D = diag(2,2,2)

L = D - A = [[2,-1,-1],[-1,2,-1],[-1,-1,2]]

det(L - λI) = (2-λ)[(2-λ)² - 1] + 1[-(2-λ) - 1] + (-1)[1 + (2-λ)]

By symmetry of K_n, the Laplacian eigenvalues are known: λ = 0 (multiplicity 1) and λ = n (multiplicity n-1).

For K₃: λ = 0, 3, 3

The eigenvector for λ=0 is always [1,1,...,1]/√n (constant signal). The eigenvectors for λ=3 span the space orthogonal to this — they represent "high frequency" signals where adjacent nodes differ. A GCN (low-pass filter) suppresses these λ=3 components, smoothing towards the constant eigenvector.

Exercise 5.2: Normalized Laplacian Derive

For K₃, compute L_norm = I - D^-1/2AD^-1/2. What is L_norm[0][1]?

D^-1/2 = diag(1/√2, 1/√2, 1/√2) for K₃.

L_norm[0][1]

Show derivation

D^-1/2 A D^-1/2[0][1] = (1/√2) × A[0][1] × (1/√2) = 1 × 1/2 = 0.5

L_norm[0][1] = I[0][1] - 0.5 = 0 - 0.5 = -0.5

The normalized Laplacian has eigenvalues in [0, 2]. For K₃: λ = 0, 3/2, 3/2. The normalization rescales the spectrum into a fixed range, which is why Kipf & Welling approximate λ_max ≈ 2 for the Chebyshev polynomial.

Exercise 5.3: Chebyshev Polynomials Derive

Chebyshev polynomials: T₀(x) = 1, T₁(x) = x, T_k(x) = 2xT_k-1(x) - T_k-2(x). Compute T₂(x) and T₃(x).

What is T₃(x)?

3x² - 1 4x³ - 2x 4x³ - 3x 8x³ - 4x

Show derivation

T₀(x) = 1

T₁(x) = x

T₂(x) = 2x · T₁(x) - T₀(x) = 2x² - 1

T₃(x) = 2x · T₂(x) - T₁(x) = 2x(2x² - 1) - x = 4x³ - 2x - x = 4x³ - 3x

In ChebNet, K=3 means the filter uses T₀(L̃), T₁(L̃), T₂(L̃), T₃(L̃) with learnable coefficients θ₀,...,θ₃. Each T_k(L̃)x involves exactly k hops of neighborhood aggregation, so K controls the receptive field size. GCN (K=1) only uses T₀ and T₁.

Exercise 5.4: Spectral Complexity Trace

Full spectral convolution requires eigendecomposing L, which costs O(N³). How does ChebNet avoid this?

It uses a smaller graph Chebyshev polynomials T_k(L)x can be computed via the recurrence using only sparse matrix-vector products (L is sparse), costing O(K|E|) It approximates eigenvalues using random projections It computes eigenvalues only for a subgraph

Show reasoning

The Chebyshev recurrence T_k(L)x = 2L · T_k-1(L)x - T_k-2(L)x only needs matrix-vector multiplies with L. Since L is sparse (|E| nonzeros), each multiply is O(|E|). K iterations cost O(K|E|), which is linear in graph size. No eigendecomposition needed. This is the fundamental insight that made spectral GNNs practical.

Exercise 5.5: GCN as Low-Pass Filter Derive

The GCN filter in the spectral domain is g(λ) = 1 - λ (for the normalized Laplacian). For eigenvalues λ = 0 and λ = 1.5 of K₃, what are the filter responses g(0) and g(1.5)? Which component is amplified?

g(1.5)

Show derivation

g(0) = 1 - 0 = 1 // low-frequency: fully preserved

g(1.5) = 1 - 1.5 = -0.5 // high-frequency: attenuated (and flipped)

The low-frequency component (λ = 0, the "constant" signal) has gain 1 — perfectly preserved. The high-frequency component (λ = 1.5) has gain -0.5 — attenuated by half. After k layers, this becomes (-0.5)^k, which rapidly goes to 0. This IS over-smoothing, viewed through the spectral lens: high-frequency information (node differences) is exponentially suppressed.

Chapter 6: Knowledge Graphs

Knowledge graphs encode facts as (head, relation, tail) triples — (Einstein, born_in, Germany). The challenge: embed entities and relations into continuous vector spaces so that the geometric structure reflects semantic relationships. Two classic approaches:

TransE: h + r ≈ t
Score: f(h, r, t) = -||h + r - t||
Higher score (less negative) = more plausible triple.

DistMult: Score = ∑_i h_i · r_i · t_i
Symmetric: score(h, r, t) = score(t, r, h)

ComplEx: Re(∑_i h_i · r_i · t̄_i)
Complex-valued embeddings. Handles asymmetric relations.

TransE = translation in embedding space. "born_in" is a vector r. Adding r to Einstein's embedding should land near Germany's embedding. If it does, the fact is plausible. Training pushes h+r close to t for real triples and away for fake ones. The beauty: similar relations get similar vectors, so the model can generalize to unseen triples.

Exercise 6.1: TransE Score Derive

Given embeddings: h = [1, 2, 0], r = [1, -1, 1], t = [2, 1, 1]. Compute the TransE score f(h, r, t) = -||h + r - t||₂.

score

Show derivation

h + r = [1+1, 2+(-1), 0+1] = [2, 1, 1]

h + r - t = [2-2, 1-1, 1-1] = [0, 0, 0]

||h + r - t||₂ = 0

f(h, r, t) = -0 = 0

A perfect score of 0 means the translation h + r lands exactly on t. In practice, scores are slightly negative due to imperfect embeddings. The loss function (margin-based) pushes correct triples toward 0 and incorrect ones below -γ (the margin).

Exercise 6.2: DistMult Score Derive

Same embeddings: h = [1, 2, 0], r = [1, -1, 1], t = [2, 1, 1]. Compute the DistMult score = ∑_i h_i · r_i · t_i.

score

Show derivation

i=0: 1 × 1 × 2 = 2

i=1: 2 × (-1) × 1 = -2

i=2: 0 × 1 × 1 = 0

Score = 2 + (-2) + 0 = 0

Both TransE and DistMult give 0 for this triple, but for different geometric reasons. TransE measures translational distance; DistMult measures multiplicative interaction. Note that DistMult(h, r, t) = DistMult(t, r, h) — it's symmetric. This means it can't distinguish "born_in" from "birthplace_of." TransE can, because h + r = t but t + r ≠ h.

Exercise 6.3: TransE Limitation Trace

TransE models h + r ≈ t. What type of relation can it NOT model well?

Symmetric relations (married_to) 1-to-N relations (born_in with multiple people born in the same city) — forces all h entities to have the same embedding Long-range relations Relations between entities in different categories

Show reasoning

If (Einstein, born_in, Germany) and (Hilbert, born_in, Germany) are both true, then h_Einstein + r ≈ t_Germany and h_Hilbert + r ≈ t_Germany. This forces h_Einstein ≈ h_Hilbert. For 1-to-N relations, all head entities sharing the same (relation, tail) get collapsed to the same point. Symmetric relations are also problematic: h + r = t and t + r = h forces r = 0. RotatE and ComplEx address these issues using rotations/complex numbers.

Exercise 6.4: Negative Sampling Derive

In TransE training with margin γ = 1, you have positive triple (h, r, t) with score -0.3 and negative triple (h, r, t') with score -1.5. Compute the margin loss: max(0, γ + f_pos - f_neg). Should the model update?

loss

Show derivation

Loss = max(0, 1 + (-0.3) - (-1.5))

= max(0, 1 - 0.3 + 1.5) = max(0, 2.2)

Wait — the TransE score for the positive should be HIGHER (closer to 0) than the negative. Let me reconsider the formula. The standard margin loss is:

L = max(0, γ + d_pos - d_neg)

where d = ||h + r - t|| (distance, not score). So d_pos = 0.3, d_neg = 1.5:

L = max(0, 1 + 0.3 - 1.5) = max(0, -0.2) = 0

Actually, the answer depends on whether we use score or distance. Using the negative score formulation where we want score_pos > score_neg + γ:

L = max(0, γ - score_pos + score_neg)

= max(0, 1 - (-0.3) + (-1.5)) = max(0, 1 + 0.3 - 1.5) = max(0, -0.2) = 0

Loss = 0. No update needed. The positive triple already scores higher than the negative by more than the margin. The model has successfully separated this pair.

Exercise 6.5: KG Embedding Parameters Derive

A knowledge graph has 15,000 entities, 237 relations, and you use TransE with embedding dimension d = 200. How many parameters total?

million params

Show derivation

Entity embeddings: 15,000 × 200 = 3,000,000

Relation embeddings: 237 × 200 = 47,400

Total = 3,047,400 ≈ 3.05M

The entity embeddings dominate. This is the FB15k-237 benchmark scale. Larger KGs like Wikidata (100M+ entities) at d=200 would need 20B+ parameters just for entity embeddings — this is why KG embedding at web scale requires careful engineering (e.g., PBG by Facebook uses distributed training).

Chapter 7: Link Prediction

Given a graph with some edges, predict which edges are missing. This is the bread-and-butter task of graph ML — friend recommendation, drug-target interaction, paper citation prediction. We'll cover both classical heuristics and GNN-based approaches.

Classical heuristics for nodes u, v:
Common Neighbors: CN(u,v) = |N(u) ∩ N(v)|
Jaccard: J(u,v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|
Adamic-Adar: AA(u,v) = ∑_{w ∈ N(u) ∩ N(v)} 1 / log(deg(w))

GNN-based:
score(u,v) = z_u^T · z_v // dot product of learned embeddings

Adamic-Adar penalizes popular common neighbors. A common neighbor with degree 1000 (a celebrity) is less informative than one with degree 5 (a niche interest). The 1/log(deg) weighting captures this — rare shared neighbors matter more. This simple heuristic is surprisingly hard for GNNs to beat on many benchmarks.

Using our 5-node graph: edges (0,1), (0,2), (1,2), (1,3), (2,3), (3,4).

Exercise 7.1: Common Neighbors Derive

Compute CN(0, 3) — the number of common neighbors of node 0 and node 3.

common neighbors

Show derivation

N(0) = {1, 2}

N(3) = {1, 2, 4}

N(0) ∩ N(3) = {1, 2}

CN(0, 3) = 2

Two common neighbors suggest a likely edge between 0 and 3. In social networks, this is "friends of friends" — the more mutual friends two people have, the more likely they should be connected.

Exercise 7.2: Jaccard Coefficient Derive

Compute J(0, 3) = |N(0) ∩ N(3)| / |N(0) ∪ N(3)|.

Jaccard

Show derivation

N(0) ∪ N(3) = {1, 2, 4}

J(0, 3) = |{1, 2}| / |{1, 2, 4}| = 2/3 ≈ 0.667

Jaccard normalizes by the union. This prevents high-degree nodes from dominating: CN(celebrity, anyone) is large simply because the celebrity has many neighbors. Jaccard says "what fraction of their combined connections do they share?" — a more meaningful similarity measure.

Exercise 7.3: Adamic-Adar Derive

Compute AA(0, 3). Common neighbors are {1, 2} with degrees deg(1) = 3, deg(2) = 3.

Adamic-Adar

Show derivation

AA(0, 3) = 1/log(deg(1)) + 1/log(deg(2))

= 1/log(3) + 1/log(3)

= 1/1.099 + 1/1.099 = 0.910 + 0.910 = 1.82

If the common neighbors had degree 100 instead: AA = 2/log(100) = 2/4.605 = 0.434 — much lower. High-degree common neighbors contribute less. Adamic-Adar consistently outperforms CN and Jaccard on real-world link prediction benchmarks.

Exercise 7.4: GNN Link Score Derive

After running a GNN, node 0 has embedding z₀ = [0.5, 0.5, 0.3] and node 3 has z₃ = [0.4, 0.6, 0.2]. Compute the link prediction score using dot product.

score

Show derivation

score = z₀ · z₃ = 0.5×0.4 + 0.5×0.6 + 0.3×0.2

= 0.20 + 0.30 + 0.06 = 0.56

The dot product measures embedding similarity. Higher score = more likely edge. To convert to a probability, pass through sigmoid: σ(0.56) = 0.636. Training uses binary cross-entropy: existing edges are positive examples, sampled non-edges are negatives.

Exercise 7.5: AUC Computation Derive

You have 3 positive pairs (existing edges) with scores [0.8, 0.6, 0.5] and 3 negative pairs (non-edges) with scores [0.4, 0.3, 0.7]. AUC = fraction of (positive, negative) pairs where the positive has a higher score. Compute AUC.

AUC

Show derivation

Total pairs: 3 × 3 = 9

Pos 0.8 vs Neg [0.4, 0.3, 0.7]: wins 3/3

Pos 0.6 vs Neg [0.4, 0.3, 0.7]: wins 2/3 (loses to 0.7)

Pos 0.5 vs Neg [0.4, 0.3, 0.7]: wins 2/3 (loses to 0.7)

AUC = (3 + 2 + 2) / 9 = 7/9 ≈ 0.778

The negative pair with score 0.7 is causing problems — it's ranked higher than two positive pairs. AUC = 0.5 is random; AUC = 1.0 is perfect separation. 0.778 is decent but shows the model is confusing some non-edges for edges. State-of-the-art link prediction AUCs on citation networks are ~0.95+.

Exercise 7.6: Find the Bug Debug

This link prediction evaluation code has a data leakage bug. Click the problematic line.

function evaluate(graph, model) {
  // Split edges into train/test
  const [trainEdges, testEdges] = splitEdges(graph, 0.8);
  // Train GNN on full graph (all edges)
  model.train(graph);
  // Get embeddings
  const embeddings = model.embed(graph);
  // Evaluate on test edges
  return computeAUC(embeddings, testEdges);
}

Show explanation

Lines 4-5 (specifically line 4) are the bug. The model is trained on the FULL graph including test edges, then evaluated on those same test edges. This is data leakage — the model has already seen the "test" edges during training via the message passing (they're in the adjacency matrix). The fix: train the GNN only on the subgraph with train edges, removing test edges from the adjacency matrix during training.

Chapter 8: Graph Generation

Generating new graphs that look like examples from a training set — molecular graphs, circuit layouts, social network structures. The key challenge: graphs have no canonical ordering of nodes. The same graph can be represented by N! different adjacency matrices (one per node permutation).

GraphRNN: Autoregressive generation.
1. Choose a node ordering (e.g., BFS from random start)
2. Generate one ROW of the adjacency matrix at a time
3. Each row: use an RNN to predict edges to all previous nodes

For node π(i), predict: A[i][0], A[i][1], ..., A[i][i-1]
// only lower triangle — upper is symmetric

Node ordering matters!
BFS ordering groups nearby nodes together → short-range dependencies.
Random ordering creates long-range dependencies → harder to learn.

N! is the enemy. For n=4 nodes, there are 4! = 24 possible orderings of the same graph. For n=20, it's 20! ≈ 2.4 × 10¹⁸. GraphRNN uses BFS ordering to reduce the effective space, but the permutation invariance problem is fundamental. This is why graph generation is much harder than image or text generation.

Exercise 8.1: Node Orderings Derive

How many distinct node orderings (permutations) exist for a graph with n = 4 nodes?

orderings

Show derivation

4! = 4 × 3 × 2 × 1 = 24

Each permutation gives a different adjacency matrix representation of the same graph. If a graph has automorphisms (symmetries), some permutations give identical adjacency matrices, so the number of distinct matrices is 4! / |Aut(G)|. For a graph with no symmetry, all 24 matrices are different.

Exercise 8.2: GraphRNN Sequence Length Derive

In GraphRNN, node i needs to predict edges to nodes 0, 1, ..., i-1. For a graph with n=5 nodes, how many total binary decisions (edge/no-edge) must be made to generate the full adjacency matrix?

decisions

Show derivation

Node 0: 0 decisions (no previous nodes)

Node 1: 1 decision (edge to node 0?)

Node 2: 2 decisions (edges to 0, 1?)

Node 3: 3 decisions

Node 4: 4 decisions

Total = 0 + 1 + 2 + 3 + 4 = 10 = n(n-1)/2 = 5×4/2 = 10

This is exactly the number of entries in the lower triangle of the adjacency matrix — one per possible edge. The sequence length grows quadratically. For n = 100, that's 4,950 decisions. GraphRNN uses a "BFS edge ordering" trick to reduce the effective sequence length by only predicting edges to recent BFS neighbors.

Exercise 8.3: Maximum Edges Derive

What is the maximum number of edges in a simple undirected graph with n = 10 nodes? (No self-loops, no multi-edges.)

max edges

Show derivation

C(n, 2) = n(n-1)/2 = 10 × 9 / 2 = 45

This is the complete graph K₁₀. Real molecular graphs are much sparser (each atom bonds to 1-4 others), so the edge density is typically |E| / C(n,2) < 0.1. Graph generation models exploit this sparsity — most edge predictions should be "no edge."

Exercise 8.4: BFS Ordering Benefit Trace

Why does BFS ordering help GraphRNN compared to random ordering?

BFS produces unique orderings (no permutation ambiguity) In BFS order, node i is only connected to nodes within a small window behind it, so the RNN only needs to predict a short vector instead of all i-1 previous nodes BFS makes the graph smaller BFS ordering is faster to compute

Show reasoning

BFS visits nodes layer by layer from a starting node. Node i in BFS order is typically connected only to nodes within the current and previous BFS layer — a narrow window of size M. Instead of predicting i-1 edge decisions, node i only predicts M decisions (M is often 3-5 for sparse graphs). This reduces the effective sequence length from O(n²) to O(nM), making generation tractable for larger graphs.

Exercise 8.5: Graph Isomorphism Trace

Two generated graphs G and G' have identical structure but different node labels. When evaluating a graph generative model, should they count as the same or different?

Same — we care about graph isomorphism (structure), not node identity. Evaluation metrics like MMD on graph statistics (degree distribution, clustering, orbit counts) are permutation-invariant Different — node labels matter for generation It depends on the GAN discriminator

Show reasoning

Graph generation quality is measured by comparing DISTRIBUTIONS of graph statistics (degree distribution, clustering coefficient distribution, orbit counts) between generated and real graphs, using metrics like MMD (Maximum Mean Discrepancy). These statistics are permutation-invariant — they don't depend on which node is labeled 0, 1, 2, etc. Two isomorphic graphs contribute identically to all these metrics.

Topic	Lesson
Transformer fundamentals	Transformer — From Absolute Zero
Transformer math practice	Transformer Math Workbook
Contrastive learning	Contrastive CLIP — From Absolute Zero
RL on graphs	RL Algorithms — From Absolute Zero

Graph Neural Networks

Chapter 0: Graph Basics

Chapter 1: Node Embeddings

Chapter 2: GNN Message Passing

Chapter 3: GCN vs GraphSAGE vs GAT

Chapter 4: Over-Smoothing

Chapter 5: Spectral Graph Theory

Chapter 6: Knowledge Graphs

Chapter 7: Link Prediction

Chapter 8: Graph Generation

Chapter 9: Capstone

Related Lessons