From adjacency matrices to molecular property prediction. Every GNN concept derived by hand with concrete graphs, exact numbers, and working code — no hand-waving.
Before touching neural networks, you need to be fluent in the language of graphs. Every GNN algorithm operates on adjacency matrices, degree matrices, and local neighborhood structure. If you cannot construct these by hand for a small graph, you have no hope of understanding what a GNN actually computes.
Consider this 5-node undirected graph:
How many edges does the 5-node graph above have? (Hint: count 1s in A, then adjust for symmetry.)
Equivalently, |E| = ½ ∑i deg(i) = ½(2 + 3 + 3 + 3 + 1) = 6. The sum of all degrees always equals twice the number of edges — this is the handshaking lemma.
What is the average degree of the 5-node graph?
Equivalently, average degree = 2|E| / |V| = 2 × 6 / 5 = 2.4. For sparse real-world graphs (social networks, molecules), the average degree is typically O(1) even as |V| grows to millions.
Compute the local clustering coefficient for node 1. The formula is C(v) = 2 × (edges among neighbors) / (deg(v) × (deg(v) - 1)). Node 1's neighbors are {0, 2, 3}.
Count how many edges exist among {0, 2, 3}, then plug into the formula.
The clustering coefficient measures how close a node's neighbors are to forming a clique. C = 1 means all neighbors are connected to each other. C = 0 means no neighbors are connected. Node 1 has a fairly high clustering coefficient — 2 out of 3 possible neighbor-pairs are connected.
Each term A[i][k] × A[k][j] = 1 only if there's an edge i→k AND k→j — a length-2 path through k. The sum counts all such paths. More generally, An[i][j] counts paths of length n. This is why matrix powers of A encode multi-hop connectivity — and why GNN layers (which multiply by A) aggregate information from k-hop neighborhoods after k layers.
Using the adjacency matrix above, compute A²[0][3]. That is, how many 2-hop paths exist from node 0 to node 3?
Two 2-hop paths: 0→1→3 and 0→2→3. Node 0 has no direct edge to node 3 (A[0][3] = 0), but after 2 hops, there are 2 ways to get there. This is exactly the information a 2-layer GNN would aggregate.
Given an adjacency matrix as a 2D array, return the degree matrix (also 2D, diagonal).
javascript function degreeMatrix(A) { const n = A.length; const D = Array.from({length: n}, () => Array(n).fill(0)); for (let i = 0; i < n; i++) { D[i][i] = A[i].reduce((s, v) => s + v, 0); } return D; }
Before GNNs existed, the dominant approach was to learn node embeddings using random walks. The idea: if two nodes appear near each other in random walks on the graph, they should have similar embedding vectors. This is the graph equivalent of word2vec — "you shall know a node by the company it keeps."
Using the 5-node graph from Ch 0, starting at node 0 with a uniform random walk: what are the possible nodes at step 1? And from node 1 (if we land there), what are the possible nodes at step 2?
Node 4 is 3 hops from node 0 (0→1→3→4 or 0→2→3→4). This means a walk of length 2 cannot possibly reach it. Longer walks capture more distant structural information — but also blur local patterns.
In Node2Vec with p=1, q=2: we just walked from node 2 to node 1 (so source s=2, current t=1). The neighbors of node 1 are {0, 2, 3}. Compute the normalized transition probabilities to each neighbor.
For each neighbor x of t=1: check d(s=2, x), compute α, then normalize.
Wait — let's re-check. d(s,x) is shortest-path distance in the original graph between s=2 and x.
Hmm, all neighbors of t=1 happen to be direct neighbors of s=2, so α = 1 for all. But for q to matter, we need a neighbor of t that is NOT a neighbor of s. Let's consider a different scenario.
Actually, let's trace from s=0 to t=1 instead, where node 3 is a neighbor of t=1 but d(0,3) = 2:
With q=2, the walk is discouraged from moving to node 3 (which is far from the source). This makes the walk more BFS-like, staying near the source's local neighborhood. With q=0.5, it would be encouraged to explore outward.
Skip-gram takes each node in the walk as a "center word" and pairs it with all nodes within the window. With window=1, node v is paired with u (left context) and w (right context). When u is the center, it's paired with v. When w is the center, it's paired with v. All symmetric pairs are generated.
In practice, negative sampling replaces the expensive softmax. For each positive pair, sample k random "negative" nodes and train a binary classifier to distinguish real context from noise.
Typical embedding dimensions are 64-256. Too small and you can't capture the graph's structure. Too large and you overfit (especially with limited random walks) and waste memory. The embedding table has |V| × d parameters — at d = 10,000 that's 100M parameters just for the lookup table on a 10K-node graph, which is absurdly overparameterized.
Given an adjacency list (array of arrays) and a start node, return a random walk of the specified length. Use uniform random neighbor selection (DeepWalk-style).
javascript function randomWalk(adjList, start, length) { const walk = [start]; let curr = start; for (let i = 0; i < length; i++) { const neighbors = adjList[curr]; curr = neighbors[Math.floor(Math.random() * neighbors.length)]; walk.push(curr); } return walk; }
The fundamental operation in any GNN is message passing: each node collects features from its neighbors, aggregates them, and updates its own representation. One layer of a Graph Convolutional Network (GCN) computes:
Let's trace this on a concrete 4-node graph:
Compute D̄-1Ã for the 4-node path graph above. What is the entry at row 1, column 2? (Row 1 = node 1's update equation.)
Node 1 has degree 3 in à (neighbors 0, 2, plus self-loop). So each neighbor's feature contributes 1/3 to the average. This is the "mean aggregation" that defines GCN.
Using D̄-1Ã from above and X = [[1,0],[0,1],[1,1],[0,0]], W = I (identity), compute the output feature for node 1 after one GCN layer (before ReLU). What is the first component?
h'1 = (D̄-1Ã)[1,:] · X · W. With W = I, this is just the weighted average of neighbor features.
Node 1 averages its own features [0,1] with neighbors 0's [1,0] and 2's [1,1]. The output [0.667, 0.667] is the centroid of those three feature vectors. After ReLU (which is a no-op since both values are positive), this becomes node 1's new representation.
Compute the full output H' = D̄-1ÃX (with W=I, before ReLU) for all 4 nodes. What is h'3[0] (node 3, first feature)?
Notice: nodes 0 and 3 have the same output [0.5, 0.5] even though they started with different features! Both are degree-1 nodes at the ends of a path, so they average their own features with their one neighbor's. After one layer, symmetrical positions in the graph get similar representations. This is a preview of the over-smoothing problem we'll study in Chapter 4.
Without self-loops, D-1A averages ONLY the neighbors' features, throwing away the node's own features entirely. After one layer, node 0 with features [1,0] would get h' = x1 = [0,1] (its only neighbor). Its own identity is completely lost. Adding I to A ensures each node includes itself in the aggregation.
A 3-layer GCN with dimensions 32 → 64 → 64 → 16 (input features → hidden → hidden → output classes). Each layer has W(l) (no bias). How many learnable parameters total?
Notice: the parameter count does NOT depend on the number of nodes or edges! The same W is applied to every node. This is why GNNs generalize across graphs of different sizes — unlike MLPs, which would need N × din parameters in the first layer alone.
This GCN forward pass has a bug. Node features are exploding after a few layers. Click the buggy line.
function gcnLayer(A, X, W) { const n = A.length; // Add self-loops const At = A.map((row,i) => row.map((v,j) => i===j ? v+1 : v)); // Compute degree const deg = At.map(row => row.reduce((s,v) => s+v, 0)); // Multiply At * X (no normalization!) const AX = matmul(At, X); // Project and apply ReLU return relu(matmul(AX, W)); }
Line 7 is the bug. The code computes ÃX but never normalizes by D̄-1. It computes the degree on line 6 but never uses it! Without D̄-1 normalization, each node sums its neighbors' features instead of averaging. High-degree nodes accumulate huge values, causing features to explode exponentially with each layer.
The fix: divide each row of AX by the corresponding degree. AX[i] = AX[i].map(v => v / deg[i]).
Three GNN architectures dominate practice. They all follow the message passing framework but differ in HOW they aggregate neighbor information. Let's trace one layer of each on the same graph to see the difference.
Compute GCN output for node 2 (with self-loop). N(2) ∪ {2} = {1, 2, 3}. Features: x1=[0,1], x2=[1,1], x3=[0,0]. W = I. What is h'2[1] (second component, before ReLU)?
With W = I, GCN simply averages the features. h'2[1] = 0.667. The mean operation treats all neighbors equally — node 3 with its "empty" features [0,0] dilutes the average just as much as informative node 1.
GraphSAGE concatenates the node's own features with the aggregated neighbor features. For node 2, neighbors = {1, 3}. Using mean aggregation and W = I (applied after concatenation to [dself + dneighbors] → dout). What is the concatenated vector before W is applied?
The key difference from GCN: the node's OWN features are kept separate from the neighbors' aggregate. This means the weight matrix W can learn different transformations for self-information vs. neighborhood information. The concatenated vector has dimension 2d (4 in this case), and W maps it back to dout.
In GAT, attention for node 2 over neighbors {1, 2(self), 3} with W = I: attention score evu = aT[Wxv || Wxu]. Let a = [0.2, 0.3, -0.1, 0.4] (length 2d=4). Compute e2,1 (attention score from node 2 to node 1, before softmax).
Wait — let me recompute carefully:
Hmm, I stated 0.4 as the answer. Let me re-examine the setup. With a = [0.2, 0.3, -0.1, 0.4]:
The raw attention score is 0.9. After LeakyReLU (positive value, so unchanged) and softmax over all neighbors, this becomes a normalized weight. The key insight: GAT learns a to assign different importance to different neighbors based on their features.
GraphSAGE's neighbor sampling is the key innovation for scalability. Instead of aggregating all 10,000 neighbors, it samples a fixed k (typically 10-25). This makes computation cost per node O(k) regardless of degree. GCN's mean aggregation still requires touching all neighbors, and GAT needs to compute attention scores for all neighbors. In practice, mini-batch training with sampling (as in GraphSAGE and later DGL/PyG implementations) is essential for large graphs.
A GAT layer uses K=8 attention heads, each with dout=8. Intermediate heads are concatenated, final head is averaged. How many output features after: (a) an intermediate layer? (b) the final layer?
The GAT paper uses concatenation for hidden layers (to preserve expressiveness) but averaging for the output layer (to produce a fixed-size output for classification). This is analogous to multi-head attention in transformers, where heads are concatenated then projected.
You just built a 10-layer GCN and... it performs worse than a 2-layer one. What happened? Each GCN layer averages a node's features with its neighbors. After k layers, each node's representation is influenced by all nodes within k hops. On small-diameter graphs, a few layers are enough to mix ALL nodes' features together, making every representation converge to the same vector. This is over-smoothing.
For a path graph P5 (5 nodes in a line: 0-1-2-3-4), what is the diameter? (Diameter = longest shortest path between any two nodes.)
On this graph, a 4-layer GCN lets every node see every other node. After that, additional layers only smooth further. For real-world social networks with diameter ~6 (six degrees of separation), even a 3-layer GNN covers most of the graph.
The graph Laplacian is L = D - A. For the path graph P3 (nodes 0-1-2): compute L and find its eigenvalues. (Hint: L is a 3×3 symmetric matrix.)
The smallest eigenvalue of any graph Laplacian is always 0 (eigenvector = all-ones). The second-smallest eigenvalue λ2 is the algebraic connectivity (Fiedler value).
λ2 = 1. This measures how well-connected the graph is. For the complete graph K3, λ2 = 3 (maximally connected). For a disconnected graph, λ2 = 0. Higher λ2 means faster mixing, which means faster over-smoothing in GNNs.
The star graph has diameter 2, so after 2 GCN layers every node's receptive field covers the entire graph. The chain has diameter 20, so you'd need 20 layers to fully connect it. Over-smoothing is directly tied to the spectral gap (λ2) of the normalized Laplacian — larger gap = faster convergence to the stationary distribution = faster over-smoothing. The star graph has a large spectral gap.
Increasing width does NOT help with over-smoothing. The problem is that (D̄-1Ã)k converges to rank-1 regardless of feature dimension. Residual connections preserve the original signal. JK-Net accesses early-layer representations that haven't been smoothed yet. DropEdge slows the mixing by removing random edges each epoch. All three are proven strategies; wider layers are not.
The MAD (Mean Average Distance) between node representations measures over-smoothing. After 1 layer on our 4-node path graph, H' = [[0.5,0.5],[0.667,0.667],[0.333,0.667],[0.5,0.5]]. Compute the MAD: average pairwise L2 distance between all node representations.
There are C(4,2) = 6 pairs. Compute L2 distance for each, then average.
The MAD after the initial features X is much higher. As you add layers, MAD decreases toward 0 — that's over-smoothing quantified. When MAD ≈ 0, all node representations are essentially identical and classification becomes impossible.
GCNs are actually an approximation of spectral graph convolution. To understand why, you need the graph Laplacian and its eigenvectors — the "Fourier basis" of the graph.
Compute the graph Laplacian L = D - A for the complete graph K3 (3 nodes, every pair connected). Then find all three eigenvalues.
By symmetry of Kn, the Laplacian eigenvalues are known: λ = 0 (multiplicity 1) and λ = n (multiplicity n-1).
The eigenvector for λ=0 is always [1,1,...,1]/√n (constant signal). The eigenvectors for λ=3 span the space orthogonal to this — they represent "high frequency" signals where adjacent nodes differ. A GCN (low-pass filter) suppresses these λ=3 components, smoothing towards the constant eigenvector.
For K3, compute Lnorm = I - D-1/2AD-1/2. What is Lnorm[0][1]?
D-1/2 = diag(1/√2, 1/√2, 1/√2) for K3.
The normalized Laplacian has eigenvalues in [0, 2]. For K3: λ = 0, 3/2, 3/2. The normalization rescales the spectrum into a fixed range, which is why Kipf & Welling approximate λmax ≈ 2 for the Chebyshev polynomial.
Chebyshev polynomials: T0(x) = 1, T1(x) = x, Tk(x) = 2xTk-1(x) - Tk-2(x). Compute T2(x) and T3(x).
In ChebNet, K=3 means the filter uses T0(L̃), T1(L̃), T2(L̃), T3(L̃) with learnable coefficients θ0,...,θ3. Each Tk(L̃)x involves exactly k hops of neighborhood aggregation, so K controls the receptive field size. GCN (K=1) only uses T0 and T1.
The Chebyshev recurrence Tk(L)x = 2L · Tk-1(L)x - Tk-2(L)x only needs matrix-vector multiplies with L. Since L is sparse (|E| nonzeros), each multiply is O(|E|). K iterations cost O(K|E|), which is linear in graph size. No eigendecomposition needed. This is the fundamental insight that made spectral GNNs practical.
The GCN filter in the spectral domain is g(λ) = 1 - λ (for the normalized Laplacian). For eigenvalues λ = 0 and λ = 1.5 of K3, what are the filter responses g(0) and g(1.5)? Which component is amplified?
The low-frequency component (λ = 0, the "constant" signal) has gain 1 — perfectly preserved. The high-frequency component (λ = 1.5) has gain -0.5 — attenuated by half. After k layers, this becomes (-0.5)k, which rapidly goes to 0. This IS over-smoothing, viewed through the spectral lens: high-frequency information (node differences) is exponentially suppressed.
Knowledge graphs encode facts as (head, relation, tail) triples — (Einstein, born_in, Germany). The challenge: embed entities and relations into continuous vector spaces so that the geometric structure reflects semantic relationships. Two classic approaches:
Given embeddings: h = [1, 2, 0], r = [1, -1, 1], t = [2, 1, 1]. Compute the TransE score f(h, r, t) = -||h + r - t||2.
A perfect score of 0 means the translation h + r lands exactly on t. In practice, scores are slightly negative due to imperfect embeddings. The loss function (margin-based) pushes correct triples toward 0 and incorrect ones below -γ (the margin).
Same embeddings: h = [1, 2, 0], r = [1, -1, 1], t = [2, 1, 1]. Compute the DistMult score = ∑i hi · ri · ti.
Both TransE and DistMult give 0 for this triple, but for different geometric reasons. TransE measures translational distance; DistMult measures multiplicative interaction. Note that DistMult(h, r, t) = DistMult(t, r, h) — it's symmetric. This means it can't distinguish "born_in" from "birthplace_of." TransE can, because h + r = t but t + r ≠ h.
If (Einstein, born_in, Germany) and (Hilbert, born_in, Germany) are both true, then hEinstein + r ≈ tGermany and hHilbert + r ≈ tGermany. This forces hEinstein ≈ hHilbert. For 1-to-N relations, all head entities sharing the same (relation, tail) get collapsed to the same point. Symmetric relations are also problematic: h + r = t and t + r = h forces r = 0. RotatE and ComplEx address these issues using rotations/complex numbers.
In TransE training with margin γ = 1, you have positive triple (h, r, t) with score -0.3 and negative triple (h, r, t') with score -1.5. Compute the margin loss: max(0, γ + fpos - fneg). Should the model update?
Wait — the TransE score for the positive should be HIGHER (closer to 0) than the negative. Let me reconsider the formula. The standard margin loss is:
where d = ||h + r - t|| (distance, not score). So dpos = 0.3, dneg = 1.5:
Actually, the answer depends on whether we use score or distance. Using the negative score formulation where we want scorepos > scoreneg + γ:
Loss = 0. No update needed. The positive triple already scores higher than the negative by more than the margin. The model has successfully separated this pair.
A knowledge graph has 15,000 entities, 237 relations, and you use TransE with embedding dimension d = 200. How many parameters total?
The entity embeddings dominate. This is the FB15k-237 benchmark scale. Larger KGs like Wikidata (100M+ entities) at d=200 would need 20B+ parameters just for entity embeddings — this is why KG embedding at web scale requires careful engineering (e.g., PBG by Facebook uses distributed training).
Given a graph with some edges, predict which edges are missing. This is the bread-and-butter task of graph ML — friend recommendation, drug-target interaction, paper citation prediction. We'll cover both classical heuristics and GNN-based approaches.
Using our 5-node graph: edges (0,1), (0,2), (1,2), (1,3), (2,3), (3,4).
Compute CN(0, 3) — the number of common neighbors of node 0 and node 3.
Two common neighbors suggest a likely edge between 0 and 3. In social networks, this is "friends of friends" — the more mutual friends two people have, the more likely they should be connected.
Compute J(0, 3) = |N(0) ∩ N(3)| / |N(0) ∪ N(3)|.
Jaccard normalizes by the union. This prevents high-degree nodes from dominating: CN(celebrity, anyone) is large simply because the celebrity has many neighbors. Jaccard says "what fraction of their combined connections do they share?" — a more meaningful similarity measure.
Compute AA(0, 3). Common neighbors are {1, 2} with degrees deg(1) = 3, deg(2) = 3.
If the common neighbors had degree 100 instead: AA = 2/log(100) = 2/4.605 = 0.434 — much lower. High-degree common neighbors contribute less. Adamic-Adar consistently outperforms CN and Jaccard on real-world link prediction benchmarks.
After running a GNN, node 0 has embedding z0 = [0.5, 0.5, 0.3] and node 3 has z3 = [0.4, 0.6, 0.2]. Compute the link prediction score using dot product.
The dot product measures embedding similarity. Higher score = more likely edge. To convert to a probability, pass through sigmoid: σ(0.56) = 0.636. Training uses binary cross-entropy: existing edges are positive examples, sampled non-edges are negatives.
You have 3 positive pairs (existing edges) with scores [0.8, 0.6, 0.5] and 3 negative pairs (non-edges) with scores [0.4, 0.3, 0.7]. AUC = fraction of (positive, negative) pairs where the positive has a higher score. Compute AUC.
The negative pair with score 0.7 is causing problems — it's ranked higher than two positive pairs. AUC = 0.5 is random; AUC = 1.0 is perfect separation. 0.778 is decent but shows the model is confusing some non-edges for edges. State-of-the-art link prediction AUCs on citation networks are ~0.95+.
This link prediction evaluation code has a data leakage bug. Click the problematic line.
function evaluate(graph, model) { // Split edges into train/test const [trainEdges, testEdges] = splitEdges(graph, 0.8); // Train GNN on full graph (all edges) model.train(graph); // Get embeddings const embeddings = model.embed(graph); // Evaluate on test edges return computeAUC(embeddings, testEdges); }
Lines 4-5 (specifically line 4) are the bug. The model is trained on the FULL graph including test edges, then evaluated on those same test edges. This is data leakage — the model has already seen the "test" edges during training via the message passing (they're in the adjacency matrix). The fix: train the GNN only on the subgraph with train edges, removing test edges from the adjacency matrix during training.
Generating new graphs that look like examples from a training set — molecular graphs, circuit layouts, social network structures. The key challenge: graphs have no canonical ordering of nodes. The same graph can be represented by N! different adjacency matrices (one per node permutation).
How many distinct node orderings (permutations) exist for a graph with n = 4 nodes?
Each permutation gives a different adjacency matrix representation of the same graph. If a graph has automorphisms (symmetries), some permutations give identical adjacency matrices, so the number of distinct matrices is 4! / |Aut(G)|. For a graph with no symmetry, all 24 matrices are different.
In GraphRNN, node i needs to predict edges to nodes 0, 1, ..., i-1. For a graph with n=5 nodes, how many total binary decisions (edge/no-edge) must be made to generate the full adjacency matrix?
This is exactly the number of entries in the lower triangle of the adjacency matrix — one per possible edge. The sequence length grows quadratically. For n = 100, that's 4,950 decisions. GraphRNN uses a "BFS edge ordering" trick to reduce the effective sequence length by only predicting edges to recent BFS neighbors.
What is the maximum number of edges in a simple undirected graph with n = 10 nodes? (No self-loops, no multi-edges.)
This is the complete graph K10. Real molecular graphs are much sparser (each atom bonds to 1-4 others), so the edge density is typically |E| / C(n,2) < 0.1. Graph generation models exploit this sparsity — most edge predictions should be "no edge."
BFS visits nodes layer by layer from a starting node. Node i in BFS order is typically connected only to nodes within the current and previous BFS layer — a narrow window of size M. Instead of predicting i-1 edge decisions, node i only predicts M decisions (M is often 3-5 for sparse graphs). This reduces the effective sequence length from O(n²) to O(nM), making generation tractable for larger graphs.
Graph generation quality is measured by comparing DISTRIBUTIONS of graph statistics (degree distribution, clustering coefficient distribution, orbit counts) between generated and real graphs, using metrics like MMD (Maximum Mean Discrepancy). These statistics are permutation-invariant — they don't depend on which node is labeled 0, 1, 2, etc. Two isomorphic graphs contribute identically to all these metrics.
You're designing a GNN for molecular property prediction — the quintessential GNN application. Given a molecule (atoms = nodes, bonds = edges, atom types and bond types as features), predict a property like solubility or toxicity. Every design choice matters: aggregation, depth, readout, expressiveness.
Ethanol's heavy-atom skeleton: C—C—O, a simple path of 3 nodes with 2 edges. In practice, hydrogen atoms are often omitted from the graph (they can be inferred from valence rules), reducing graph size significantly. Node features encode atom type (C=6, O=8), formal charge, aromaticity, etc. Edge features encode bond type (single, double, triple, aromatic).
Order these steps to build a molecular property prediction pipeline:
Correct order: SMILES → Encode → GNN Layers → Readout → Predict
1. Parse the SMILES string into a molecular graph (atoms, bonds, features)
2. Encode atom features (atomic number, degree, charge, etc.) into d-dimensional vectors
3. Run K layers of GNN message passing to propagate information
4. Readout: pool all node embeddings into a single graph-level vector
5. MLP head maps graph vector to the predicted property (scalar)
Your molecular GNN: atom embedding (119 atom types, d=64), 3 GCN layers (64→128→128→64), readout (mean pool), MLP head (64→32→1). All layers have biases. Total parameters?
~43K parameters — tiny by transformer standards! Most GNNs for molecular property prediction are in the 10K-500K parameter range. The inductive bias from graph structure (message passing on molecular topology) compensates for the small parameter budget. Larger is not always better — GNNs are often data-limited (thousands of molecules, not billions of tokens).
The paper "How Powerful are Graph Neural Networks?" (Xu et al., 2019) proved that GIN with sum aggregation and injective update functions (MLPs) is as powerful as the 1-WL test. Mean and max aggregations lose information: mean can't distinguish {1,1,1} from {1}, and max can't distinguish {1,1,1} from {1,2,1} (if max is the same). Sum is injective on multisets: sum({1,1,1}) = 3 ≠ sum({1}) = 1. GIN exploits this with the update: h(k)v = MLP((1+ε) · h(k-1)v + ∑ h(k-1)u).
Sum pooling preserves the most information. Mean pooling loses size: a molecule with 10 carbon atoms and one with 20 carbons (with the same average embedding) would be indistinguishable. Max pooling loses multiplicity. In practice, concatenating sum + mean + max often works best — each captures different information. Set2Set (an LSTM-based readout) is even more expressive but slower.
Given a 2D array of node embeddings (N nodes, d features), implement sum, mean, and max readout. Return an object with all three.
javascript function graphReadout(embeddings) { const N = embeddings.length, d = embeddings[0].length; const sum = Array(d).fill(0); const max = Array(d).fill(-Infinity); for (let i = 0; i < N; i++) { for (let j = 0; j < d; j++) { sum[j] += embeddings[i][j]; if (embeddings[i][j] > max[j]) max[j] = embeddings[i][j]; } } const mean = sum.map(v => v / N); return { sum, mean, max }; }
3-5 layers is the sweet spot for molecular graphs. With diameter ~5-10, 3-5 layers let every atom "see" every other atom. More layers cause over-smoothing (Chapter 4) without adding information. Empirically, GNN performance on molecular benchmarks peaks at 3-5 layers and degrades beyond that. Compare to transformers: they can use 96+ layers because they have residual connections, layer normalization, and self-attention (not neighborhood averaging).
This molecular GNN training loop has a subtle bug that causes the model to learn shortcuts instead of molecular structure. Click the buggy line.
function trainStep(molecules, labels, model) { const graphs = molecules.map(mol => molToGraph(mol)); // Sort graphs by size for efficient batching graphs.sort((a, b) => a.numNodes - b.numNodes); const embeddings = model.forward(graphs); const predictions = mlpHead(sumPool(embeddings)); return mseLoss(predictions, labels); }
Line 4 is the bug. The code sorts graphs by size but does NOT sort the labels to match. After sorting, graph[i] no longer corresponds to labels[i]. The model learns to correlate molecule size with the target property (a spurious shortcut) instead of learning actual chemical structure. The fix: sort (graph, label) pairs together, or remove the sorting.
| Topic | Lesson |
|---|---|
| Transformer fundamentals | Transformer — From Absolute Zero |
| Transformer math practice | Transformer Math Workbook |
| Contrastive learning | Contrastive CLIP — From Absolute Zero |
| RL on graphs | RL Algorithms — From Absolute Zero |