GraphSAGE — Veanors

Chapter 0: The Problem

Imagine Pinterest's recommendation system. Every day, millions of new images get pinned. Each new image is a brand-new node in the graph — no prior history, no pre-computed embedding. How do you recommend content involving a node that didn't exist when you trained your model?

Methods like DeepWalk and node2vec are transductive: they run random walks over the entire fixed graph, factorize the resulting similarity matrix, and produce one embedding vector per node. The embeddings are just a big lookup table. If you add a new node tomorrow, you have to retrain from scratch on the full graph. For a network that grows by millions of nodes per day, this is catastrophically expensive.

The transductive approach also makes a deeper mistake: it learns where nodes are in the graph but not what they look like. It cannot generalize to unseen nodes because the embedding table has no entry for them. What we need is not an embedding table but an embedding function — something that takes a node's features and neighborhood and produces an embedding on demand.

The key shift: Instead of asking "what is the embedding of node 42?", GraphSAGE asks "given node 42's features and its neighbors' features, what should the embedding be?" The answer is a learned function. The function generalizes; the lookup table doesn't.

Transductive vs. Inductive

Click "Add node" to see what happens when a new node arrives in each paradigm.

Why can't transductive methods like DeepWalk handle new nodes without retraining?

They store one embedding vector per known node in a fixed lookup table, so they have no entry for nodes that didn't exist during training They require GPU memory proportional to the number of edges, so large graphs don't fit They use random walks which are too slow to run on new nodes

Chapter 1: Sample & Aggregate

GraphSAGE's core algorithm has two steps for every node you want to embed: sample a fixed-size neighborhood, then aggregate the neighbors' features into a single vector. Repeat this for each layer (each hop in the graph).

Why sample rather than use all neighbors? Real nodes have wildly varying degree — a celebrity Twitter account might have 10 million followers. Using all neighbors for every gradient step would make batch sizes unpredictable and computation uncontrollable. GraphSAGE fixes the neighborhood size by sampling S₁ neighbors at hop 1 and S₂ at hop 2 (typically S₁=25, S₂=10). This bounds computation regardless of node degree.

h^(k)_N(v) = AGG_k( { h^(k-1)_u, ∀u ∈ N(v) } )

h^(k)_v = σ( W^k · CONCAT( h^(k-1)_v, h^(k)_N(v) ) )

Read these two equations as: (1) aggregate the previous-layer embeddings of your sampled neighbors, then (2) concatenate your own previous-layer embedding with that neighborhood summary, project through a learned weight matrix W, apply nonlinearity. The result is your new embedding at this layer.

After K layers of this, the embedding of node v encodes information from its K-hop neighborhood. With K=2, S₁=25, S₂=10: each node aggregates up to 25 one-hop neighbors, each of whom already aggregated up to 10 of their own neighbors. The receptive field is bounded at 25×10=250 nodes — predictable, efficient, parallelizable.

Concat, not add: GraphSAGE concatenates the node's own embedding with the neighborhood summary before projecting. This lets the model learn separate weights for "what I look like" vs "what my neighborhood looks like" — a crucial distinction. Simply averaging them together would lose this structure.

K-Hop Neighborhood Sampling

Watch how information flows inward from sampled neighbors across two hops.

K (hops) 1

Why does GraphSAGE sample a fixed number of neighbors rather than using all neighbors?

To bound computation per node regardless of degree — nodes with millions of neighbors would otherwise make batches unpredictably expensive Because including all neighbors causes gradient vanishing in the aggregation step To enforce that the embedding depends only on the node's own features, not its context

Chapter 2: Aggregator Options

The AGG function is where GraphSAGE offers design choices. The paper proposes three aggregators, each with different expressiveness and computational cost. The aggregator must be permutation invariant — it should not matter what order you list the neighbors, since graphs have no natural ordering.

Mean aggregator: Take the element-wise mean of all neighbors' embeddings. This is essentially what spectral GCNs compute. It's fast, parameter-free within the aggregation step, and equivalent to message-passing with averaging. But it treats all neighbors equally — a popular neighbor and an obscure one get the same weight.

AGG_mean = mean( { h_u, ∀u ∈ N(v) } )

Max-pool aggregator: Apply a learned MLP to each neighbor's embedding, then take the element-wise maximum across all transformed neighbors. The max picks up the "most extreme" feature in each dimension — whether any neighbor has a high activation for a given learned feature.

AGG_pool = max( { σ( W_pool h_u + b ), ∀u ∈ N(v) } )

LSTM aggregator: Pass the neighbors through an LSTM in random order (since LSTMs are order-sensitive, random permutation approximates permutation invariance). LSTMs have the most parameters and capture complex interactions, but they're the slowest and their order-sensitivity is a theoretical weakness.

Which is best? The paper finds max-pool and LSTM aggregators generally outperform mean on benchmark datasets. The max-pool aggregator in particular is competitive with LSTM at lower cost. But "best" depends on your graph — for homogeneous graphs where all neighbors are equally informative, mean works well.

Aggregator Showcase — Compare on 5 Neighbors

All three aggregators receive the same 5 neighbor embeddings (dim=3). Watch how they combine them differently. Click "Randomize" for new neighbor features.

Why must a GraphSAGE aggregator be permutation invariant?

Graphs have no natural ordering of neighbors, so the embedding must not depend on the arbitrary order in which we list them Permutation invariance makes the aggregator differentiable so gradients can flow It allows the model to run on directed and undirected graphs without modification

Chapter 3: Mini-batch Training

Full-graph training means materializing embeddings for every node at every layer simultaneously. For graphs with millions of nodes, this doesn't fit in GPU memory. GraphSAGE enables mini-batch training through a process called neighborhood expansion.

To compute the embedding of a batch of target nodes, you need their 1-hop sampled neighbors. To compute those neighbors' embeddings, you need their 1-hop neighbors — which are the 2-hop neighbors of the targets. So you recursively expand the computation graph outward before passing forward through the layers.

Batch of target nodes

e.g., 512 nodes to embed

↓ sample S₁=25 neighbors each

1-hop frontier

up to 512×25 = 12,800 unique nodes

↓ sample S₂=10 neighbors each

2-hop frontier

up to 128,000 unique nodes (in practice much fewer due to overlap)

↓ forward pass (bottom-up)

Target embeddings

512 embedding vectors, ready for loss computation

The forward pass then proceeds bottom-up: compute 2-hop node embeddings first (they need no further lookups), use those to compute 1-hop embeddings, use those to compute the target embeddings. This is the same computation graph as full-batch training, just restricted to the sampled neighborhoods.

Why overlap matters: Two neighboring nodes often share neighbors. When computing the 2-hop frontier, you deduplicate — a node that appears as a neighbor of multiple 1-hop nodes is only embedded once and shared. This overlap dramatically reduces the actual frontier size from the theoretical maximum.

python
def sample_computation_graph(seeds, G, S1, S2):
    # seeds: target nodes for this mini-batch
    hop1 = []
    for v in seeds:
        nbrs = random.sample(list(G.neighbors(v)), min(S1, G.degree(v)))
        hop1.extend(nbrs)
    hop1 = list(set(hop1))  # deduplicate

    hop2 = []
    for v in hop1:
        nbrs = random.sample(list(G.neighbors(v)), min(S2, G.degree(v)))
        hop2.extend(nbrs)
    hop2 = list(set(hop2))

    return seeds, hop1, hop2  # compute bottom-up: hop2 → hop1 → seeds

In mini-batch GraphSAGE with K=2, why do you expand the neighborhood BEFORE the forward pass?

To identify all nodes whose embeddings are needed to compute the target embeddings — you must look up their features before computing anything To sort the neighbors in ascending degree order for efficient CUDA kernels To run the aggregation in parallel across all layers simultaneously

Chapter 4: Inductive Generalization

Here is the key test of inductiveness: after training, take the entire trained model (its weight matrices W¹, W², and aggregator parameters) to a completely new graph — one with different nodes, different structure, different scale. Can it generate embeddings for those new nodes?

Yes. The weights encode a function: "given any node's features and its sampled neighborhood's features, produce an embedding." The function doesn't care which specific nodes are in the neighborhood — it cares what their features look like. So a node in the new graph with similar structural role and similar features to training-graph nodes will receive a similar embedding.

The analogy: Think of how you recognize a new city you've never visited. You haven't memorized that city's map — but you've learned general concepts: this type of signage means a hospital, these building shapes suggest residential areas, that density suggests downtown. You generalize from principles, not from memorization. GraphSAGE's parameters are the principles.

The paper tests this directly on citation networks: train on a subgraph of papers published before 2013, then embed papers published in 2013 (never seen during training) and predict their topic labels. GraphSAGE achieves 77.4% micro-F1. DeepWalk can't do this at all — it can't embed new nodes without retraining.

Inductive Embedding Demo

Training nodes are shown in blue/teal. New test nodes (never seen during training) arrive in orange. The same model weights apply to both.

What enables GraphSAGE to generalize to completely new graphs after training?

Its learned weight matrices define a function over node features and neighborhoods — a function generalizes, unlike a per-node embedding lookup table It uses attention weights that adapt to each new node during inference It fine-tunes a small adapter layer on the new graph before generating embeddings

Chapter 5: Results

The paper evaluates GraphSAGE on three diverse benchmarks chosen specifically to test inductiveness — all tests use nodes not seen during training.

Dataset	Task	DeepWalk	GraphSAGE-mean	GraphSAGE-LSTM	GraphSAGE-pool
Citation (Cora+)	Node classification	Can't generalize	77.4% F1	78.6% F1	78.2% F1
Reddit	Post community	Can't generalize	89.7% F1	90.7% F1	91.2% F1
Protein PPI	Gene ontology roles	Can't generalize	61.2% F1	61.7% F1	63.6% F1

Reddit is the most impressive benchmark: 232,965 posts, 11.6 million edges. GraphSAGE trains on posts from the first 3 weeks, then embeds posts from the 4th week — unseen during training. Despite the scale, pool and LSTM aggregators both exceed 90% micro-F1 on 50-class classification.

Compared to DeepWalk: The paper isn't fair to DeepWalk since DeepWalk cannot handle the inductive setting at all — so it must be retrained on the full graph including test nodes, which is cheating in its favor. Even so, GraphSAGE-LSTM beats a fully-retrained DeepWalk by 12+ F1 points on Reddit.

The protein interaction (PPI) dataset has 24 graphs — the model trains on 20 graphs and tests on 2 completely unseen graphs. This is the purest inductive test: not just new nodes, but entirely new graph structures. GraphSAGE-pool reaches 63.6% micro-F1, vs supervised features at ~60%, showing the GNN aggregation genuinely captures structural information useful beyond a single graph.

On the PPI dataset, why is the inductive setting especially challenging compared to citation networks?

The test graphs are entirely new graphs — new nodes AND new graph structure — not just new nodes in the same graph PPI has more nodes than citation networks, making embedding harder The labels in PPI are continuous rather than discrete, requiring regression

Chapter 6: GraphSAGE → PinSage

GraphSAGE was the academic paper. PinSage (published one year later, also co-authored by Ying and Leskovec) is what happens when you deploy GraphSAGE on Pinterest's 3-billion-node graph with 18 billion edges. It's the same core idea refined for industrial scale.

Pinterest's content graph has two node types: pins (images) and boards (collections). An edge means "this pin appears on this board." The task: given a pin you're looking at, recommend visually and semantically similar pins. This requires embedding all 2 billion pins in a space where nearby vectors = similar content.

The billion-node challenge: Random neighbor sampling from disk requires I/O proportional to the frontier size. At Pinterest's scale, even uniform sampling causes too many cache misses. PinSage uses importance-based sampling: run short random walks from each node and sample neighbors proportional to visit frequency. Frequently-visited neighbors contribute more to the embedding — this captures structural importance, not just adjacency.

Three key PinSage innovations beyond GraphSAGE:

Importance-weighted aggregation: Weight each sampled neighbor's contribution by its random-walk visit count. Frequently co-visited nodes (implicitly similar) receive higher weight.
Curriculum training: Start with easy negative examples (random pins), progressively add harder negatives (pins from the same board but ranked low by current model). This prevents the model from collapsing to trivial solutions.
Map-reduce inference: Generate embeddings for all 3B pins in distributed batches, write to a key-value store, then run approximate nearest neighbor search (FAISS) for retrieval. Inference runs offline; serving is just a lookup.

Property	GraphSAGE	PinSage
Graph size	~250K nodes	3 billion nodes
Neighbor sampling	Uniform random	Importance (random walk)
Negatives	Random	Curriculum (easy → hard)
Inference	Online	Offline + ANN index
Deployment	Academic	Production at Pinterest

What does importance-based sampling in PinSage use to weight neighbors instead of uniform sampling?

Visit frequency from short random walks — neighbors that a random walk from the source visits more often are more important and get higher weight Node degree — higher-degree neighbors are sampled and weighted more heavily Edge timestamps — more recent connections are weighted higher than older ones

Chapter 7: Connections

GraphSAGE sits at the intersection of several lines of work and opened several more. Understanding its position clarifies both what it contributed and what it left unsolved.

Method	Key idea	Limitation vs GraphSAGE
DeepWalk / node2vec	Random walk + Word2Vec	Transductive — can't embed new nodes
Spectral GCN (Kipf & Welling)	Graph Laplacian convolution	Transductive — tied to fixed adjacency matrix
GraphSAGE (this paper)	Sample neighbors + aggregate	Fixed uniform sampling, no attention
GAT (Graph Attention)	Attention-weighted aggregation	Full neighborhood (no sampling)
PinSage	GraphSAGE + importance sampling	Domain-specific (bipartite rec)

The Graph Attention Network (GAT) addresses GraphSAGE's equal-weighting limitation: instead of mean/pool, it learns attention scores between every node and each of its neighbors, dynamically weighting each neighbor's contribution. The tradeoff: GAT uses all neighbors (no sampling), making it less scalable but more expressive.

ClusterGCN (Chiang et al., 2019) takes a different approach to scalability: partition the graph into dense clusters, then sample entire clusters as mini-batches. Nodes within a cluster have high connectivity, so the neighborhood expansion is mostly contained within the batch. This dramatically reduces the cross-batch edges that cause neighbor-explosion in GraphSAGE's frontier.

What GraphSAGE actually solved: The inductive learning problem for graphs. Before this paper, GNNs were mostly academic tools for small, fixed graphs. GraphSAGE showed that the same parameters can generalize across graphs — making GNNs practical for production recommendation systems, new-node classification, and evolving networks. PinSage's deployment was the proof.

Related lessons

LINE — Large-scale network embedding
LightGCN — Simplified GCN for recommendation
NGCF — Neural graph collaborative filtering

Key papers

Hamilton et al., NeurIPS 2017 (GraphSAGE)
Ying et al., KDD 2018 (PinSage)
Velickovic et al., ICLR 2018 (GAT)
Kipf & Welling, ICLR 2017 (Spectral GCN)

"The key insight is to learn a function that aggregates a node's neighborhood rather than training a distinct embedding vector for each node."
— Hamilton, Ying, Leskovec (2017)

GraphSAGE: InductiveGraph Learning