Most graph methods memorize embeddings for known nodes. GraphSAGE learns a function that generates embeddings — meaning it works on nodes it has never seen before, sampling neighborhoods and aggregating features rather than storing per-node lookup tables.
Imagine Pinterest's recommendation system. Every day, millions of new images get pinned. Each new image is a brand-new node in the graph — no prior history, no pre-computed embedding. How do you recommend content involving a node that didn't exist when you trained your model?
Methods like DeepWalk and node2vec are transductive: they run random walks over the entire fixed graph, factorize the resulting similarity matrix, and produce one embedding vector per node. The embeddings are just a big lookup table. If you add a new node tomorrow, you have to retrain from scratch on the full graph. For a network that grows by millions of nodes per day, this is catastrophically expensive.
The transductive approach also makes a deeper mistake: it learns where nodes are in the graph but not what they look like. It cannot generalize to unseen nodes because the embedding table has no entry for them. What we need is not an embedding table but an embedding function — something that takes a node's features and neighborhood and produces an embedding on demand.
Click "Add node" to see what happens when a new node arrives in each paradigm.
GraphSAGE's core algorithm has two steps for every node you want to embed: sample a fixed-size neighborhood, then aggregate the neighbors' features into a single vector. Repeat this for each layer (each hop in the graph).
Why sample rather than use all neighbors? Real nodes have wildly varying degree — a celebrity Twitter account might have 10 million followers. Using all neighbors for every gradient step would make batch sizes unpredictable and computation uncontrollable. GraphSAGE fixes the neighborhood size by sampling S₁ neighbors at hop 1 and S₂ at hop 2 (typically S₁=25, S₂=10). This bounds computation regardless of node degree.
Read these two equations as: (1) aggregate the previous-layer embeddings of your sampled neighbors, then (2) concatenate your own previous-layer embedding with that neighborhood summary, project through a learned weight matrix W, apply nonlinearity. The result is your new embedding at this layer.
After K layers of this, the embedding of node v encodes information from its K-hop neighborhood. With K=2, S₁=25, S₂=10: each node aggregates up to 25 one-hop neighbors, each of whom already aggregated up to 10 of their own neighbors. The receptive field is bounded at 25×10=250 nodes — predictable, efficient, parallelizable.
Watch how information flows inward from sampled neighbors across two hops.
The AGG function is where GraphSAGE offers design choices. The paper proposes three aggregators, each with different expressiveness and computational cost. The aggregator must be permutation invariant — it should not matter what order you list the neighbors, since graphs have no natural ordering.
Mean aggregator: Take the element-wise mean of all neighbors' embeddings. This is essentially what spectral GCNs compute. It's fast, parameter-free within the aggregation step, and equivalent to message-passing with averaging. But it treats all neighbors equally — a popular neighbor and an obscure one get the same weight.
Max-pool aggregator: Apply a learned MLP to each neighbor's embedding, then take the element-wise maximum across all transformed neighbors. The max picks up the "most extreme" feature in each dimension — whether any neighbor has a high activation for a given learned feature.
LSTM aggregator: Pass the neighbors through an LSTM in random order (since LSTMs are order-sensitive, random permutation approximates permutation invariance). LSTMs have the most parameters and capture complex interactions, but they're the slowest and their order-sensitivity is a theoretical weakness.
All three aggregators receive the same 5 neighbor embeddings (dim=3). Watch how they combine them differently. Click "Randomize" for new neighbor features.
Full-graph training means materializing embeddings for every node at every layer simultaneously. For graphs with millions of nodes, this doesn't fit in GPU memory. GraphSAGE enables mini-batch training through a process called neighborhood expansion.
To compute the embedding of a batch of target nodes, you need their 1-hop sampled neighbors. To compute those neighbors' embeddings, you need their 1-hop neighbors — which are the 2-hop neighbors of the targets. So you recursively expand the computation graph outward before passing forward through the layers.
The forward pass then proceeds bottom-up: compute 2-hop node embeddings first (they need no further lookups), use those to compute 1-hop embeddings, use those to compute the target embeddings. This is the same computation graph as full-batch training, just restricted to the sampled neighborhoods.
python def sample_computation_graph(seeds, G, S1, S2): # seeds: target nodes for this mini-batch hop1 = [] for v in seeds: nbrs = random.sample(list(G.neighbors(v)), min(S1, G.degree(v))) hop1.extend(nbrs) hop1 = list(set(hop1)) # deduplicate hop2 = [] for v in hop1: nbrs = random.sample(list(G.neighbors(v)), min(S2, G.degree(v))) hop2.extend(nbrs) hop2 = list(set(hop2)) return seeds, hop1, hop2 # compute bottom-up: hop2 → hop1 → seeds
Here is the key test of inductiveness: after training, take the entire trained model (its weight matrices W¹, W², and aggregator parameters) to a completely new graph — one with different nodes, different structure, different scale. Can it generate embeddings for those new nodes?
Yes. The weights encode a function: "given any node's features and its sampled neighborhood's features, produce an embedding." The function doesn't care which specific nodes are in the neighborhood — it cares what their features look like. So a node in the new graph with similar structural role and similar features to training-graph nodes will receive a similar embedding.
The paper tests this directly on citation networks: train on a subgraph of papers published before 2013, then embed papers published in 2013 (never seen during training) and predict their topic labels. GraphSAGE achieves 77.4% micro-F1. DeepWalk can't do this at all — it can't embed new nodes without retraining.
Training nodes are shown in blue/teal. New test nodes (never seen during training) arrive in orange. The same model weights apply to both.
The paper evaluates GraphSAGE on three diverse benchmarks chosen specifically to test inductiveness — all tests use nodes not seen during training.
| Dataset | Task | DeepWalk | GraphSAGE-mean | GraphSAGE-LSTM | GraphSAGE-pool |
|---|---|---|---|---|---|
| Citation (Cora+) | Node classification | Can't generalize | 77.4% F1 | 78.6% F1 | 78.2% F1 |
| Post community | Can't generalize | 89.7% F1 | 90.7% F1 | 91.2% F1 | |
| Protein PPI | Gene ontology roles | Can't generalize | 61.2% F1 | 61.7% F1 | 63.6% F1 |
Reddit is the most impressive benchmark: 232,965 posts, 11.6 million edges. GraphSAGE trains on posts from the first 3 weeks, then embeds posts from the 4th week — unseen during training. Despite the scale, pool and LSTM aggregators both exceed 90% micro-F1 on 50-class classification.
The protein interaction (PPI) dataset has 24 graphs — the model trains on 20 graphs and tests on 2 completely unseen graphs. This is the purest inductive test: not just new nodes, but entirely new graph structures. GraphSAGE-pool reaches 63.6% micro-F1, vs supervised features at ~60%, showing the GNN aggregation genuinely captures structural information useful beyond a single graph.
GraphSAGE was the academic paper. PinSage (published one year later, also co-authored by Ying and Leskovec) is what happens when you deploy GraphSAGE on Pinterest's 3-billion-node graph with 18 billion edges. It's the same core idea refined for industrial scale.
Pinterest's content graph has two node types: pins (images) and boards (collections). An edge means "this pin appears on this board." The task: given a pin you're looking at, recommend visually and semantically similar pins. This requires embedding all 2 billion pins in a space where nearby vectors = similar content.
Three key PinSage innovations beyond GraphSAGE:
| Property | GraphSAGE | PinSage |
|---|---|---|
| Graph size | ~250K nodes | 3 billion nodes |
| Neighbor sampling | Uniform random | Importance (random walk) |
| Negatives | Random | Curriculum (easy → hard) |
| Inference | Online | Offline + ANN index |
| Deployment | Academic | Production at Pinterest |
GraphSAGE sits at the intersection of several lines of work and opened several more. Understanding its position clarifies both what it contributed and what it left unsolved.
| Method | Key idea | Limitation vs GraphSAGE |
|---|---|---|
| DeepWalk / node2vec | Random walk + Word2Vec | Transductive — can't embed new nodes |
| Spectral GCN (Kipf & Welling) | Graph Laplacian convolution | Transductive — tied to fixed adjacency matrix |
| GraphSAGE (this paper) | Sample neighbors + aggregate | Fixed uniform sampling, no attention |
| GAT (Graph Attention) | Attention-weighted aggregation | Full neighborhood (no sampling) |
| PinSage | GraphSAGE + importance sampling | Domain-specific (bipartite rec) |
The Graph Attention Network (GAT) addresses GraphSAGE's equal-weighting limitation: instead of mean/pool, it learns attention scores between every node and each of its neighbors, dynamically weighting each neighbor's contribution. The tradeoff: GAT uses all neighbors (no sampling), making it less scalable but more expressive.
ClusterGCN (Chiang et al., 2019) takes a different approach to scalability: partition the graph into dense clusters, then sample entire clusters as mini-batches. Nodes within a cluster have high connectivity, so the neighborhood expansion is mostly contained within the batch. This dramatically reduces the cross-batch edges that cause neighbor-explosion in GraphSAGE's frontier.
Related lessons
Key papers
"The key insight is to learn a function that aggregates a node's neighborhood rather than training a distinct embedding vector for each node."
— Hamilton, Ying, Leskovec (2017)