How Pinterest built a GCN that generates embeddings for 3 billion pins across 18 billion edges — without ever loading the full graph into memory. The secret: random walks, importance pooling, and curriculum training with hard negatives.
Imagine you're looking at a photo of a rustic wooden dining table. You want more like it. Pinterest has 2 billion such images — how does it find the right ones?
The obvious answer is image similarity: compare pixel features, find visually similar images. But this misses something crucial. Two images might be visually very different — one is a close-up of carved wooden legs, another is an overhead shot of a white tablecloth — yet both belong to the same "Scandinavian dining room" board. They are semantically related even though they look nothing alike.
Pinterest's data is naturally a bipartite graph: one set of nodes is pins (images), the other is boards (curated collections). An edge connects a pin to a board when a user saves that pin to that board. Two pins that share many common boards are semantically related — users with similar taste put them together.
This is a link prediction task: given a query pin, find pins likely to share boards with it. The graph structure captures semantic relationships that image content alone misses entirely.
Pinterest's graph has 3 billion nodes and 18 billion edges. The standard graph neural network approach loads the full graph Laplacian into memory — a matrix of 3B × 3B entries. That's physically impossible. You'd need more RAM than exists on Earth.
Prior GCN research operated on graphs with tens of thousands of nodes. Pinterest needed to scale by a factor of 100,000×. PinSage is how they did it.
Click a pin to highlight its boards and co-pinned neighbors. Notice how graph neighbors reveal semantic similarity that image content (color) alone misses.
A Graph Convolutional Network (GCN) is a neural network that operates on graph-structured data. Each node starts with a feature vector (for a pin: an image embedding + text embedding). Then, iteratively, each node updates its embedding by aggregating the embeddings of its neighbors.
After K rounds of this aggregation, each node's embedding captures information from its K-hop neighborhood. A pin's embedding now reflects not just its own appearance, but the appearance of every pin that's ever been saved alongside it.
Consider pin A (a wooden dining table) and pin B (a white tablecloth — visually very different). If they both sit on "Scandinavian Dining" boards, they share many board-neighbors. After 2 rounds of GCN aggregation, both embeddings absorb the same neighborhood signal. They end up close in embedding space. Image similarity fails; GCN succeeds.
Standard GCNs formulate the update as a matrix multiplication: H(k+1) = σ(ÖH(k)W(k)), where Ö is the normalized adjacency matrix of the full graph. For Pinterest this is a 3B × 3B matrix. Even storing it in sparse format requires terabytes. Multiplying it is out of the question.
PinSage's solution: never construct the full adjacency matrix. Instead, sample a small local neighborhood for each node on-the-fly using random walks, and only build computation graphs for those small samples.
Watch how a node's embedding absorbs information from its neighborhood after each GCN layer. Click "Next Layer" to advance.
PinSage defines one core operation called convolve (Algorithm 1 in the paper). Given a node u and its sampled neighborhood N(u), convolve produces an updated embedding for u. Stack this operation K=2 times and you have the full model.
All embeddings are projected onto the unit hypersphere. This means the dot product between two embeddings equals their cosine similarity. At inference time, finding the most similar pin to a query is equivalent to finding the nearest neighbor on the hypersphere — a well-studied problem with fast approximate solutions (e.g., FAISS).
PinSage uses K=2 convolve layers. Why so few? Each layer reaches one hop further. Layer 1 aggregates direct neighbors; layer 2 aggregates neighbors-of-neighbors. At Pinterest's graph density, 2-hop neighborhoods already contain extremely informative context. More layers would require sampling exponentially more nodes.
Click each step to see the tensor shapes and what's happening inside PinSage's convolve operation.
Here is the first and most fundamental innovation. A node in Pinterest's graph can have thousands of neighbors — a popular pin might be saved to 50,000 boards, each board connecting it to thousands of other pins. Using all of them is impossible. But which ones do you pick?
From node u, launch T short random walks of length L. Each walk takes random steps along edges. After all walks, count how many times each node v was visited. Call this count c(v). The neighborhood N(u) is defined as the top-K most-visited nodes.
This is not arbitrary. The visit count c(v) is a Monte Carlo estimate of the Personalized PageRank score of v with respect to u — a well-understood measure of proximity in a graph that naturally down-weights high-degree "hub" nodes.
PinSage uses T=10 random walks of length L=50 per node, and keeps the top K=25 most-visited nodes as the neighborhood. These are not magic numbers — they're set so that the walk is long enough to escape the immediate neighborhood but short enough to stay semantically relevant.
The visit counts serve double duty. They not only define WHO is in the neighborhood — they also define HOW MUCH each neighbor contributes during aggregation. A neighbor visited 40 times out of 500 total visits gets weight 40/500 = 0.08. This is the importance pooling used in Step 3 of Algorithm 1.
Click "Start Walk" to simulate random walks from the highlighted node. Watch visit counts accumulate. After enough walks, the top-K neighbors emerge — these become the node's effective neighborhood in PinSage.
Standard GCNs aggregate neighbor features with a uniform mean: every neighbor contributes equally. This treats the node that appears in 1 shared board the same as the node that appears in 200 shared boards. That's clearly wrong.
Importance pooling replaces the uniform mean with a weighted mean. The weight of neighbor v is its normalized random walk visit count: αv = c(v) / Σv′ c(v′). The aggregated neighborhood representation is then:
Nodes visited frequently during random walks from u are structurally close and well-connected to u — they contribute more information. Nodes visited rarely are marginal connections — they contribute less.
A neighbor v visited often by random walks from u is not just close — it's well-connected through multiple paths. If u and v share many common boards, random walks from u frequently wander into v. This is a richer signal than just "are they adjacent?" — it asks "how many routes lead from u to v?"
This is closely related to the Jaccard coefficient (overlap of neighbor sets) but computed implicitly via random walks rather than explicitly, which scales better.
Same graph, same neighbors — but different pooling strategies. See how the weighted version correctly focuses on strongly-connected neighbors.
PinSage is trained with a max-margin loss. Given a query pin q and a positive pin p (a pin that co-occurs with q on the same board), the loss pushes q and p together while pushing q away from negative pins n:
Where Δ is a margin (set to 0.1 in the paper). The loss is zero when the positive pair is at least Δ more similar than the negative pair. The model is trained on 7.5 billion such triplets.
The simplest strategy: sample random pins as negatives. This works initially — the model quickly learns that a kitchen table pin is more similar to other table pins than to random cat photos. But after a few epochs, random negatives are trivially easy. The model already places them far apart. Gradient signal vanishes. Training stagnates.
Hard negatives are items that are close to the query in embedding space but are NOT actually related to it. They look like positives but aren't. For example, a wooden dining table query might have a plastic patio table as a hard negative — similar shape and function, but a different semantic category for the user's board.
How do you find hard negatives? Run nearest-neighbor search on the current model's embeddings. Pins ranked 2,000–5,000 closest to the query are good hard negatives: close enough to be confusing, far enough that they're probably not actually related (rank 1–2,000 might contain true positives not in the training set).
Curriculum training structures the difficulty progression. In early epochs: only easy (random) negatives. The model learns coarse semantic structure — tables near tables, cats near cats. In later epochs: add one hard negative per triplet. The model learns fine-grained distinctions — dining tables vs. coffee tables, formal chairs vs. casual seating.
This is exactly how humans learn: broad categories first, then fine distinctions. Training all at once on hard negatives fails — the model hasn't learned coarse structure yet and receives confusing gradient signals.
Drag the epoch slider to watch training progress. Hard negatives appear in later epochs. See how they force the model to draw finer decision boundaries.
With 7.5 billion training examples, naively sequential training is impossible. And once you've trained the model, you still need to generate embeddings for all 3 billion nodes. PinSage solves both problems with distinct engineering strategies.
Sampling neighborhoods is CPU-bound (graph traversal, random walks). Running the neural network is GPU-bound (matrix multiplications). In a naive implementation, the GPU waits idle while the CPU prepares mini-batches, and the CPU waits idle while the GPU trains.
PinSage decouples these with a producer-consumer architecture. A pool of CPU workers (producers) continuously sample neighborhoods and assemble mini-batches, storing them in a shared queue. The GPU worker (consumer) pulls from this queue as fast as it can train. GPU utilization stays near 100%.
At inference, you need embeddings for all 3 billion nodes. You can't process them sequentially — that would take weeks. You can't load the whole graph into one machine. The solution exploits the layered structure of GCNs.
A K=2 GCN produces embeddings in two passes:
This is a dependency chain: layer 2 depends on layer 1, which depends on raw features. PinSage exploits this with two MapReduce jobs:
Click "Run Job 1" then "Run Job 2" to watch embeddings flow through the distributed inference pipeline.
PinSage was evaluated on Pinterest's internal recommendation datasets and deployed to production. The results are striking — not just in held-out metrics, but in live A/B tests on real user behavior.
The evaluation metric is hit-rate@K: given a query pin, does the ground-truth related pin appear in the top-K recommended pins? Pins co-saved to the same board are treated as positively related pairs. The held-out test set contains 100,000 such pairs.
PinSage vs. baselines. Click a bar to see what that method does differently.
| Method | Hit-Rate@10 | vs. Best Baseline |
|---|---|---|
| Visual features only | Low | Baseline |
| PinSage, uniform pooling | +~50% | Graph helps |
| PinSage, importance pooling | +46% vs. uniform | Importance matters |
| PinSage, full (+ curriculum) | +150% vs. best baseline | 2.5× better |
PinSage was deployed to Pinterest's "Related Pins" recommendation surface — shown to users when they view a pin. In a live A/B test against the production system (which used a combination of visual and engagement features), PinSage showed significant improvement in user engagement: more saves, more clicks, longer session duration.
The system also powered "More Like This" in Pinterest Lens (visual search), home feed ranking, and email notifications — multiple production surfaces, all using the same pre-computed 3B-node embeddings.
PinSage's 150% improvement over baselines wasn't one big idea — it was five design decisions that each addressed a specific failure mode. Remove any one and performance degrades. This chapter dissects each contribution.
Pinterest had excellent visual embeddings from VGG networks. These capture visual style but not semantic context. Two pins on the same "mid-century modern furniture" board might be a lamp and a chair — visually dissimilar, semantically related. The graph encodes curation decisions made by real users with real taste. That's a richer signal.
Spectral GCNs (original GCN by Kipf & Welling) compute the graph Laplacian L = D - A and perform convolutions in the spectral domain. This requires storing and multiplying by L — a 3B × 3B matrix. Even in sparse format, 18B edges × 8 bytes = 144 GB, just to store the adjacency. Multiplying it in training is out of the question. PinSage's sampling approach avoids this entirely.
GraphSAGE (Hamilton et al., 2017), PinSage's closest ancestor, uses uniform random sampling of neighbors. This treats all neighbors equally regardless of structural importance. PinSage shows +46% from switching to importance-weighted sampling. The intuition: a pin saved to 500 common boards with the query is far more informative than one sharing only 1 board. Uniform sampling ignores this.
With 3 billion pins, any random pair is almost certainly unrelated. Easy negatives become trivial after a few epochs — the model already places them far apart. The loss gradient from easy negatives is near zero: max(0, small - large + 0.1) = 0. No learning happens. Hard negatives (rank 2,000–5,000) are still separated by a small margin — their gradient is nonzero and informative.
Click each component to toggle it on/off and see the estimated impact on hit-rate.
PinSage sits at the intersection of two active research areas: graph neural networks and industrial recommendation systems. Understanding where it came from and where it leads is essential for putting it in context.
GraphSAGE (Hamilton, Ying, Leskovec, 2017) is PinSage's direct academic ancestor — in fact, Rex Ying and Jure Leskovec are co-authors on both papers. GraphSAGE introduced the idea of inductive neighborhood aggregation: sample neighbors, aggregate features, generalize to unseen nodes. PinSage is GraphSAGE scaled to production with three industrial innovations: random walk importance sampling, hard negatives, and MapReduce inference.
The original GCN (Kipf & Welling, 2016) introduced spectral graph convolutions and proved GCNs work on benchmark datasets. PinSage demonstrates they can work 100,000× larger — but only with the right engineering choices.
LightGCN (He et al., 2020) simplifies PinSage further by removing the feature transformation matrices and nonlinearities — just linear aggregation, weighted by interaction frequency. On recommendation benchmarks, LightGCN matches or beats PinSage with less computation. The insight: for collaborative filtering, the graph structure itself is the signal; neural nonlinearities add noise.
NGCF (Neural Graph Collaborative Filtering, Wang et al., 2019) extends the idea to explicit interaction modeling, adding user-item interaction matrices as features. Both are direct descendants of PinSage's framework.
PinSage didn't just scale GraphSAGE — it changed what was possible to do. At 3 billion nodes, the choice of neighborhood sampling method, loss function, and inference strategy is not a hyperparameter — it's the difference between a running system and one that crashes. PinSage is a case study in how industrial scale forces algorithmic innovation.
| Topic | Connection to PinSage |
|---|---|
| GraphSAGE (Hamilton et al., 2017) | Direct academic ancestor — inductive GCN sampling |
| LightGCN (He et al., 2020) | Simplification: remove nonlinearities, pure aggregation |
| Word2Vec / Item2Vec | Alternative: embedding co-occurrence without graph structure |
| Approximate Nearest Neighbors (FAISS) | How PinSage embeddings are retrieved at serving time |
| Component | Formula / Key Number |
|---|---|
| Random walk neighborhood | T=10 walks, L=50 steps, top K=25 nodes |
| Importance weight | αv = c(v) / Σ c(v′) |
| Convolve output | zu = L2-norm(ReLU(W · [hu ‖ γ(N(u))] + b)) |
| Max-margin loss | max(0, zq·zn − zq·zp + 0.1) |
| Hard negative ranks | 2,000 – 5,000 in current embedding space |
| Scale | 3B nodes, 18B edges, 7.5B training examples |
| Gain: importance pooling | +46% hit-rate |
| Gain: curriculum + hard neg. | +12% hit-rate |
| Total gain vs. baseline | +150% hit-rate |
"What I cannot create, I do not understand." — Richard Feynman. If you can implement PinSage's convolve algorithm, sample neighborhoods via random walks, and set up the max-margin loss with curriculum hard negatives, you understand PinSage.