PinSage: GCN for Web-Scale Recommendations

Chapter 0: The Problem

Imagine you're looking at a photo of a rustic wooden dining table. You want more like it. Pinterest has 2 billion such images — how does it find the right ones?

The obvious answer is image similarity: compare pixel features, find visually similar images. But this misses something crucial. Two images might be visually very different — one is a close-up of carved wooden legs, another is an overhead shot of a white tablecloth — yet both belong to the same "Scandinavian dining room" board. They are semantically related even though they look nothing alike.

The bipartite graph of Pinterest

Pinterest's data is naturally a bipartite graph: one set of nodes is pins (images), the other is boards (curated collections). An edge connects a pin to a board when a user saves that pin to that board. Two pins that share many common boards are semantically related — users with similar taste put them together.

This is a link prediction task: given a query pin, find pins likely to share boards with it. The graph structure captures semantic relationships that image content alone misses entirely.

The key insight: Two cake images on the same board are related even if one is chocolate and one is vanilla. Content says "different." Graph says "same." The graph wins for recommendations.

Why this is hard

Pinterest's graph has 3 billion nodes and 18 billion edges. The standard graph neural network approach loads the full graph Laplacian into memory — a matrix of 3B × 3B entries. That's physically impossible. You'd need more RAM than exists on Earth.

Prior GCN research operated on graphs with tens of thousands of nodes. Pinterest needed to scale by a factor of 100,000×. PinSage is how they did it.

Bipartite Pin-Board Graph

Click a pin to highlight its boards and co-pinned neighbors. Notice how graph neighbors reveal semantic similarity that image content (color) alone misses.

Why does the bipartite pin-board graph capture semantic similarity better than raw image features?

Because Pinterest compresses all images to the same size before comparison Because pins saved to the same board by real users reflect human semantic judgment — two visually different images on the same board share meaning Because graph algorithms are faster than convolutional networks

Chapter 1: Why GCNs for Recommendations

A Graph Convolutional Network (GCN) is a neural network that operates on graph-structured data. Each node starts with a feature vector (for a pin: an image embedding + text embedding). Then, iteratively, each node updates its embedding by aggregating the embeddings of its neighbors.

After K rounds of this aggregation, each node's embedding captures information from its K-hop neighborhood. A pin's embedding now reflects not just its own appearance, but the appearance of every pin that's ever been saved alongside it.

Why this is exactly what we need

Consider pin A (a wooden dining table) and pin B (a white tablecloth — visually very different). If they both sit on "Scandinavian Dining" boards, they share many board-neighbors. After 2 rounds of GCN aggregation, both embeddings absorb the same neighborhood signal. They end up close in embedding space. Image similarity fails; GCN succeeds.

The GCN intuition: Each layer of a GCN is like asking "what do my neighbors think?" After K layers, you've polled your neighbors' neighbors' neighbors. Similar items cluster together because their social circles overlap.

The scalability wall

Standard GCNs formulate the update as a matrix multiplication: H^(k+1) = σ(ÖH^(k)W^(k)), where Ö is the normalized adjacency matrix of the full graph. For Pinterest this is a 3B × 3B matrix. Even storing it in sparse format requires terabytes. Multiplying it is out of the question.

PinSage's solution: never construct the full adjacency matrix. Instead, sample a small local neighborhood for each node on-the-fly using random walks, and only build computation graphs for those small samples.

GCN Message Passing

Watch how a node's embedding absorbs information from its neighborhood after each GCN layer. Click "Next Layer" to advance.

Layer 0: nodes show only their own features (color = feature strength).

After K layers of GCN aggregation, what information does a node's embedding capture?

Only the node's own initial features The average of all nodes in the graph Information from the node's K-hop neighborhood — all nodes reachable within K steps

Chapter 2: The PinSage Architecture

PinSage defines one core operation called convolve (Algorithm 1 in the paper). Given a node u and its sampled neighborhood N(u), convolve produces an updated embedding for u. Stack this operation K=2 times and you have the full model.

Algorithm 1: convolve(u, N(u))

Step 1 — Neighbor features

Collect h_v for each v ∈ N(u). These are the current embeddings of u's sampled neighbors. Shape: |N(u)| × d.

↓

Step 2 — Transform

Apply a dense layer to each neighbor: Q · h_v + b. This projects neighbor features into a shared space. Shape: |N(u)| × d′.

↓

Step 3 — Aggregate

γ({transformed h_v}) — a weighted mean using importance weights (visit counts from random walks). Result: one vector of shape d′.

↓

Step 4 — Concatenate

Concatenate the aggregated neighbor signal with u's own current embedding h_u. Shape: 2d′.

↓

Step 5 — Dense + ReLU + L2 normalize

Apply W · concat + b, then ReLU, then L2-normalize. Output: z_u of shape d. L2-normalization means dot product = cosine similarity — perfect for retrieval.

Why L2 normalize at the end?

All embeddings are projected onto the unit hypersphere. This means the dot product between two embeddings equals their cosine similarity. At inference time, finding the most similar pin to a query is equivalent to finding the nearest neighbor on the hypersphere — a well-studied problem with fast approximate solutions (e.g., FAISS).

Two layers, not fifty

PinSage uses K=2 convolve layers. Why so few? Each layer reaches one hop further. Layer 1 aggregates direct neighbors; layer 2 aggregates neighbors-of-neighbors. At Pinterest's graph density, 2-hop neighborhoods already contain extremely informative context. More layers would require sampling exponentially more nodes.

Architecture Data Flow — Click Each Step

Click each step to see the tensor shapes and what's happening inside PinSage's convolve operation.

Why does PinSage L2-normalize the output embedding z_u?

To prevent gradient explosion during training So that dot product between embeddings equals cosine similarity, enabling fast nearest-neighbor retrieval To ensure all embeddings have the same magnitude as the input features

Chapter 3: Random Walk Neighborhoods

Here is the first and most fundamental innovation. A node in Pinterest's graph can have thousands of neighbors — a popular pin might be saved to 50,000 boards, each board connecting it to thousands of other pins. Using all of them is impossible. But which ones do you pick?

The random walk answer

From node u, launch T short random walks of length L. Each walk takes random steps along edges. After all walks, count how many times each node v was visited. Call this count c(v). The neighborhood N(u) is defined as the top-K most-visited nodes.

This is not arbitrary. The visit count c(v) is a Monte Carlo estimate of the Personalized PageRank score of v with respect to u — a well-understood measure of proximity in a graph that naturally down-weights high-degree "hub" nodes.

Why random walks beat uniform sampling: If node u has 10,000 neighbors, sampling 25 uniformly at random gives you a noisy, unweighted view. Random walks naturally visit nearby, well-connected nodes more often — the nodes that are genuinely important to u float to the top.

Parameters: T=10 walks of length L=50

PinSage uses T=10 random walks of length L=50 per node, and keeps the top K=25 most-visited nodes as the neighborhood. These are not magic numbers — they're set so that the walk is long enough to escape the immediate neighborhood but short enough to stay semantically relevant.

The importance weights

The visit counts serve double duty. They not only define WHO is in the neighborhood — they also define HOW MUCH each neighbor contributes during aggregation. A neighbor visited 40 times out of 500 total visits gets weight 40/500 = 0.08. This is the importance pooling used in Step 3 of Algorithm 1.

★ SHOWCASE: Interactive Random Walk Simulator

Click "Start Walk" to simulate random walks from the highlighted node. Watch visit counts accumulate. After enough walks, the top-K neighbors emerge — these become the node's effective neighborhood in PinSage.

Walks (T)10

Walk length (L)8

Top-K neighbors3

Choose parameters and click "Start Walk".

What determines the importance weight of a neighbor v in PinSage's random walk neighborhoods?

How many times v was visited across all random walks from u, normalized by total visits The cosine similarity between u's and v's initial feature vectors The number of boards shared between u and v

Chapter 4: Importance Pooling

Standard GCNs aggregate neighbor features with a uniform mean: every neighbor contributes equally. This treats the node that appears in 1 shared board the same as the node that appears in 200 shared boards. That's clearly wrong.

Weighted mean with random walk scores

Importance pooling replaces the uniform mean with a weighted mean. The weight of neighbor v is its normalized random walk visit count: α_v = c(v) / Σ_v′ c(v′). The aggregated neighborhood representation is then:

γ = Σ_{v ∈ N(u)} α_v · (Q · h_v + b)

Nodes visited frequently during random walks from u are structurally close and well-connected to u — they contribute more information. Nodes visited rarely are marginal connections — they contribute less.

Result from the paper: Importance pooling achieves a 46% improvement in hit-rate over uniform pooling on Pinterest's held-out evaluation set. This single change — switching from uniform to importance-weighted aggregation — accounts for nearly half of PinSage's total gain over the baseline.

What "structurally important" means

A neighbor v visited often by random walks from u is not just close — it's well-connected through multiple paths. If u and v share many common boards, random walks from u frequently wander into v. This is a richer signal than just "are they adjacent?" — it asks "how many routes lead from u to v?"

This is closely related to the Jaccard coefficient (overlap of neighbor sets) but computed implicitly via random walks rather than explicitly, which scales better.

Uniform vs. Importance Pooling

Same graph, same neighbors — but different pooling strategies. See how the weighted version correctly focuses on strongly-connected neighbors.

Click a mode to see how neighbor contributions differ.

In importance pooling, how is each neighbor's weight α_v computed?

As 1 divided by the degree of node u (uniform) As the number of shared boards between u and v As v's random walk visit count from u, divided by total visits across all neighbors

Chapter 5: Training — Hard Negatives & Curriculum

PinSage is trained with a max-margin loss. Given a query pin q and a positive pin p (a pin that co-occurs with q on the same board), the loss pushes q and p together while pushing q away from negative pins n:

J(z_q, z_p, z_n) = max(0, z_q · z_n − z_q · z_p + Δ)

Where Δ is a margin (set to 0.1 in the paper). The loss is zero when the positive pair is at least Δ more similar than the negative pair. The model is trained on 7.5 billion such triplets.

Easy negatives: the problem with random sampling

The simplest strategy: sample random pins as negatives. This works initially — the model quickly learns that a kitchen table pin is more similar to other table pins than to random cat photos. But after a few epochs, random negatives are trivially easy. The model already places them far apart. Gradient signal vanishes. Training stagnates.

Hard negatives: find the impostors

Hard negatives are items that are close to the query in embedding space but are NOT actually related to it. They look like positives but aren't. For example, a wooden dining table query might have a plastic patio table as a hard negative — similar shape and function, but a different semantic category for the user's board.

How do you find hard negatives? Run nearest-neighbor search on the current model's embeddings. Pins ranked 2,000–5,000 closest to the query are good hard negatives: close enough to be confusing, far enough that they're probably not actually related (rank 1–2,000 might contain true positives not in the training set).

Why ranks 2,000–5,000 and not 1–100? The very nearest neighbors might be true positives that simply weren't labeled in the training data — false negatives. Sampling from a middle band (close but not closest) avoids poisoning training with false hard negatives.

Curriculum: easy first, hard later

Curriculum training structures the difficulty progression. In early epochs: only easy (random) negatives. The model learns coarse semantic structure — tables near tables, cats near cats. In later epochs: add one hard negative per triplet. The model learns fine-grained distinctions — dining tables vs. coffee tables, formal chairs vs. casual seating.

This is exactly how humans learn: broad categories first, then fine distinctions. Training all at once on hard negatives fails — the model hasn't learned coarse structure yet and receives confusing gradient signals.

Paper result: Curriculum training with hard negatives achieves a 12% improvement in hit-rate over training with only easy negatives. Combined with importance pooling's 46% gain: these two innovations account for the majority of PinSage's total lift.

Embedding Space: Easy vs. Hard Negatives

Drag the epoch slider to watch training progress. Hard negatives appear in later epochs. See how they force the model to draw finer decision boundaries.

Training epoch0

Epoch 0: random negatives only.

Why does PinSage sample hard negatives from rank 2,000–5,000 in the nearest-neighbor list rather than rank 1–100?

The very nearest neighbors might be true positives (unlabeled co-boards) — sampling them as negatives would poison training with false labels Computing rank 1–100 is more expensive than rank 2,000–5,000 Rank 1–100 items are too different from the query to be informative

Chapter 6: Scaling — Producer-Consumer & MapReduce

With 7.5 billion training examples, naively sequential training is impossible. And once you've trained the model, you still need to generate embeddings for all 3 billion nodes. PinSage solves both problems with distinct engineering strategies.

Training: producer-consumer pipeline

Sampling neighborhoods is CPU-bound (graph traversal, random walks). Running the neural network is GPU-bound (matrix multiplications). In a naive implementation, the GPU waits idle while the CPU prepares mini-batches, and the CPU waits idle while the GPU trains.

PinSage decouples these with a producer-consumer architecture. A pool of CPU workers (producers) continuously sample neighborhoods and assemble mini-batches, storing them in a shared queue. The GPU worker (consumer) pulls from this queue as fast as it can train. GPU utilization stays near 100%.

The key insight: Neighborhood sampling for one mini-batch takes ~40ms. A GPU forward+backward pass takes ~5ms. Without decoupling, the GPU is idle 88% of the time. With producer-consumer, the CPU pre-populates the queue 8 batches ahead, keeping the GPU fed continuously.

Inference: the MapReduce trick

At inference, you need embeddings for all 3 billion nodes. You can't process them sequentially — that would take weeks. You can't load the whole graph into one machine. The solution exploits the layered structure of GCNs.

A K=2 GCN produces embeddings in two passes:

Layer 1 embeddings depend only on raw node features and 1-hop neighbors' raw features.
Layer 2 embeddings depend only on layer-1 embeddings of 2-hop neighbors.

This is a dependency chain: layer 2 depends on layer 1, which depends on raw features. PinSage exploits this with two MapReduce jobs:

MapReduce Job 1

Input: raw features for all 3B nodes. Map: for each node u, fetch neighbors' raw features. Reduce: run convolve layer 1. Output: layer-1 embeddings for all 3B nodes, saved to distributed storage.

↓

MapReduce Job 2

Input: layer-1 embeddings (from Job 1). Map: for each node u, fetch neighbors' layer-1 embeddings. Reduce: run convolve layer 2. Output: final embeddings z_u for all 3B nodes.

↓

Nearest-Neighbor Index

All 3B embeddings are indexed with an approximate nearest-neighbor structure (e.g., FAISS). At serving time, a query pin is embedded and its top-K similar pins are retrieved in milliseconds.

Why not just run inference node by node? Each node needs its neighbors' embeddings. If you process nodes in arbitrary order, you'd recompute neighbor embeddings redundantly — potentially millions of times for popular board nodes. MapReduce ensures each node's embeddings are computed exactly once and shared across all jobs that need them.

MapReduce Inference Pipeline

Click "Run Job 1" then "Run Job 2" to watch embeddings flow through the distributed inference pipeline.

Click "Run Job 1" to start.

Why does PinSage's MapReduce inference use two sequential jobs rather than one?

Two jobs are faster than one on distributed systems The first job handles pins and the second handles boards Layer 2 embeddings depend on layer 1 embeddings — you must compute all layer-1 first, store them, then compute layer-2 using those stored values to avoid redundant recomputation

Chapter 7: Results

PinSage was evaluated on Pinterest's internal recommendation datasets and deployed to production. The results are striking — not just in held-out metrics, but in live A/B tests on real user behavior.

Offline evaluation: hit-rate

The evaluation metric is hit-rate@K: given a query pin, does the ground-truth related pin appear in the top-K recommended pins? Pins co-saved to the same board are treated as positively related pairs. The held-out test set contains 100,000 such pairs.

Hit-Rate Comparison

PinSage vs. baselines. Click a bar to see what that method does differently.

Click a bar to learn about each method.

Key numbers from the paper

Method	Hit-Rate@10	vs. Best Baseline
Visual features only	Low	Baseline
PinSage, uniform pooling	+~50%	Graph helps
PinSage, importance pooling	+46% vs. uniform	Importance matters
PinSage, full (+ curriculum)	+150% vs. best baseline	2.5× better

A/B test: production deployment

PinSage was deployed to Pinterest's "Related Pins" recommendation surface — shown to users when they view a pin. In a live A/B test against the production system (which used a combination of visual and engagement features), PinSage showed significant improvement in user engagement: more saves, more clicks, longer session duration.

The system also powered "More Like This" in Pinterest Lens (visual search), home feed ranking, and email notifications — multiple production surfaces, all using the same pre-computed 3B-node embeddings.

Scale validated: 3 billion nodes, 18 billion edges, 7.5 billion training examples, multiple production deployments. PinSage was the first GCN to operate at this scale — 10,000× larger than any prior GCN application at the time of publication.

What was PinSage's improvement over the best production baseline in hit-rate on Pinterest's evaluation?

12% improvement 46% improvement 150% improvement (2.5× better)

Chapter 8: What Made It Work

PinSage's 150% improvement over baselines wasn't one big idea — it was five design decisions that each addressed a specific failure mode. Remove any one and performance degrades. This chapter dissects each contribution.

Why not just use image embeddings?

Pinterest had excellent visual embeddings from VGG networks. These capture visual style but not semantic context. Two pins on the same "mid-century modern furniture" board might be a lamp and a chair — visually dissimilar, semantically related. The graph encodes curation decisions made by real users with real taste. That's a richer signal.

Why not spectral GCNs?

Spectral GCNs (original GCN by Kipf & Welling) compute the graph Laplacian L = D - A and perform convolutions in the spectral domain. This requires storing and multiplying by L — a 3B × 3B matrix. Even in sparse format, 18B edges × 8 bytes = 144 GB, just to store the adjacency. Multiplying it in training is out of the question. PinSage's sampling approach avoids this entirely.

Why not uniform neighborhood sampling?

GraphSAGE (Hamilton et al., 2017), PinSage's closest ancestor, uses uniform random sampling of neighbors. This treats all neighbors equally regardless of structural importance. PinSage shows +46% from switching to importance-weighted sampling. The intuition: a pin saved to 500 common boards with the query is far more informative than one sharing only 1 board. Uniform sampling ignores this.

Why not easy negatives only?

With 3 billion pins, any random pair is almost certainly unrelated. Easy negatives become trivial after a few epochs — the model already places them far apart. The loss gradient from easy negatives is near zero: max(0, small - large + 0.1) = 0. No learning happens. Hard negatives (rank 2,000–5,000) are still separated by a small margin — their gradient is nonzero and informative.

Ablation Dashboard — Contribution of Each Component

Click each component to toggle it on/off and see the estimated impact on hit-rate.

Which single component accounts for the largest individual gain in PinSage's performance?

Curriculum training with hard negatives (+12%) Importance pooling with random walk visit counts (+46%) Using two GCN layers instead of one

Chapter 9: Connections

PinSage sits at the intersection of two active research areas: graph neural networks and industrial recommendation systems. Understanding where it came from and where it leads is essential for putting it in context.

Where PinSage came from

GraphSAGE (Hamilton, Ying, Leskovec, 2017) is PinSage's direct academic ancestor — in fact, Rex Ying and Jure Leskovec are co-authors on both papers. GraphSAGE introduced the idea of inductive neighborhood aggregation: sample neighbors, aggregate features, generalize to unseen nodes. PinSage is GraphSAGE scaled to production with three industrial innovations: random walk importance sampling, hard negatives, and MapReduce inference.

The original GCN (Kipf & Welling, 2016) introduced spectral graph convolutions and proved GCNs work on benchmark datasets. PinSage demonstrates they can work 100,000× larger — but only with the right engineering choices.

What PinSage led to

LightGCN (He et al., 2020) simplifies PinSage further by removing the feature transformation matrices and nonlinearities — just linear aggregation, weighted by interaction frequency. On recommendation benchmarks, LightGCN matches or beats PinSage with less computation. The insight: for collaborative filtering, the graph structure itself is the signal; neural nonlinearities add noise.

NGCF (Neural Graph Collaborative Filtering, Wang et al., 2019) extends the idea to explicit interaction modeling, adding user-item interaction matrices as features. Both are direct descendants of PinSage's framework.

The broader lesson: scale changes what works

PinSage didn't just scale GraphSAGE — it changed what was possible to do. At 3 billion nodes, the choice of neighborhood sampling method, loss function, and inference strategy is not a hyperparameter — it's the difference between a running system and one that crashes. PinSage is a case study in how industrial scale forces algorithmic innovation.

Key quote from the paper: "The scale of Pinterest's data sets a new bar for graph convolutional networks, being 10,000× larger than problems studied in prior GCN work." — Ying et al., 2018

Related Veanors lessons

Topic	Connection to PinSage
GraphSAGE (Hamilton et al., 2017)	Direct academic ancestor — inductive GCN sampling
LightGCN (He et al., 2020)	Simplification: remove nonlinearities, pure aggregation
Word2Vec / Item2Vec	Alternative: embedding co-occurrence without graph structure
Approximate Nearest Neighbors (FAISS)	How PinSage embeddings are retrieved at serving time

PinSage cheat sheet

Component	Formula / Key Number
Random walk neighborhood	T=10 walks, L=50 steps, top K=25 nodes
Importance weight	α_v = c(v) / Σ c(v′)
Convolve output	z_u = L2-norm(ReLU(W · [h_u ‖ γ(N(u))] + b))
Max-margin loss	max(0, z_q·z_n − z_q·z_p + 0.1)
Hard negative ranks	2,000 – 5,000 in current embedding space
Scale	3B nodes, 18B edges, 7.5B training examples
Gain: importance pooling	+46% hit-rate
Gain: curriculum + hard neg.	+12% hit-rate
Total gain vs. baseline	+150% hit-rate

LightGCN outperforms PinSage on some benchmarks by removing neural nonlinearities and using only linear aggregation. What does this suggest about the role of graph structure vs. neural network complexity in recommendation?

Deeper networks always outperform simpler ones given enough data For collaborative filtering, the graph interaction structure itself is the main signal — neural complexity can add noise rather than expressiveness LightGCN only works because it uses more training data than PinSage

"What I cannot create, I do not understand." — Richard Feynman. If you can implement PinSage's convolve algorithm, sample neighborhoods via random walks, and set up the max-margin loss with curriculum hard negatives, you understand PinSage.

PinSage: Graph Convolutions at Web Scale