CS224W Lecture 11

GNNs for Recommender Systems

You have 100 million users and 10 million items. Most pairs have never interacted. Yet you must predict which items each user wants next — from the graph of who clicked what.

Prerequisites: Basic GNN intuition + what a bipartite graph is. That's it.
10
Chapters
5+
Simulations
0
Assumed Knowledge

Chapter 0: The RecSys Problem

You open Netflix. Out of 15,000 titles, they show you 20. Every choice they don't show you is a title you might have watched, but didn't get the chance to. Every choice they do show is a bet they're placing on your taste. How do they make that bet with any confidence?

This is the recommender system problem: given a history of user-item interactions, predict which items a user will want next. The interactions are sparse — most user-item pairs have no data at all. But they form a structure that reveals everything: a graph.

The Bipartite Interaction Graph

Every interaction (user clicked item, user rated movie, user bought product) is an edge in a bipartite graph. "Bipartite" means two kinds of nodes: users on one side, items on the other. Edges only cross between sides — users never connect to users, items never connect to items directly.

Recommending = link prediction on a bipartite graph. If user u and item v have no edge, should they? That's the question. A GNN that learns good embeddings for both sides can answer it by measuring how similar the embeddings are.

This reframing is the core insight of this lecture. You're not building some custom recommendation algorithm from scratch — you're doing link prediction, a task GNNs already know how to do. The bipartite structure is the key constraint that shapes everything else.

Bipartite Interaction Graph

5 users (left, warm) and 8 items (right, teal). Solid edges = observed interactions. Dashed edges = predicted future interactions. Click a user to see their neighborhood.

Click any user node to explore their interactions.

Why Is This Hard?

A naïve approach: for each user, find items they haven't seen and rank by popularity. Problem: this ignores that different users have radically different tastes. It recommends the same blockbusters to everyone.

A better approach: find users with similar taste and recommend what they liked. This is collaborative filtering — "collaborate" with similar users to filter items. The challenge is defining "similar taste" at scale, across millions of users, without computing all pairwise comparisons.

The graph gives you similarity for free. Two users are similar if they have many common neighbors (items they both interacted with). Items are similar if many common users liked both. Message passing propagates this structural similarity automatically.
In a bipartite user-item graph, what does it mean for two users to have high "structural similarity"?

Chapter 1: Evaluation — Recall@K

Before we build any model, we need to know how to measure success. This is subtler than it looks. In classification, you predict a label and check if it's right. In recommendation, you predict a ranked list of items — and the user can only evaluate a few of them.

The Candidate Generation + Ranking Pipeline

Production recommender systems use a two-stage pipeline. Stage 1 is candidate generation: from 10 million items, retrieve ~1,000 plausible candidates quickly. Stage 2 is ranking: score those 1,000 candidates more carefully and pick the top K to show the user.

All Items (~10M)
Full item catalog
↓ fast retrieval (ANN search, embedding similarity)
Candidates (~1,000)
Plausible items for this user
↓ expensive scoring (GNN, neural ranker)
Top-K Results (~20)
What the user actually sees

GNNs are typically used in Stage 2 — they're expressive but expensive. Stage 1 uses simpler methods like approximate nearest neighbor (ANN) search over learned embeddings.

Recall@K — The Right Metric

How do we evaluate? We hold out some of a user's interactions as a test set. The model recommends K items. Recall@K measures: of all items the user actually interacted with in the test set, what fraction appeared in the top-K recommendations?

Recall@K = |{recommended items} ∩ {relevant items}| / |{relevant items}|

If a user liked 5 items in the test set and your top-10 recommendations captured 3 of them, Recall@10 = 3/5 = 0.6. Higher is better. Note that Recall@K doesn't care about the order within the K items — only whether the right things were included.

Why not accuracy? Accuracy would reward predicting "not clicked" for everything (since most items are unseen). Recall@K focuses on whether the system surfaces things the user actually wants — a harder and more meaningful bar.

Other Common Metrics

MetricWhat It Measures
Recall@KCoverage of relevant items in top K
Precision@KFraction of top K that are relevant
NDCG@KRecall@K + position weighting
MRRRank of first relevant item
Hit Rate@KDid any relevant item appear in top K?

Train/Test Split

You can't randomly split edges — that would leak future information. Instead, split by time: interactions before time T are training edges, interactions after T are test edges.

Temporal split is critical. Random splits let the model "see the future" during training and inflate metrics. Always split by timestamp in production evaluation.
A user liked 10 movies. Your model recommends 20 movies. 4 of the user's liked movies appear in the 20 recommendations. What is Recall@20?

Chapter 2: Collaborative Filtering as Graph Embedding

Before GNNs, the dominant approach to recommendation was matrix factorization. Let's understand it — because NGCF and LightGCN are its direct descendants.

The Interaction Matrix

Stack all user-item interactions into a matrix R where Rui = 1 if user u interacted with item i, else 0. For 1M users and 1M items, this is a 1M × 1M matrix. But it's incredibly sparse — 99.99% zeros. You can't store it. You can't factorize it directly.

Matrix factorization (MF) solves this by finding two smaller matrices: U ∈ ℝ|users|×d (user embeddings) and V ∈ ℝ|items|×d (item embeddings), where d ≪ |users|. The prediction for user u and item i is:

score(u, i) = euT · ei

Where eu is row u of U and ei is row i of V. If two users have similar taste, their embedding vectors should be similar (high dot product with the same items). If an item appeals to certain tastes, its vector should be similar to those users' vectors.

Matrix Factorization Geometry

2D embedding space showing users (circles) and items (squares). Drag the slider to see how embeddings align when users interact with similar items. Items closer to a user = higher predicted score.

Embedding dimension noise 0.30

The Problem With Plain MF

Matrix factorization learns embeddings from the final interaction matrix, but it misses the graph structure. If user A and user B both liked item X, and user B also liked item Y, that's a signal that user A might like item Y too. Plain MF doesn't explicitly propagate this higher-order connectivity — it only looks at direct interactions.

Graph structure = higher-order connectivity. A 2-hop path u → i → v means users u and v share item i. A 3-hop path u → i → v → j means user u is connected to item j through two intermediaries. GNNs capture this automatically. MF does not.
In matrix factorization for recommendation, what does a high dot product euT · ei mean?

Chapter 3: NGCF — Message Passing on the Interaction Graph

The year is 2019. Wang et al. ask: what if we run a GNN directly on the bipartite user-item graph? Not matrix factorization on the adjacency matrix, but real message passing where user embeddings absorb information from items they liked, and item embeddings absorb information from users who liked them.

This is NGCF (Neural Graph Collaborative Filtering). The key idea: learn embeddings not just from which items a user interacted with, but from the collaborative signal — the structure of who liked what, propagated through multiple hops.

The Message Passing Equations

Let eu(0) and ei(0) be initial embeddings for user u and item i. At each layer l, NGCF updates these embeddings by aggregating from neighbors:

eu(l+1) = σ( W1 eu(l) + ∑i ∈ N(u) (1/√|N(u)||N(i)|) · (W1 ei(l) + W2 (ei(l) ⊙ eu(l))) )

Breaking this down:

Multi-Hop Propagation

After L layers, each user embedding has absorbed information from items up to L hops away. A 2-layer NGCF lets user u "see" items that users similar to u liked — even if u never interacted with those items directly. This is the collaborative signal made explicit.

Final embedding = concatenation across layers. NGCF doesn't just use the last layer. It concatenates eu(0) through eu(L) into one long vector before scoring. This preserves both raw and propagated information.
python
class NGCFLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W1 = nn.Linear(in_dim, out_dim)
        self.W2 = nn.Linear(in_dim, out_dim)

    def forward(self, x, edge_index, norm):
        # x: node embeddings [N, d]
        # edge_index: [2, E] bipartite edges
        # norm: 1/sqrt(|N(u)|*|N(i)|) per edge
        row, col = edge_index
        # Message: W1*e_j + W2*(e_j * e_i)
        msg = self.W1(x[col]) + self.W2(x[col] * x[row])
        msg = msg * norm.unsqueeze(-1)
        # Aggregate: sum over neighbors
        out = torch.zeros_like(x)
        out.scatter_add_(0, row.unsqueeze(-1).expand_as(msg), msg)
        out = out + self.W1(x)  # self-connection
        return F.leaky_relu(out, negative_slope=0.2)
In NGCF with L=2 layers, what kind of information does a user embedding capture after message passing?

Chapter 4: LightGCN — When Less Is More

NGCF added GNN message passing to collaborative filtering and improved over MF. But He et al. (2020) asked a pointed question: are the nonlinearities and feature transformations in NGCF actually helping? What if you stripped everything away and kept only the essential graph operation?

The answer, after careful ablation study, was shocking: the weight matrices and nonlinearities in NGCF don't help — they actually hurt. The version without them, called LightGCN, consistently outperforms NGCF on every benchmark they tested.

The LightGCN Equation

LightGCN removes both the weight matrices (W1, W2) and the nonlinearity (σ). What remains is a pure normalized average of neighbor embeddings:

eu(l+1) = ∑i ∈ N(u) (1/√|N(u)||N(i)|) · ei(l)
ei(l+1) = ∑u ∈ N(i) (1/√|N(i)||N(u)|) · eu(l)

That's the entire model. No learned weights in the propagation step. No activation function. Just smooth averaging across neighbors, with normalization to prevent high-degree nodes from dominating.

The only learned parameters are the initial embeddings eu(0) and ei(0). Everything else is deterministic graph operations. The GNN learns by backpropagating through the graph structure to these initial embeddings.

Why Does Simpler Work Better?

For recommendation on interaction graphs, the signal is sparse. Every edge just says "liked" — there are no rich features to transform. Adding weight matrices introduces more parameters to overfit on this sparse signal, with no additional signal to benefit from. The graph structure itself is the inductive bias. Nonlinearities would destroy the smooth collaborative filtering signal the propagation creates.

Final Embedding: Weighted Layer Combination

Like NGCF, LightGCN aggregates across layers. But instead of concatenation, it uses a weighted sum:

eu = ∑l=0L αl eu(l)

By default, αl = 1/(L+1) — equal weighting. Layer 0 captures the raw embedding (no propagation), layer 1 captures 1-hop neighbors, layer 2 captures 2-hop, and so on. The final embedding blends all of these signals.

LightGCN Propagation — SHOWCASE

Watch how user/item embeddings evolve through LightGCN layers. Adjust layers and observe convergence. The heatmap shows pairwise user-item similarity scores after propagation. Click "Run Propagation" to animate.

Graph layers L 2
α weighting uniform
Press "Run Propagation" to start.

Comparison: NGCF vs LightGCN

PropertyNGCFLightGCN
Weight matrices in propagationYes (W1, W2)No
NonlinearityLeakyReLUNone
Interaction term (e⊙e)YesNo
Layer aggregationConcatenationWeighted sum
ParametersMore (W matrices)Fewer
Gowalla Recall@200.15470.1830
What is the key reason LightGCN removes weight matrices from its propagation layers?

Chapter 5: BPR Loss — Bayesian Personalized Ranking

You have user embeddings and item embeddings. You can score any user-item pair with a dot product. But how do you train this? What's the loss function?

The naive approach: treat it as binary classification. Observed interactions are positives (label 1), everything else is negative (label 0). Train with cross-entropy. Problem: the "negatives" aren't really negative — they're just unobserved. A user who didn't click on a book might not have seen it, not dislike it. Treating all unobserved pairs as negatives is wrong and creates a noisy training signal.

Pairwise Ranking Instead

Bayesian Personalized Ranking (BPR) takes a different view: instead of scoring items in isolation, rank them relative to each other. The core assumption is simple — a user prefers items they interacted with over items they didn't interact with. Don't predict absolute scores; predict relative ordering.

The key Bayesian assumption: for user u, observed item i should rank higher than unobserved item j. We don't know if j is truly negative — but statistically, items the user chose to interact with should rank above items they didn't.

The BPR Loss Formula

For each training triplet (u, i, j) where i is an observed interaction and j is a sampled "negative" item:

LBPR = − ∑(u,i,j) ln σ( ŷui − ŷuj ) + λ ||Θ||2

Where ŷui = euT ei is the predicted score. The loss pushes the score of the positive item above the score of the negative item. The σ (sigmoid) turns the score difference into a probability, and we maximize that probability (minimize negative log).

The λ||Θ||2 term is L2 regularization on the initial embeddings Θ = {eu(0), ei(0)} — it prevents overfitting by keeping embeddings small.

BPR Loss in Action

Three nodes: one user (orange), one positive item (teal), one negative item (red). Watch BPR loss push the positive item's score higher. Adjust separation to see how gradient magnitude changes.

Score gap (ŷui − ŷuj) 0.5

Negative Sampling

For each (u, i) positive pair, we need to sample a negative item j. Uniform random sampling is the baseline — pick any unobserved item at random. But it can be improved with hard negative sampling: pick j items that score high for user u but aren't observed interactions. These are "near misses" that the model finds hard to distinguish and learns more from.

python
def bpr_loss(user_emb, pos_emb, neg_emb, l2_reg=1e-4):
    # Dot product scores
    pos_score = (user_emb * pos_emb).sum(dim=-1)  # [batch]
    neg_score = (user_emb * neg_emb).sum(dim=-1)  # [batch]

    # BPR: maximize prob(pos > neg)
    loss = -F.logsigmoid(pos_score - neg_score).mean()

    # L2 regularization on embeddings (not propagated ones)
    reg = l2_reg * (user_emb.norm(2) + pos_emb.norm(2) + neg_emb.norm(2))

    return loss + reg
In BPR loss, the "negative" item j is:

Chapter 6: Scalability — PinSage

LightGCN on a 10M user / 10M item graph with 1B interactions. Each node has millions of neighbors. A single forward pass would aggregate across all of them. This doesn't fit in GPU memory — not even close. You need a fundamentally different approach to scale.

Pinterest's PinSage (Hamilton et al., 2018) was the first GNN deployed at truly industrial scale: 3 billion pins, 18 billion edges. Its solution to the scalability problem has influenced every production GNN since.

Key Idea: Graph Sampling

Instead of aggregating over all neighbors, PinSage samples a fixed-size neighborhood using random walks with restarts. Start at a node, take random steps in the graph, occasionally restart. Nodes visited most often are the most "important" neighbors — they contribute most to the random walker's visits and thus to the local graph structure.

Random walk sampling is better than random neighbor sampling. High-degree nodes have many connections but not all are equally informative. A node visited frequently in random walks is structurally central — neighboring in graph distance AND in importance. This is PageRank's insight applied to neighborhood selection.

The Localized Convolution

After sampling T top-ranked neighbors per node, PinSage applies a learned aggregation (dense weighted mean) followed by a nonlinearity. Because neighbors are sampled, each mini-batch only requires the local subgraph — not the full graph.

Node u
Pin/item to embed
↓ random walk with restart
Top-T Neighbors
Most visited nodes — importance-weighted
↓ weighted aggregation + MLP
Neighborhood Embedding
hN(u) = ReLU(W · mean{ev : v ∈ sampled N(u)})
↓ concat + linear
Node Embedding eu
Combined self + neighborhood signal

Curriculum Training with Hard Negatives

PinSage also introduced curriculum negative sampling: early in training, sample easy negatives (random items); later, sample hard negatives (items that score high for this user but aren't interactions). This mimics curriculum learning — start with easy examples, graduate to harder ones.

ChallengePinSage's Solution
Billions of neighbors per nodeRandom walk importance sampling (top-T)
Full graph doesn't fit in GPUMini-batch on sampled local subgraphs
Poor negative sample qualityCurriculum training: easy → hard negatives
Inference at 3B scaleMapReduce + cached embeddings
Why does PinSage use random walk with restart for neighbor sampling instead of just picking random neighbors?

Chapter 7: Cold Start — The Achilles Heel

A new user signs up for Spotify. They haven't listened to a single song. Their row in the interaction matrix is empty. Their node in the bipartite graph has no edges. LightGCN's message passing will propagate exactly nothing — their embedding after every layer is still the randomly initialized eu(0). They get random recommendations.

This is the cold start problem: new users and new items have no interaction history, so graph-based models have nothing to propagate. It's the single most painful limitation of collaborative filtering approaches.

Cold Start for New Users

Several strategies exist, each with tradeoffs:

StrategyHow It WorksLimitation
Popularity baselineRecommend globally popular itemsIgnores user taste entirely
Onboarding surveyAsk user preferences explicitlyFriction, sparse signal
Side informationUse demographic/context featuresRequires feature engineering
ExplorationShow diverse items, learn from clicksCold start items, bad initial UX
Transfer learningPre-train on another datasetDomain shift

Cold Start for New Items

New items are arguably harder than new users. A new movie has no watch history. But it has content: a plot summary, a genre, actors, director. Content-based features can bootstrap a new item's embedding from its attributes, before any user has interacted with it.

The hybrid approach: initialize item embeddings from content features (text, image, metadata), then fine-tune them as interactions accumulate. Items start with a reasonable embedding day 1 based on content similarity to other items. The GNN then refines these embeddings as the item gains an interaction history.

Connecting to PinSage's Approach

PinSage directly addresses cold start for new items: because it uses content features (image features extracted by a CNN, text features) as part of the node representation alongside the graph structure, new pins with no interactions still have a meaningful initial embedding. The graph refines it; content bootstraps it.

python
def cold_start_embed(item_features, content_encoder, graph_refiner):
    # Step 1: encode content features (works with 0 interactions)
    e_content = content_encoder(item_features)  # [d]

    # Step 2: if item has neighbors, refine with graph signal
    if has_interactions(item_id):
        e_graph = graph_refiner(item_id)       # LightGCN embedding
        alpha = min(num_interactions / 100, 1.0)  # blend ratio
        e_final = (1 - alpha) * e_content + alpha * e_graph
    else:
        e_final = e_content  # pure content cold start

    return e_final
Why is cold start particularly damaging for GNN-based recommenders like LightGCN?

Chapter 8: GNN vs LLM for RecSys

As LLMs became powerful, researchers asked the obvious question: can we just use a language model for recommendations? Feed the user's history as text, ask the LLM to recommend what comes next. No graph. No specialized architecture. Just prompting.

The surprising answer from 2023-2024 research: for most benchmarks, LLMs are worse at recommendation than LightGCN. Let's understand why — and where LLMs actually win.

Why GNNs Still Win on Cold Data

Recommendation is fundamentally about collaborative signals: what do similar users like? This signal lives in the interaction graph. LLMs have no access to this graph at inference time — they can only use what's in the prompt, and the interaction history of millions of users doesn't fit in a context window.

GNNs excel at:
• Collaborative filtering (leveraging all users)
• Items with rich interaction history
• Sparse ID-based signals
• Scalability to billions of interactions
• Interpretable neighborhood signals
LLMs excel at:
• Cold start (content understanding)
• Reasoning about novel situations
• Cross-domain transfer
• Understanding complex user intent
• Items with rich text descriptions

The Winning Combination: Hybrid Models

The best production systems use both. A common pattern: use LLMs to generate rich item embeddings from text (title, description, reviews), then use GNNs on the interaction graph with those LLM embeddings as node features. LLMs handle cold start and content understanding; GNNs handle collaborative filtering and propagation.

Think of it as division of labor: the LLM is the content expert (what does this item's description mean?), while the GNN is the social expert (who else liked this, and what else did they like?). Recommendation requires both kinds of knowledge.

Emerging Paradigm: LLMs as Graph Reasoners

The frontier is trying to teach LLMs to reason about graphs — describing user-item relationships in text and asking the LLM to make recommendations that account for graph structure. This is an active research area, but current LLMs still struggle with the kind of large-scale collaborative filtering that GNNs do effortlessly.

AspectGNN (LightGCN)LLMHybrid
Cold startPoorGoodBest
Collaborative filteringBestPoorGood
Content understandingPoorBestBest
ScalabilityBestExpensiveModerate
Training data neededInteractionsText corpusBoth
Why do GNNs outperform LLMs on collaborative filtering benchmarks despite LLMs being more capable models overall?

Chapter 9: Connections — Where to Go Next

Recommender systems are a microcosm of the entire GNN ecosystem. Everything we've learned in this lecture connects directly to broader ideas in graph learning.

The Core Ideas to Carry Forward

Reframing matters. Recommendation isn't a special problem — it's link prediction on a bipartite graph. Once you make this connection, the entire GNN toolkit applies. This reframing strategy (see the graph, apply the tool) is how graph ML practitioners think.
Simplicity wins on sparse signals. NGCF added complexity (weight matrices, nonlinearities). LightGCN removed them and won. For sparse interaction data, the graph structure itself is sufficient inductive bias — extra parameters just overfit.
Scale requires sampling. Full-graph GNNs don't scale past tens of thousands of nodes. Production systems need sampling (PinSage's random walk), mini-batching on subgraphs, and inference caching. Scalability isn't an afterthought — it shapes the architecture.
Cold start needs content. Pure collaborative filtering fails for new users and items. Hybrid models that combine content features (LLM embeddings, item metadata) with interaction graph structure are the production standard.

Related Lessons in This Series

TopicLessonConnection
GNN fundamentalsLecture 3: GNNsMessage passing, node embeddings
Link predictionLecture 8: Link PredictionScoring edges, training pairs
Knowledge graphsLecture 10: Knowledge GraphsBipartite-like structure, scoring
Relational dataLecture 12: Relational Deep LearningDatabases as graphs

Key Papers

"The purpose of recommender systems is not to show people what they want — it's to help them discover what they didn't know they wanted."
— paraphrasing the RecSys community's guiding principle