CS224W Lecture 11 — GNNs for Recommender Systems

Chapter 0: The RecSys Problem

You open Netflix. Out of 15,000 titles, they show you 20. Every choice they don't show you is a title you might have watched, but didn't get the chance to. Every choice they do show is a bet they're placing on your taste. How do they make that bet with any confidence?

This is the recommender system problem: given a history of user-item interactions, predict which items a user will want next. The interactions are sparse — most user-item pairs have no data at all. But they form a structure that reveals everything: a graph.

The Bipartite Interaction Graph

Every interaction (user clicked item, user rated movie, user bought product) is an edge in a bipartite graph. "Bipartite" means two kinds of nodes: users on one side, items on the other. Edges only cross between sides — users never connect to users, items never connect to items directly.

Recommending = link prediction on a bipartite graph. If user u and item v have no edge, should they? That's the question. A GNN that learns good embeddings for both sides can answer it by measuring how similar the embeddings are.

This reframing is the core insight of this lecture. You're not building some custom recommendation algorithm from scratch — you're doing link prediction, a task GNNs already know how to do. The bipartite structure is the key constraint that shapes everything else.

Bipartite Interaction Graph

5 users (left, warm) and 8 items (right, teal). Solid edges = observed interactions. Dashed edges = predicted future interactions. Click a user to see their neighborhood.

Click any user node to explore their interactions.

Why Is This Hard?

A naïve approach: for each user, find items they haven't seen and rank by popularity. Problem: this ignores that different users have radically different tastes. It recommends the same blockbusters to everyone.

A better approach: find users with similar taste and recommend what they liked. This is collaborative filtering — "collaborate" with similar users to filter items. The challenge is defining "similar taste" at scale, across millions of users, without computing all pairwise comparisons.

The graph gives you similarity for free. Two users are similar if they have many common neighbors (items they both interacted with). Items are similar if many common users liked both. Message passing propagates this structural similarity automatically.

In a bipartite user-item graph, what does it mean for two users to have high "structural similarity"?

They are directly connected to each other in the graph They have the same embedding vector in the model They share many items they both interacted with (many common neighbors) They are both in the training set

Chapter 1: Evaluation — Recall@K

Before we build any model, we need to know how to measure success. This is subtler than it looks. In classification, you predict a label and check if it's right. In recommendation, you predict a ranked list of items — and the user can only evaluate a few of them.

The Candidate Generation + Ranking Pipeline

Production recommender systems use a two-stage pipeline. Stage 1 is candidate generation: from 10 million items, retrieve ~1,000 plausible candidates quickly. Stage 2 is ranking: score those 1,000 candidates more carefully and pick the top K to show the user.

All Items (~10M)

Full item catalog

↓ fast retrieval (ANN search, embedding similarity)

Candidates (~1,000)

Plausible items for this user

↓ expensive scoring (GNN, neural ranker)

Top-K Results (~20)

What the user actually sees

GNNs are typically used in Stage 2 — they're expressive but expensive. Stage 1 uses simpler methods like approximate nearest neighbor (ANN) search over learned embeddings.

Recall@K — The Right Metric

How do we evaluate? We hold out some of a user's interactions as a test set. The model recommends K items. Recall@K measures: of all items the user actually interacted with in the test set, what fraction appeared in the top-K recommendations?

Recall@K = |{recommended items} ∩ {relevant items}| / |{relevant items}|

If a user liked 5 items in the test set and your top-10 recommendations captured 3 of them, Recall@10 = 3/5 = 0.6. Higher is better. Note that Recall@K doesn't care about the order within the K items — only whether the right things were included.

Why not accuracy? Accuracy would reward predicting "not clicked" for everything (since most items are unseen). Recall@K focuses on whether the system surfaces things the user actually wants — a harder and more meaningful bar.

Other Common Metrics

Metric	What It Measures
Recall@K	Coverage of relevant items in top K
Precision@K	Fraction of top K that are relevant
NDCG@K	Recall@K + position weighting
MRR	Rank of first relevant item
Hit Rate@K	Did any relevant item appear in top K?

Train/Test Split

You can't randomly split edges — that would leak future information. Instead, split by time: interactions before time T are training edges, interactions after T are test edges.

Temporal split is critical. Random splits let the model "see the future" during training and inflate metrics. Always split by timestamp in production evaluation.

A user liked 10 movies. Your model recommends 20 movies. 4 of the user's liked movies appear in the 20 recommendations. What is Recall@20?

0.2 (4 out of 20 recommended) 0.4 (4 out of 10 relevant) 4.0 (raw count) 0.8 (1 minus miss rate)

Chapter 2: Collaborative Filtering as Graph Embedding

Before GNNs, the dominant approach to recommendation was matrix factorization. Let's understand it — because NGCF and LightGCN are its direct descendants.

The Interaction Matrix

Stack all user-item interactions into a matrix R where R_ui = 1 if user u interacted with item i, else 0. For 1M users and 1M items, this is a 1M × 1M matrix. But it's incredibly sparse — 99.99% zeros. You can't store it. You can't factorize it directly.

score(u, i) = e_u^T · e_i

Where e_u is row u of U and e_i is row i of V. If two users have similar taste, their embedding vectors should be similar (high dot product with the same items). If an item appeals to certain tastes, its vector should be similar to those users' vectors.

Matrix Factorization Geometry

2D embedding space showing users (circles) and items (squares). Drag the slider to see how embeddings align when users interact with similar items. Items closer to a user = higher predicted score.

Embedding dimension noise 0.30

The Problem With Plain MF

Matrix factorization learns embeddings from the final interaction matrix, but it misses the graph structure. If user A and user B both liked item X, and user B also liked item Y, that's a signal that user A might like item Y too. Plain MF doesn't explicitly propagate this higher-order connectivity — it only looks at direct interactions.

Graph structure = higher-order connectivity. A 2-hop path u → i → v means users u and v share item i. A 3-hop path u → i → v → j means user u is connected to item j through two intermediaries. GNNs capture this automatically. MF does not.

In matrix factorization for recommendation, what does a high dot product e_u^T · e_i mean?

User u has interacted with item i many times User u is a power user with many total interactions The model predicts user u would like item i Item i is popular among all users

Chapter 3: NGCF — Message Passing on the Interaction Graph

The year is 2019. Wang et al. ask: what if we run a GNN directly on the bipartite user-item graph? Not matrix factorization on the adjacency matrix, but real message passing where user embeddings absorb information from items they liked, and item embeddings absorb information from users who liked them.

This is NGCF (Neural Graph Collaborative Filtering). The key idea: learn embeddings not just from which items a user interacted with, but from the collaborative signal — the structure of who liked what, propagated through multiple hops.

The Message Passing Equations

Let e_u⁽⁰⁾ and e_i⁽⁰⁾ be initial embeddings for user u and item i. At each layer l, NGCF updates these embeddings by aggregating from neighbors:

e_u^(l+1) = σ( W₁ e_u^(l) + ∑_{i ∈ N(u)} (1/√|N(u)||N(i)|) · (W₁ e_i^(l) + W₂ (e_i^(l) ⊙ e_u^(l))) )

Breaking this down:

e_u^(l) — user u's embedding at layer l
N(u) — all items user u interacted with (neighbors in bipartite graph)
1/√|N(u)||N(i)| — normalization (prevents high-degree nodes from dominating)
W₁, W₂ — learned weight matrices for the interaction
e_i^(l) ⊙ e_u^(l) — element-wise product captures fine-grained user-item interaction
σ — LeakyReLU nonlinearity

Multi-Hop Propagation

After L layers, each user embedding has absorbed information from items up to L hops away. A 2-layer NGCF lets user u "see" items that users similar to u liked — even if u never interacted with those items directly. This is the collaborative signal made explicit.

Final embedding = concatenation across layers. NGCF doesn't just use the last layer. It concatenates e_u⁽⁰⁾ through e_u^(L) into one long vector before scoring. This preserves both raw and propagated information.

python
class NGCFLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W1 = nn.Linear(in_dim, out_dim)
        self.W2 = nn.Linear(in_dim, out_dim)

    def forward(self, x, edge_index, norm):
        # x: node embeddings [N, d]
        # edge_index: [2, E] bipartite edges
        # norm: 1/sqrt(|N(u)|*|N(i)|) per edge
        row, col = edge_index
        # Message: W1*e_j + W2*(e_j * e_i)
        msg = self.W1(x[col]) + self.W2(x[col] * x[row])
        msg = msg * norm.unsqueeze(-1)
        # Aggregate: sum over neighbors
        out = torch.zeros_like(x)
        out.scatter_add_(0, row.unsqueeze(-1).expand_as(msg), msg)
        out = out + self.W1(x)  # self-connection
        return F.leaky_relu(out, negative_slope=0.2)

In NGCF with L=2 layers, what kind of information does a user embedding capture after message passing?

Only direct interactions (items the user clicked) Only global popularity statistics Direct interactions + items liked by similar users (2-hop neighbors) All items in the catalog equally weighted

Chapter 4: LightGCN — When Less Is More

NGCF added GNN message passing to collaborative filtering and improved over MF. But He et al. (2020) asked a pointed question: are the nonlinearities and feature transformations in NGCF actually helping? What if you stripped everything away and kept only the essential graph operation?

The answer, after careful ablation study, was shocking: the weight matrices and nonlinearities in NGCF don't help — they actually hurt. The version without them, called LightGCN, consistently outperforms NGCF on every benchmark they tested.

The LightGCN Equation

LightGCN removes both the weight matrices (W₁, W₂) and the nonlinearity (σ). What remains is a pure normalized average of neighbor embeddings:

e_u^(l+1) = ∑_{i ∈ N(u)} (1/√|N(u)||N(i)|) · e_i^(l)

e_i^(l+1) = ∑_{u ∈ N(i)} (1/√|N(i)||N(u)|) · e_u^(l)

That's the entire model. No learned weights in the propagation step. No activation function. Just smooth averaging across neighbors, with normalization to prevent high-degree nodes from dominating.

The only learned parameters are the initial embeddings e_u⁽⁰⁾ and e_i⁽⁰⁾. Everything else is deterministic graph operations. The GNN learns by backpropagating through the graph structure to these initial embeddings.

Why Does Simpler Work Better?

For recommendation on interaction graphs, the signal is sparse. Every edge just says "liked" — there are no rich features to transform. Adding weight matrices introduces more parameters to overfit on this sparse signal, with no additional signal to benefit from. The graph structure itself is the inductive bias. Nonlinearities would destroy the smooth collaborative filtering signal the propagation creates.

Final Embedding: Weighted Layer Combination

Like NGCF, LightGCN aggregates across layers. But instead of concatenation, it uses a weighted sum:

e_u = ∑_l=0^L α_l e_u^(l)

By default, α_l = 1/(L+1) — equal weighting. Layer 0 captures the raw embedding (no propagation), layer 1 captures 1-hop neighbors, layer 2 captures 2-hop, and so on. The final embedding blends all of these signals.

LightGCN Propagation — SHOWCASE

Watch how user/item embeddings evolve through LightGCN layers. Adjust layers and observe convergence. The heatmap shows pairwise user-item similarity scores after propagation. Click "Run Propagation" to animate.

Graph layers L 2

α weighting uniform

Press "Run Propagation" to start.

Comparison: NGCF vs LightGCN

Property	NGCF	LightGCN
Weight matrices in propagation	Yes (W₁, W₂)	No
Nonlinearity	LeakyReLU	None
Interaction term (e⊙e)	Yes	No
Layer aggregation	Concatenation	Weighted sum
Parameters	More (W matrices)	Fewer
Gowalla Recall@20	0.1547	0.1830

What is the key reason LightGCN removes weight matrices from its propagation layers?

Weight matrices are computationally too expensive for large graphs Weight matrices cannot be backpropagated through graph operations On sparse interaction graphs, feature transformations overfit and hurt performance — the graph structure alone is the right inductive bias PyTorch Geometric doesn't support learnable weights in GNN layers

Chapter 5: BPR Loss — Bayesian Personalized Ranking

You have user embeddings and item embeddings. You can score any user-item pair with a dot product. But how do you train this? What's the loss function?

The naive approach: treat it as binary classification. Observed interactions are positives (label 1), everything else is negative (label 0). Train with cross-entropy. Problem: the "negatives" aren't really negative — they're just unobserved. A user who didn't click on a book might not have seen it, not dislike it. Treating all unobserved pairs as negatives is wrong and creates a noisy training signal.

Pairwise Ranking Instead

Bayesian Personalized Ranking (BPR) takes a different view: instead of scoring items in isolation, rank them relative to each other. The core assumption is simple — a user prefers items they interacted with over items they didn't interact with. Don't predict absolute scores; predict relative ordering.

The key Bayesian assumption: for user u, observed item i should rank higher than unobserved item j. We don't know if j is truly negative — but statistically, items the user chose to interact with should rank above items they didn't.

The BPR Loss Formula

For each training triplet (u, i, j) where i is an observed interaction and j is a sampled "negative" item:

L_BPR = − ∑_(u,i,j) ln σ( ŷ_ui − ŷ_uj ) + λ ||Θ||²

Where ŷ_ui = e_u^T e_i is the predicted score. The loss pushes the score of the positive item above the score of the negative item. The σ (sigmoid) turns the score difference into a probability, and we maximize that probability (minimize negative log).

The λ||Θ||² term is L2 regularization on the initial embeddings Θ = {e_u⁽⁰⁾, e_i⁽⁰⁾} — it prevents overfitting by keeping embeddings small.

BPR Loss in Action

Three nodes: one user (orange), one positive item (teal), one negative item (red). Watch BPR loss push the positive item's score higher. Adjust separation to see how gradient magnitude changes.

Score gap (ŷ_ui − ŷ_uj) 0.5

Negative Sampling

For each (u, i) positive pair, we need to sample a negative item j. Uniform random sampling is the baseline — pick any unobserved item at random. But it can be improved with hard negative sampling: pick j items that score high for user u but aren't observed interactions. These are "near misses" that the model finds hard to distinguish and learns more from.

python
def bpr_loss(user_emb, pos_emb, neg_emb, l2_reg=1e-4):
    # Dot product scores
    pos_score = (user_emb * pos_emb).sum(dim=-1)  # [batch]
    neg_score = (user_emb * neg_emb).sum(dim=-1)  # [batch]

    # BPR: maximize prob(pos > neg)
    loss = -F.logsigmoid(pos_score - neg_score).mean()

    # L2 regularization on embeddings (not propagated ones)
    reg = l2_reg * (user_emb.norm(2) + pos_emb.norm(2) + neg_emb.norm(2))

    return loss + reg

In BPR loss, the "negative" item j is:

An item the user explicitly disliked (gave a negative rating) A randomly selected item from a separate "dislike" dataset A randomly sampled unobserved item — the assumption is that items the user chose to interact with should rank above items they didn't The item with the lowest embedding norm in the catalog

Chapter 6: Scalability — PinSage

LightGCN on a 10M user / 10M item graph with 1B interactions. Each node has millions of neighbors. A single forward pass would aggregate across all of them. This doesn't fit in GPU memory — not even close. You need a fundamentally different approach to scale.

Pinterest's PinSage (Hamilton et al., 2018) was the first GNN deployed at truly industrial scale: 3 billion pins, 18 billion edges. Its solution to the scalability problem has influenced every production GNN since.

Key Idea: Graph Sampling

Instead of aggregating over all neighbors, PinSage samples a fixed-size neighborhood using random walks with restarts. Start at a node, take random steps in the graph, occasionally restart. Nodes visited most often are the most "important" neighbors — they contribute most to the random walker's visits and thus to the local graph structure.

Random walk sampling is better than random neighbor sampling. High-degree nodes have many connections but not all are equally informative. A node visited frequently in random walks is structurally central — neighboring in graph distance AND in importance. This is PageRank's insight applied to neighborhood selection.

The Localized Convolution

After sampling T top-ranked neighbors per node, PinSage applies a learned aggregation (dense weighted mean) followed by a nonlinearity. Because neighbors are sampled, each mini-batch only requires the local subgraph — not the full graph.

Node u

Pin/item to embed

↓ random walk with restart

Top-T Neighbors

Most visited nodes — importance-weighted

↓ weighted aggregation + MLP

Neighborhood Embedding

h_N(u) = ReLU(W · mean{e_v : v ∈ sampled N(u)})

↓ concat + linear

Node Embedding e_u

Combined self + neighborhood signal

Curriculum Training with Hard Negatives

PinSage also introduced curriculum negative sampling: early in training, sample easy negatives (random items); later, sample hard negatives (items that score high for this user but aren't interactions). This mimics curriculum learning — start with easy examples, graduate to harder ones.

Challenge	PinSage's Solution
Billions of neighbors per node	Random walk importance sampling (top-T)
Full graph doesn't fit in GPU	Mini-batch on sampled local subgraphs
Poor negative sample quality	Curriculum training: easy → hard negatives
Inference at 3B scale	MapReduce + cached embeddings

Why does PinSage use random walk with restart for neighbor sampling instead of just picking random neighbors?

Random walks are faster to compute than random sampling Nodes visited frequently in random walks are structurally important — the walk captures both graph distance and centrality, giving better neighborhoods than uniform random sampling Random walks avoid duplicate neighbors, which random sampling cannot Random walks require less GPU memory than random sampling

Chapter 7: Cold Start — The Achilles Heel

A new user signs up for Spotify. They haven't listened to a single song. Their row in the interaction matrix is empty. Their node in the bipartite graph has no edges. LightGCN's message passing will propagate exactly nothing — their embedding after every layer is still the randomly initialized e_u⁽⁰⁾. They get random recommendations.

This is the cold start problem: new users and new items have no interaction history, so graph-based models have nothing to propagate. It's the single most painful limitation of collaborative filtering approaches.

Cold Start for New Users

Several strategies exist, each with tradeoffs:

Strategy	How It Works	Limitation
Popularity baseline	Recommend globally popular items	Ignores user taste entirely
Onboarding survey	Ask user preferences explicitly	Friction, sparse signal
Side information	Use demographic/context features	Requires feature engineering
Exploration	Show diverse items, learn from clicks	Cold start items, bad initial UX
Transfer learning	Pre-train on another dataset	Domain shift

Cold Start for New Items

New items are arguably harder than new users. A new movie has no watch history. But it has content: a plot summary, a genre, actors, director. Content-based features can bootstrap a new item's embedding from its attributes, before any user has interacted with it.

The hybrid approach: initialize item embeddings from content features (text, image, metadata), then fine-tune them as interactions accumulate. Items start with a reasonable embedding day 1 based on content similarity to other items. The GNN then refines these embeddings as the item gains an interaction history.

Connecting to PinSage's Approach

PinSage directly addresses cold start for new items: because it uses content features (image features extracted by a CNN, text features) as part of the node representation alongside the graph structure, new pins with no interactions still have a meaningful initial embedding. The graph refines it; content bootstraps it.

python
def cold_start_embed(item_features, content_encoder, graph_refiner):
    # Step 1: encode content features (works with 0 interactions)
    e_content = content_encoder(item_features)  # [d]

    # Step 2: if item has neighbors, refine with graph signal
    if has_interactions(item_id):
        e_graph = graph_refiner(item_id)       # LightGCN embedding
        alpha = min(num_interactions / 100, 1.0)  # blend ratio
        e_final = (1 - alpha) * e_content + alpha * e_graph
    else:
        e_final = e_content  # pure content cold start

    return e_final

Why is cold start particularly damaging for GNN-based recommenders like LightGCN?

GNNs require more training data than other models GNNs are slower to run inference for new users Message passing can only propagate along existing edges — a node with no edges stays at its random initialization regardless of how many GNN layers are applied New users have longer embedding vectors that don't fit the model's input dimensions

Chapter 8: GNN vs LLM for RecSys

As LLMs became powerful, researchers asked the obvious question: can we just use a language model for recommendations? Feed the user's history as text, ask the LLM to recommend what comes next. No graph. No specialized architecture. Just prompting.

The surprising answer from 2023-2024 research: for most benchmarks, LLMs are worse at recommendation than LightGCN. Let's understand why — and where LLMs actually win.

Why GNNs Still Win on Cold Data

Recommendation is fundamentally about collaborative signals: what do similar users like? This signal lives in the interaction graph. LLMs have no access to this graph at inference time — they can only use what's in the prompt, and the interaction history of millions of users doesn't fit in a context window.

GNNs excel at:
• Collaborative filtering (leveraging all users)
• Items with rich interaction history
• Sparse ID-based signals
• Scalability to billions of interactions
• Interpretable neighborhood signals

LLMs excel at:
• Cold start (content understanding)
• Reasoning about novel situations
• Cross-domain transfer
• Understanding complex user intent
• Items with rich text descriptions

The Winning Combination: Hybrid Models

The best production systems use both. A common pattern: use LLMs to generate rich item embeddings from text (title, description, reviews), then use GNNs on the interaction graph with those LLM embeddings as node features. LLMs handle cold start and content understanding; GNNs handle collaborative filtering and propagation.

Think of it as division of labor: the LLM is the content expert (what does this item's description mean?), while the GNN is the social expert (who else liked this, and what else did they like?). Recommendation requires both kinds of knowledge.

Emerging Paradigm: LLMs as Graph Reasoners

The frontier is trying to teach LLMs to reason about graphs — describing user-item relationships in text and asking the LLM to make recommendations that account for graph structure. This is an active research area, but current LLMs still struggle with the kind of large-scale collaborative filtering that GNNs do effortlessly.

Aspect	GNN (LightGCN)	LLM	Hybrid
Cold start	Poor	Good	Best
Collaborative filtering	Best	Poor	Good
Content understanding	Poor	Best	Best
Scalability	Best	Expensive	Moderate
Training data needed	Interactions	Text corpus	Both

Why do GNNs outperform LLMs on collaborative filtering benchmarks despite LLMs being more capable models overall?

GNNs are faster so they can process more training examples LLMs cannot process numerical data like user IDs Collaborative filtering requires access to all users' interaction histories simultaneously — a graph GNNs explicitly model, but an LLM context window cannot hold LLMs weren't trained on recommendation datasets

Chapter 9: Connections — Where to Go Next

Recommender systems are a microcosm of the entire GNN ecosystem. Everything we've learned in this lecture connects directly to broader ideas in graph learning.

The Core Ideas to Carry Forward

Reframing matters. Recommendation isn't a special problem — it's link prediction on a bipartite graph. Once you make this connection, the entire GNN toolkit applies. This reframing strategy (see the graph, apply the tool) is how graph ML practitioners think.

Simplicity wins on sparse signals. NGCF added complexity (weight matrices, nonlinearities). LightGCN removed them and won. For sparse interaction data, the graph structure itself is sufficient inductive bias — extra parameters just overfit.

Scale requires sampling. Full-graph GNNs don't scale past tens of thousands of nodes. Production systems need sampling (PinSage's random walk), mini-batching on subgraphs, and inference caching. Scalability isn't an afterthought — it shapes the architecture.

Cold start needs content. Pure collaborative filtering fails for new users and items. Hybrid models that combine content features (LLM embeddings, item metadata) with interaction graph structure are the production standard.

Related Lessons in This Series

Topic	Lesson	Connection
GNN fundamentals	Lecture 3: GNNs	Message passing, node embeddings
Link prediction	Lecture 8: Link Prediction	Scoring edges, training pairs
Knowledge graphs	Lecture 10: Knowledge Graphs	Bipartite-like structure, scoring
Relational data	Lecture 12: Relational Deep Learning	Databases as graphs

Key Papers

NGCF — Wang et al. (2019). "Neural Graph Collaborative Filtering." SIGIR.
LightGCN — He et al. (2020). "LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation." SIGIR.
PinSage — Ying et al. (2018). "Graph Convolutional Neural Networks for Web-Scale Recommender Systems." KDD.
BPR — Rendle et al. (2009). "BPR: Bayesian Personalized Ranking from Implicit Feedback." UAI.

"The purpose of recommender systems is not to show people what they want — it's to help them discover what they didn't know they wanted."
— paraphrasing the RecSys community's guiding principle