You have 100 million users and 10 million items. Most pairs have never interacted. Yet you must predict which items each user wants next — from the graph of who clicked what.
You open Netflix. Out of 15,000 titles, they show you 20. Every choice they don't show you is a title you might have watched, but didn't get the chance to. Every choice they do show is a bet they're placing on your taste. How do they make that bet with any confidence?
This is the recommender system problem: given a history of user-item interactions, predict which items a user will want next. The interactions are sparse — most user-item pairs have no data at all. But they form a structure that reveals everything: a graph.
Every interaction (user clicked item, user rated movie, user bought product) is an edge in a bipartite graph. "Bipartite" means two kinds of nodes: users on one side, items on the other. Edges only cross between sides — users never connect to users, items never connect to items directly.
This reframing is the core insight of this lecture. You're not building some custom recommendation algorithm from scratch — you're doing link prediction, a task GNNs already know how to do. The bipartite structure is the key constraint that shapes everything else.
5 users (left, warm) and 8 items (right, teal). Solid edges = observed interactions. Dashed edges = predicted future interactions. Click a user to see their neighborhood.
A naïve approach: for each user, find items they haven't seen and rank by popularity. Problem: this ignores that different users have radically different tastes. It recommends the same blockbusters to everyone.
A better approach: find users with similar taste and recommend what they liked. This is collaborative filtering — "collaborate" with similar users to filter items. The challenge is defining "similar taste" at scale, across millions of users, without computing all pairwise comparisons.
Before we build any model, we need to know how to measure success. This is subtler than it looks. In classification, you predict a label and check if it's right. In recommendation, you predict a ranked list of items — and the user can only evaluate a few of them.
Production recommender systems use a two-stage pipeline. Stage 1 is candidate generation: from 10 million items, retrieve ~1,000 plausible candidates quickly. Stage 2 is ranking: score those 1,000 candidates more carefully and pick the top K to show the user.
GNNs are typically used in Stage 2 — they're expressive but expensive. Stage 1 uses simpler methods like approximate nearest neighbor (ANN) search over learned embeddings.
How do we evaluate? We hold out some of a user's interactions as a test set. The model recommends K items. Recall@K measures: of all items the user actually interacted with in the test set, what fraction appeared in the top-K recommendations?
If a user liked 5 items in the test set and your top-10 recommendations captured 3 of them, Recall@10 = 3/5 = 0.6. Higher is better. Note that Recall@K doesn't care about the order within the K items — only whether the right things were included.
| Metric | What It Measures |
|---|---|
| Recall@K | Coverage of relevant items in top K |
| Precision@K | Fraction of top K that are relevant |
| NDCG@K | Recall@K + position weighting |
| MRR | Rank of first relevant item |
| Hit Rate@K | Did any relevant item appear in top K? |
You can't randomly split edges — that would leak future information. Instead, split by time: interactions before time T are training edges, interactions after T are test edges.
Before GNNs, the dominant approach to recommendation was matrix factorization. Let's understand it — because NGCF and LightGCN are its direct descendants.
Stack all user-item interactions into a matrix R where Rui = 1 if user u interacted with item i, else 0. For 1M users and 1M items, this is a 1M × 1M matrix. But it's incredibly sparse — 99.99% zeros. You can't store it. You can't factorize it directly.
Matrix factorization (MF) solves this by finding two smaller matrices: U ∈ ℝ|users|×d (user embeddings) and V ∈ ℝ|items|×d (item embeddings), where d ≪ |users|. The prediction for user u and item i is:
Where eu is row u of U and ei is row i of V. If two users have similar taste, their embedding vectors should be similar (high dot product with the same items). If an item appeals to certain tastes, its vector should be similar to those users' vectors.
2D embedding space showing users (circles) and items (squares). Drag the slider to see how embeddings align when users interact with similar items. Items closer to a user = higher predicted score.
Matrix factorization learns embeddings from the final interaction matrix, but it misses the graph structure. If user A and user B both liked item X, and user B also liked item Y, that's a signal that user A might like item Y too. Plain MF doesn't explicitly propagate this higher-order connectivity — it only looks at direct interactions.
The year is 2019. Wang et al. ask: what if we run a GNN directly on the bipartite user-item graph? Not matrix factorization on the adjacency matrix, but real message passing where user embeddings absorb information from items they liked, and item embeddings absorb information from users who liked them.
This is NGCF (Neural Graph Collaborative Filtering). The key idea: learn embeddings not just from which items a user interacted with, but from the collaborative signal — the structure of who liked what, propagated through multiple hops.
Let eu(0) and ei(0) be initial embeddings for user u and item i. At each layer l, NGCF updates these embeddings by aggregating from neighbors:
Breaking this down:
After L layers, each user embedding has absorbed information from items up to L hops away. A 2-layer NGCF lets user u "see" items that users similar to u liked — even if u never interacted with those items directly. This is the collaborative signal made explicit.
python class NGCFLayer(nn.Module): def __init__(self, in_dim, out_dim): super().__init__() self.W1 = nn.Linear(in_dim, out_dim) self.W2 = nn.Linear(in_dim, out_dim) def forward(self, x, edge_index, norm): # x: node embeddings [N, d] # edge_index: [2, E] bipartite edges # norm: 1/sqrt(|N(u)|*|N(i)|) per edge row, col = edge_index # Message: W1*e_j + W2*(e_j * e_i) msg = self.W1(x[col]) + self.W2(x[col] * x[row]) msg = msg * norm.unsqueeze(-1) # Aggregate: sum over neighbors out = torch.zeros_like(x) out.scatter_add_(0, row.unsqueeze(-1).expand_as(msg), msg) out = out + self.W1(x) # self-connection return F.leaky_relu(out, negative_slope=0.2)
NGCF added GNN message passing to collaborative filtering and improved over MF. But He et al. (2020) asked a pointed question: are the nonlinearities and feature transformations in NGCF actually helping? What if you stripped everything away and kept only the essential graph operation?
The answer, after careful ablation study, was shocking: the weight matrices and nonlinearities in NGCF don't help — they actually hurt. The version without them, called LightGCN, consistently outperforms NGCF on every benchmark they tested.
LightGCN removes both the weight matrices (W1, W2) and the nonlinearity (σ). What remains is a pure normalized average of neighbor embeddings:
That's the entire model. No learned weights in the propagation step. No activation function. Just smooth averaging across neighbors, with normalization to prevent high-degree nodes from dominating.
For recommendation on interaction graphs, the signal is sparse. Every edge just says "liked" — there are no rich features to transform. Adding weight matrices introduces more parameters to overfit on this sparse signal, with no additional signal to benefit from. The graph structure itself is the inductive bias. Nonlinearities would destroy the smooth collaborative filtering signal the propagation creates.
Like NGCF, LightGCN aggregates across layers. But instead of concatenation, it uses a weighted sum:
By default, αl = 1/(L+1) — equal weighting. Layer 0 captures the raw embedding (no propagation), layer 1 captures 1-hop neighbors, layer 2 captures 2-hop, and so on. The final embedding blends all of these signals.
Watch how user/item embeddings evolve through LightGCN layers. Adjust layers and observe convergence. The heatmap shows pairwise user-item similarity scores after propagation. Click "Run Propagation" to animate.
| Property | NGCF | LightGCN |
|---|---|---|
| Weight matrices in propagation | Yes (W1, W2) | No |
| Nonlinearity | LeakyReLU | None |
| Interaction term (e⊙e) | Yes | No |
| Layer aggregation | Concatenation | Weighted sum |
| Parameters | More (W matrices) | Fewer |
| Gowalla Recall@20 | 0.1547 | 0.1830 |
You have user embeddings and item embeddings. You can score any user-item pair with a dot product. But how do you train this? What's the loss function?
The naive approach: treat it as binary classification. Observed interactions are positives (label 1), everything else is negative (label 0). Train with cross-entropy. Problem: the "negatives" aren't really negative — they're just unobserved. A user who didn't click on a book might not have seen it, not dislike it. Treating all unobserved pairs as negatives is wrong and creates a noisy training signal.
Bayesian Personalized Ranking (BPR) takes a different view: instead of scoring items in isolation, rank them relative to each other. The core assumption is simple — a user prefers items they interacted with over items they didn't interact with. Don't predict absolute scores; predict relative ordering.
For each training triplet (u, i, j) where i is an observed interaction and j is a sampled "negative" item:
Where ŷui = euT ei is the predicted score. The loss pushes the score of the positive item above the score of the negative item. The σ (sigmoid) turns the score difference into a probability, and we maximize that probability (minimize negative log).
The λ||Θ||2 term is L2 regularization on the initial embeddings Θ = {eu(0), ei(0)} — it prevents overfitting by keeping embeddings small.
Three nodes: one user (orange), one positive item (teal), one negative item (red). Watch BPR loss push the positive item's score higher. Adjust separation to see how gradient magnitude changes.
For each (u, i) positive pair, we need to sample a negative item j. Uniform random sampling is the baseline — pick any unobserved item at random. But it can be improved with hard negative sampling: pick j items that score high for user u but aren't observed interactions. These are "near misses" that the model finds hard to distinguish and learns more from.
python def bpr_loss(user_emb, pos_emb, neg_emb, l2_reg=1e-4): # Dot product scores pos_score = (user_emb * pos_emb).sum(dim=-1) # [batch] neg_score = (user_emb * neg_emb).sum(dim=-1) # [batch] # BPR: maximize prob(pos > neg) loss = -F.logsigmoid(pos_score - neg_score).mean() # L2 regularization on embeddings (not propagated ones) reg = l2_reg * (user_emb.norm(2) + pos_emb.norm(2) + neg_emb.norm(2)) return loss + reg
LightGCN on a 10M user / 10M item graph with 1B interactions. Each node has millions of neighbors. A single forward pass would aggregate across all of them. This doesn't fit in GPU memory — not even close. You need a fundamentally different approach to scale.
Pinterest's PinSage (Hamilton et al., 2018) was the first GNN deployed at truly industrial scale: 3 billion pins, 18 billion edges. Its solution to the scalability problem has influenced every production GNN since.
Instead of aggregating over all neighbors, PinSage samples a fixed-size neighborhood using random walks with restarts. Start at a node, take random steps in the graph, occasionally restart. Nodes visited most often are the most "important" neighbors — they contribute most to the random walker's visits and thus to the local graph structure.
After sampling T top-ranked neighbors per node, PinSage applies a learned aggregation (dense weighted mean) followed by a nonlinearity. Because neighbors are sampled, each mini-batch only requires the local subgraph — not the full graph.
PinSage also introduced curriculum negative sampling: early in training, sample easy negatives (random items); later, sample hard negatives (items that score high for this user but aren't interactions). This mimics curriculum learning — start with easy examples, graduate to harder ones.
| Challenge | PinSage's Solution |
|---|---|
| Billions of neighbors per node | Random walk importance sampling (top-T) |
| Full graph doesn't fit in GPU | Mini-batch on sampled local subgraphs |
| Poor negative sample quality | Curriculum training: easy → hard negatives |
| Inference at 3B scale | MapReduce + cached embeddings |
A new user signs up for Spotify. They haven't listened to a single song. Their row in the interaction matrix is empty. Their node in the bipartite graph has no edges. LightGCN's message passing will propagate exactly nothing — their embedding after every layer is still the randomly initialized eu(0). They get random recommendations.
This is the cold start problem: new users and new items have no interaction history, so graph-based models have nothing to propagate. It's the single most painful limitation of collaborative filtering approaches.
Several strategies exist, each with tradeoffs:
| Strategy | How It Works | Limitation |
|---|---|---|
| Popularity baseline | Recommend globally popular items | Ignores user taste entirely |
| Onboarding survey | Ask user preferences explicitly | Friction, sparse signal |
| Side information | Use demographic/context features | Requires feature engineering |
| Exploration | Show diverse items, learn from clicks | Cold start items, bad initial UX |
| Transfer learning | Pre-train on another dataset | Domain shift |
New items are arguably harder than new users. A new movie has no watch history. But it has content: a plot summary, a genre, actors, director. Content-based features can bootstrap a new item's embedding from its attributes, before any user has interacted with it.
PinSage directly addresses cold start for new items: because it uses content features (image features extracted by a CNN, text features) as part of the node representation alongside the graph structure, new pins with no interactions still have a meaningful initial embedding. The graph refines it; content bootstraps it.
python def cold_start_embed(item_features, content_encoder, graph_refiner): # Step 1: encode content features (works with 0 interactions) e_content = content_encoder(item_features) # [d] # Step 2: if item has neighbors, refine with graph signal if has_interactions(item_id): e_graph = graph_refiner(item_id) # LightGCN embedding alpha = min(num_interactions / 100, 1.0) # blend ratio e_final = (1 - alpha) * e_content + alpha * e_graph else: e_final = e_content # pure content cold start return e_final
As LLMs became powerful, researchers asked the obvious question: can we just use a language model for recommendations? Feed the user's history as text, ask the LLM to recommend what comes next. No graph. No specialized architecture. Just prompting.
The surprising answer from 2023-2024 research: for most benchmarks, LLMs are worse at recommendation than LightGCN. Let's understand why — and where LLMs actually win.
Recommendation is fundamentally about collaborative signals: what do similar users like? This signal lives in the interaction graph. LLMs have no access to this graph at inference time — they can only use what's in the prompt, and the interaction history of millions of users doesn't fit in a context window.
The best production systems use both. A common pattern: use LLMs to generate rich item embeddings from text (title, description, reviews), then use GNNs on the interaction graph with those LLM embeddings as node features. LLMs handle cold start and content understanding; GNNs handle collaborative filtering and propagation.
The frontier is trying to teach LLMs to reason about graphs — describing user-item relationships in text and asking the LLM to make recommendations that account for graph structure. This is an active research area, but current LLMs still struggle with the kind of large-scale collaborative filtering that GNNs do effortlessly.
| Aspect | GNN (LightGCN) | LLM | Hybrid |
|---|---|---|---|
| Cold start | Poor | Good | Best |
| Collaborative filtering | Best | Poor | Good |
| Content understanding | Poor | Best | Best |
| Scalability | Best | Expensive | Moderate |
| Training data needed | Interactions | Text corpus | Both |
Recommender systems are a microcosm of the entire GNN ecosystem. Everything we've learned in this lecture connects directly to broader ideas in graph learning.
| Topic | Lesson | Connection |
|---|---|---|
| GNN fundamentals | Lecture 3: GNNs | Message passing, node embeddings |
| Link prediction | Lecture 8: Link Prediction | Scoring edges, training pairs |
| Knowledge graphs | Lecture 10: Knowledge Graphs | Bipartite-like structure, scoring |
| Relational data | Lecture 12: Relational Deep Learning | Databases as graphs |
"The purpose of recommender systems is not to show people what they want — it's to help them discover what they didn't know they wanted."
— paraphrasing the RecSys community's guiding principle