Wang, He, Wang, Feng, Chua (NUS + Tencent) — SIGIR 2019

NGCF: Neural Graph
Collaborative Filtering

Standard matrix factorization learns user and item embeddings from direct interactions alone — missing the rich patterns encoded in multi-hop connections. NGCF explicitly propagates collaborative signals (who-bought-what-with-whom) through a GNN on the user-item interaction graph, encoding high-order connectivity directly into the embeddings.

Prerequisites: matrix factorization + GCN basics
8
Chapters
4+
Simulations

Chapter 0: What Matrix Factorization Misses

Matrix factorization (MF) is the workhorse of collaborative filtering. It decomposes the user-item interaction matrix into two lower-rank matrices — a user embedding matrix and an item embedding matrix — then predicts the score for any user-item pair as their dot product.

MF learns from direct user-item interactions: Alice bought bread, so Alice's embedding should be close to the bread embedding. But it misses the richer pattern: Alice bought bread AND butter. Bob bought butter. Therefore Bob might like bread. This "second-order" signal — the path Alice → bread ← Betty → butter — is invisible to MF. It only models one-hop user-item relationships, not the multi-hop collaborative signals that make recommendation powerful.

The multi-hop signal isn't an edge case — it's the core of collaborative filtering. "People who bought this also bought" is a 2-hop pattern. "Users similar to you also liked" is a 2-hop pattern through user similarity. MF approximates these patterns implicitly and imperfectly in the embedding geometry. NGCF encodes them explicitly through graph message passing.

The interaction graph perspective: User-item interactions define a bipartite graph: users and items are nodes, interactions are edges. The collaborative signal lives in the paths of this graph. A GNN that propagates information along these paths naturally captures multi-hop patterns. The key question is HOW to propagate — and that's NGCF's contribution.
Multi-hop Collaborative Signal

MF only models direct edges (1-hop). Click to reveal 2-hop and 3-hop signals that MF misses.

What type of collaborative signal does matrix factorization fail to capture?

Chapter 1: Embedding Propagation Layer

NGCF's key idea is embedding propagation: instead of learning embeddings in isolation, refine them by passing messages along the interaction graph. Each propagation layer aggregates information from one additional hop, so after K layers, each user's embedding encodes information from their K-hop neighborhood in the interaction graph.

The propagation is bidirectional on the bipartite graph: users aggregate from items (what items do my neighbors like?), items aggregate from users (what kind of users buy me?). One layer of propagation:

e(l+1)u = LeakyReLU( W1 e(l)u + ∑i ∈ Nu 1√|Nu||Ni| ( W1 e(l)i + W2 (e(l)i ∘ e(l)u) ) )

There are three distinct terms here. First, W₁e_u: the user's own current embedding, projected. Second, W₁e_i summed over neighbors: what each neighboring item says about the user, projected by W₁. Third, W₂(e_i ∘ e_u): an interaction-aware message — the element-wise product of item and user embeddings, projected by W₂. This third term is NGCF's key novelty.

The element-wise product term: e_i ∘ e_u computes dimension-wise affinity — dimension d is large when both the user and item have large values in dimension d, meaning they "agree" on that dimension. This lets the message from item i to user u depend on how well-matched i and u currently are. NGCF calls this the "interaction-aware message" — it encodes the collaborative signal between specific user-item pairs, not just the item's properties in isolation.
What makes NGCF's message from item i to user u "interaction-aware" compared to standard GCN?

Chapter 2: Message Construction — Showcase

Let's dissect exactly what message item i sends to user u in NGCF. The message is:

mu←i = 1√|Nu||Ni| ( W1 ei + W2 (ei ∘ eu) )

Three components combined into one message:

  1. Normalization 1/√(|N_u||N_i|): Symmetric normalization by degree — prevents high-degree nodes from dominating. Same as LightGCN's normalization. Has a clean spectral interpretation: it's the symmetric normalized Laplacian applied to the bipartite graph.
  2. W₁e_i: The item's embedding projected to the aggregation space. This is the item's "intrinsic message" — what it says about itself, independent of who u is. This is what standard GCN uses.
  3. W₂(e_i ∘ e_u): The interaction signal. The element-wise product highlights dimensions where both i and u have large magnitude. W₂ then projects this agreement signal. This term only exists in NGCF and is zero in LightGCN.
Message Construction Showcase

Visualize the three message components for a specific user-item pair. Adjust alignment to see how the interaction term (e_i ∘ e_u) changes.

User-item alignment 50%
What does a large W₂(e_i ∘ e_u) signal indicate about the user-item pair (u, i)?

Chapter 3: Multi-Layer Propagation

NGCF stacks L propagation layers. Each layer aggregates one additional hop. After L layers, each user's embedding encodes the collaborative signal from all users and items within L hops in the interaction graph. The receptive field grows exponentially with layers.

Layer 0: Initial embeddings
e_u⁽⁰⁾, e_i⁽⁰⁾ — random initialization, trained by backprop
↓ propagation layer 1
Layer 1: 1-hop signal
User sees items they've interacted with. Item sees users who've bought it.
↓ propagation layer 2
Layer 2: 2-hop signal
User sees co-buyers of their items. "Users who bought X also bought Y."
↓ propagation layer 3
Layer 3: 3-hop signal
Item sees items that share similar users. Captures item-item similarity via user paths.

After L layers, concatenate all layer embeddings to form the final representation:

e*u = e(0)u || e(1)u || ... || e(L)u

The concatenation is important: it preserves all granularities of collaborative signal. Layer 0 is the intrinsic user representation. Layer L is the richest contextual representation. Using all layers together in the final prediction is consistently better than using only the last layer.

Concatenate vs. mean: NGCF concatenates all L+1 layer embeddings. LightGCN uses a mean (weighted sum). Concatenation doubles or triples the embedding dimension for L=2 or L=3 layers, which increases the parameter count in the final prediction layer but preserves finer-grained information. Empirically, both work similarly — the bigger gap is whether the propagation itself is interaction-aware (NGCF) or pure aggregation (LightGCN).
python
import torch
import torch.nn as nn

class NGCFLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)
        self.W2 = nn.Linear(dim, dim, bias=False)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, eu, ei_neighbors, norm_weights):
        # eu: [dim], ei_neighbors: [K, dim], norm_weights: [K]
        # W1 self-term
        self_msg = self.W1(eu)
        # Neighbor messages
        neighbor_sum = torch.zeros_like(eu)
        for k, (ei, w) in enumerate(zip(ei_neighbors, norm_weights)):
            msg = self.W1(ei) + self.W2(ei * eu)  # interaction-aware
            neighbor_sum += w * msg
        return self.act(self_msg + neighbor_sum)
Why does NGCF concatenate embeddings from all L layers rather than using only the final layer?

Chapter 4: BPR Training

NGCF uses the same BPR (Bayesian Personalized Ranking) loss as LightGCN. Given user u, positive item i (interacted), and negative item j (random non-interaction), train the model to rank i above j in u's preferences:

LBPR = − ∑(u,i,j) log σ( ŷui − ŷuj ) + λ||Θ||2

Here ŷui = eu*ᵀ · ei* is the predicted score using the final concatenated embeddings. λ regularizes all learnable parameters Θ = {E⁽⁰⁾, W₁⁽¹⁾, W₂⁽¹⁾, ..., W₁⁽ᴸ⁾, W₂⁽ᴸ⁾} — crucially, this includes the weight matrices at every layer.

The key training challenge for NGCF vs LightGCN: NGCF must regularize 2L weight matrices plus the initial embeddings. LightGCN only regularizes the initial embeddings. With L=3 layers, NGCF has 6 weight matrices (each d×d) plus embeddings. For d=64 and 100K users/items, this is about 6×64² = 25K parameters in weight matrices vs potentially millions in embeddings. The relative regularization between these groups is crucial.

λ in practice: The NGCF paper uses λ=1e-5 (weak regularization). With this setting, the weight matrices can overfit. LightGCN's analysis shows that after removing W, stronger regularization of E⁽⁰⁾ becomes more important — and achieves better generalization. This is a key part of why LightGCN's simpler objective produces better-regularized models on sparse data.
BPR Pairwise Training

For each training step, sample user u, positive item i, negative item j. The loss pushes score(u,i) above score(u,j) by margin. Click to simulate one step.

Why does NGCF need to regularize more parameters than LightGCN during BPR training?

Chapter 5: Results

NGCF is evaluated on three datasets: Gowalla (location check-ins, 29K users, 40K items, 1.3M interactions), Yelp-2018 (business reviews, 32K users, 38K items, 1.6M interactions), and Amazon-Book (13K users, 76K items, 294K interactions — sparser). Metrics: Recall@20 and NDCG@20.

ModelGowalla R@20Gowalla N@20Yelp R@20Amazon R@20
MF-BPR0.12910.11090.05790.0250
SpectralCF0.15300.12980.04260.0315
GC-MC0.13950.12040.04620.0288
NGCF (L=3)0.15700.13270.05790.0344

NGCF outperforms all GCN-based baselines on Gowalla and Amazon-Book. The improvement over MF-BPR on Amazon-Book (+37.6% Recall) is particularly striking — Amazon-Book is sparser, where higher-order signals matter more. When data is dense, 1-hop MF works reasonably well. When sparse, multi-hop signals are essential.

The ablation study: The paper also tests NGCF without the interaction term (removing W₂(e_i ∘ e_u)). Performance drops significantly — confirming the interaction-aware message is genuinely useful, not just extra parameters. This is NGCF's specific contribution: the interaction signal matters, not just any GCN on the interaction graph.

NGCF uses embedding dimension d=64, L=3 layers, batch size 1024, and node dropout (randomly zeroing entire node embeddings during training) plus message dropout (randomly zeroing individual messages). These two dropout variants are important for regularizing the complex model — without them, NGCF with L=3 tends to overfit on smaller datasets.

On which dataset does NGCF show the largest improvement over MF-BPR, and why?

Chapter 6: NGCF vs LightGCN

NGCF (2019) and LightGCN (2020) are natural counterparts — same task, same framework, diametrically opposite design philosophies. NGCF adds: feature transformation matrices W₁, W₂; nonlinear activation (LeakyReLU); interaction-aware messages. LightGCN removes: all weight matrices; all nonlinearities; self-connections. The performance reversal is sharp.

ComponentNGCFLightGCNWhy LightGCN removes it
Weight matrices W₁, W₂YesNoID embeddings don't need projection; adds overfitting
Nonlinear activationLeakyReLUNoneRanking is linear; nonlinearity hurts gradient flow
Interaction term e_i ∘ e_uYesNoRedundant; information already in embeddings
Self-connectionYes (via W₁e_u)NoLayer combination achieves same effect
Layer aggregationConcatenationUniform meanMean is simpler, comparable performance

LightGCN outperforms NGCF by ~17% on Gowalla. But this doesn't mean NGCF's components are universally wrong — just wrong for pure collaborative filtering with ID embeddings. If you have side information (item text, user demographics), the weight matrices make sense: they project the rich features into embedding space. NGCF's architecture would be appropriate for a setting where nodes have meaningful content features.

Historical context: NGCF was published first (SIGIR 2019). LightGCN (SIGIR 2020) was explicitly written as a critique of NGCF's architecture. The LightGCN paper includes an empirical ablation showing each NGCF component reduces performance. This kind of systematic critique-via-ablation is rare in ML papers and made LightGCN particularly influential. The lesson: just because a component is standard in one domain (features + nonlinearity in node classification) doesn't mean it transfers to another (collaborative filtering).
Parameter Count and Complexity Comparison
Layers L 3
LightGCN was designed as a critique of NGCF. What is its core argument?

Chapter 7: Connections

NGCF's lasting contribution isn't the specific architecture — LightGCN showed that simpler works better in the pure CF setting. It's the framework: model the interaction graph as a GNN, propagate collaborative signals through message passing, concatenate multi-layer representations. This framework is the foundation of every subsequent GNN-based recommender system.

PaperKey innovation over NGCFYear
LightGCNRemove W, σ — simpler is better for CF2020
SGL (Self-Supervised GCL)Add contrastive loss for sparse data2021
SimGCLUniform noise augmentation for contrastive2022
NCLNeighborhood-enriched contrastive learning2022
HMLETHybrid propagation (linear + nonlinear)2022

The open question that followed NGCF/LightGCN: what about nodes with few interactions? Sparse users and long-tail items are the hardest recommendation cases — and the 2021-2022 work adds contrastive self-supervised objectives (SGL, SimGCL, NCL) to provide extra training signal for sparse nodes, addressing the limitation that pure BPR on sparse data doesn't generalize well to cold-start or long-tail scenarios.

NGCF's timing was right: Even though LightGCN outperforms it, NGCF was the paper that opened the GNN-for-recommendation research area. It demonstrated convincingly that GNNs on interaction graphs outperform spectral and other GCN methods. The design decisions that LightGCN later removed were reasonable hypotheses at the time — and falsifying hypotheses is progress.

Related lessons

  • LightGCN — The simplified successor
  • GraphSAGE — Inductive GNN
  • LINE — Large-scale network embedding

Key papers

  • Wang et al., SIGIR 2019 (NGCF)
  • He et al., SIGIR 2020 (LightGCN)
  • Rendle et al., UAI 2009 (BPR)
  • Wu et al., SIGIR 2021 (SGL)
"We develop a new framework NGCF for CF that explicitly encodes the collaborative signal in the form of high-order connectivities."
— Wang et al. (2019)