NGCF — Veanors

Chapter 0: What Matrix Factorization Misses

Matrix factorization (MF) is the workhorse of collaborative filtering. It decomposes the user-item interaction matrix into two lower-rank matrices — a user embedding matrix and an item embedding matrix — then predicts the score for any user-item pair as their dot product.

MF learns from direct user-item interactions: Alice bought bread, so Alice's embedding should be close to the bread embedding. But it misses the richer pattern: Alice bought bread AND butter. Bob bought butter. Therefore Bob might like bread. This "second-order" signal — the path Alice → bread ← Betty → butter — is invisible to MF. It only models one-hop user-item relationships, not the multi-hop collaborative signals that make recommendation powerful.

The multi-hop signal isn't an edge case — it's the core of collaborative filtering. "People who bought this also bought" is a 2-hop pattern. "Users similar to you also liked" is a 2-hop pattern through user similarity. MF approximates these patterns implicitly and imperfectly in the embedding geometry. NGCF encodes them explicitly through graph message passing.

The interaction graph perspective: User-item interactions define a bipartite graph: users and items are nodes, interactions are edges. The collaborative signal lives in the paths of this graph. A GNN that propagates information along these paths naturally captures multi-hop patterns. The key question is HOW to propagate — and that's NGCF's contribution.

Multi-hop Collaborative Signal

MF only models direct edges (1-hop). Click to reveal 2-hop and 3-hop signals that MF misses.

What type of collaborative signal does matrix factorization fail to capture?

Multi-hop paths through the interaction graph — e.g., user A liked item 1, item 1 was also liked by user B, user B also liked item 2. MF only models direct user-item edges The absolute popularity of items — MF ignores how many users have interacted with each item Temporal patterns — MF cannot model the order in which users interact with items

Chapter 1: Embedding Propagation Layer

NGCF's key idea is embedding propagation: instead of learning embeddings in isolation, refine them by passing messages along the interaction graph. Each propagation layer aggregates information from one additional hop, so after K layers, each user's embedding encodes information from their K-hop neighborhood in the interaction graph.

The propagation is bidirectional on the bipartite graph: users aggregate from items (what items do my neighbors like?), items aggregate from users (what kind of users buy me?). One layer of propagation:

e^(l+1)_u = LeakyReLU( W₁ e^(l)_u + ∑_{i ∈ N_u} ¹⁄_{√|N_u||N_i|} ( W₁ e^(l)_i + W₂ (e^(l)_i ∘ e^(l)_u) ) )

There are three distinct terms here. First, W₁e_u: the user's own current embedding, projected. Second, W₁e_i summed over neighbors: what each neighboring item says about the user, projected by W₁. Third, W₂(e_i ∘ e_u): an interaction-aware message — the element-wise product of item and user embeddings, projected by W₂. This third term is NGCF's key novelty.

The element-wise product term: e_i ∘ e_u computes dimension-wise affinity — dimension d is large when both the user and item have large values in dimension d, meaning they "agree" on that dimension. This lets the message from item i to user u depend on how well-matched i and u currently are. NGCF calls this the "interaction-aware message" — it encodes the collaborative signal between specific user-item pairs, not just the item's properties in isolation.

What makes NGCF's message from item i to user u "interaction-aware" compared to standard GCN?

It includes the element-wise product e_i ∘ e_u — the message from item i depends on how well-matched user u's current embedding is with item i's embedding It uses an attention mechanism to weight each neighbor's contribution by learned importance scores It uses two separate weight matrices W₁ and W₂ instead of one, doubling the expressiveness

Chapter 2: Message Construction — Showcase

Let's dissect exactly what message item i sends to user u in NGCF. The message is:

m_u←i = ¹⁄_{√|N_u||N_i|} ( W₁ e_i + W₂ (e_i ∘ e_u) )

Three components combined into one message:

Normalization 1/√(|N_u||N_i|): Symmetric normalization by degree — prevents high-degree nodes from dominating. Same as LightGCN's normalization. Has a clean spectral interpretation: it's the symmetric normalized Laplacian applied to the bipartite graph.
W₁e_i: The item's embedding projected to the aggregation space. This is the item's "intrinsic message" — what it says about itself, independent of who u is. This is what standard GCN uses.
W₂(e_i ∘ e_u): The interaction signal. The element-wise product highlights dimensions where both i and u have large magnitude. W₂ then projects this agreement signal. This term only exists in NGCF and is zero in LightGCN.

Message Construction Showcase

Visualize the three message components for a specific user-item pair. Adjust alignment to see how the interaction term (e_i ∘ e_u) changes.

User-item alignment 50%

What does a large W₂(e_i ∘ e_u) signal indicate about the user-item pair (u, i)?

The user and item embeddings have large values in the same dimensions — they are well-aligned, and the model currently thinks they are a strong match Item i has many interactions (high degree), so its message is amplified The gradient magnitude for this pair is large, so it will receive a large update

Chapter 3: Multi-Layer Propagation

NGCF stacks L propagation layers. Each layer aggregates one additional hop. After L layers, each user's embedding encodes the collaborative signal from all users and items within L hops in the interaction graph. The receptive field grows exponentially with layers.

Layer 0: Initial embeddings

e_u⁽⁰⁾, e_i⁽⁰⁾ — random initialization, trained by backprop

↓ propagation layer 1

Layer 1: 1-hop signal

User sees items they've interacted with. Item sees users who've bought it.

↓ propagation layer 2

Layer 2: 2-hop signal

User sees co-buyers of their items. "Users who bought X also bought Y."

↓ propagation layer 3

Layer 3: 3-hop signal

Item sees items that share similar users. Captures item-item similarity via user paths.

After L layers, concatenate all layer embeddings to form the final representation:

e^*_u = e⁽⁰⁾_u || e⁽¹⁾_u || ... || e^(L)_u

The concatenation is important: it preserves all granularities of collaborative signal. Layer 0 is the intrinsic user representation. Layer L is the richest contextual representation. Using all layers together in the final prediction is consistently better than using only the last layer.

Concatenate vs. mean: NGCF concatenates all L+1 layer embeddings. LightGCN uses a mean (weighted sum). Concatenation doubles or triples the embedding dimension for L=2 or L=3 layers, which increases the parameter count in the final prediction layer but preserves finer-grained information. Empirically, both work similarly — the bigger gap is whether the propagation itself is interaction-aware (NGCF) or pure aggregation (LightGCN).

python
import torch
import torch.nn as nn

class NGCFLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)
        self.W2 = nn.Linear(dim, dim, bias=False)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, eu, ei_neighbors, norm_weights):
        # eu: [dim], ei_neighbors: [K, dim], norm_weights: [K]
        # W1 self-term
        self_msg = self.W1(eu)
        # Neighbor messages
        neighbor_sum = torch.zeros_like(eu)
        for k, (ei, w) in enumerate(zip(ei_neighbors, norm_weights)):
            msg = self.W1(ei) + self.W2(ei * eu)  # interaction-aware
            neighbor_sum += w * msg
        return self.act(self_msg + neighbor_sum)

Why does NGCF concatenate embeddings from all L layers rather than using only the final layer?

Each layer captures a different granularity of collaborative signal — 1-hop captures direct interaction patterns, deeper layers capture higher-order patterns. Concatenating all preserves this multi-granular information The final layer embedding has vanishing gradients, so earlier layers are needed to provide stable training signal Concatenation is required for the BPR loss function to work with multi-layer models

Chapter 4: BPR Training

NGCF uses the same BPR (Bayesian Personalized Ranking) loss as LightGCN. Given user u, positive item i (interacted), and negative item j (random non-interaction), train the model to rank i above j in u's preferences:

L_BPR = − ∑_(u,i,j) log σ( ŷ_ui − ŷ_uj ) + λ||Θ||²

Here ŷ_ui = e_u*ᵀ · e_i* is the predicted score using the final concatenated embeddings. λ regularizes all learnable parameters Θ = {E⁽⁰⁾, W₁⁽¹⁾, W₂⁽¹⁾, ..., W₁⁽ᴸ⁾, W₂⁽ᴸ⁾} — crucially, this includes the weight matrices at every layer.

The key training challenge for NGCF vs LightGCN: NGCF must regularize 2L weight matrices plus the initial embeddings. LightGCN only regularizes the initial embeddings. With L=3 layers, NGCF has 6 weight matrices (each d×d) plus embeddings. For d=64 and 100K users/items, this is about 6×64² = 25K parameters in weight matrices vs potentially millions in embeddings. The relative regularization between these groups is crucial.

λ in practice: The NGCF paper uses λ=1e-5 (weak regularization). With this setting, the weight matrices can overfit. LightGCN's analysis shows that after removing W, stronger regularization of E⁽⁰⁾ becomes more important — and achieves better generalization. This is a key part of why LightGCN's simpler objective produces better-regularized models on sparse data.

BPR Pairwise Training

For each training step, sample user u, positive item i, negative item j. The loss pushes score(u,i) above score(u,j) by margin. Click to simulate one step.

Why does NGCF need to regularize more parameters than LightGCN during BPR training?

NGCF has weight matrices W₁ and W₂ at each of L layers — 2L matrices that must be regularized in addition to the initial embeddings. LightGCN has no weight matrices, so only embeddings need regularization NGCF uses concatenation instead of mean pooling, creating a longer final embedding that needs stronger L2 penalty The interaction-aware term e_i ∘ e_u creates numerical instability that requires explicit regularization

Chapter 5: Results

NGCF is evaluated on three datasets: Gowalla (location check-ins, 29K users, 40K items, 1.3M interactions), Yelp-2018 (business reviews, 32K users, 38K items, 1.6M interactions), and Amazon-Book (13K users, 76K items, 294K interactions — sparser). Metrics: Recall@20 and NDCG@20.

Model	Gowalla R@20	Gowalla N@20	Yelp R@20	Amazon R@20
MF-BPR	0.1291	0.1109	0.0579	0.0250
SpectralCF	0.1530	0.1298	0.0426	0.0315
GC-MC	0.1395	0.1204	0.0462	0.0288
NGCF (L=3)	0.1570	0.1327	0.0579	0.0344

NGCF outperforms all GCN-based baselines on Gowalla and Amazon-Book. The improvement over MF-BPR on Amazon-Book (+37.6% Recall) is particularly striking — Amazon-Book is sparser, where higher-order signals matter more. When data is dense, 1-hop MF works reasonably well. When sparse, multi-hop signals are essential.

The ablation study: The paper also tests NGCF without the interaction term (removing W₂(e_i ∘ e_u)). Performance drops significantly — confirming the interaction-aware message is genuinely useful, not just extra parameters. This is NGCF's specific contribution: the interaction signal matters, not just any GCN on the interaction graph.

NGCF uses embedding dimension d=64, L=3 layers, batch size 1024, and node dropout (randomly zeroing entire node embeddings during training) plus message dropout (randomly zeroing individual messages). These two dropout variants are important for regularizing the complex model — without them, NGCF with L=3 tends to overfit on smaller datasets.

On which dataset does NGCF show the largest improvement over MF-BPR, and why?

Amazon-Book (sparser dataset) — with fewer direct interactions per user, multi-hop signals matter more since the model needs to generalize from less direct evidence Gowalla (larger dataset) — more training data allows NGCF's complex architecture to fit the interaction patterns Yelp-2018 — business reviews have clearer semantic patterns that graph propagation can capture

Chapter 6: NGCF vs LightGCN

NGCF (2019) and LightGCN (2020) are natural counterparts — same task, same framework, diametrically opposite design philosophies. NGCF adds: feature transformation matrices W₁, W₂; nonlinear activation (LeakyReLU); interaction-aware messages. LightGCN removes: all weight matrices; all nonlinearities; self-connections. The performance reversal is sharp.

Component	NGCF	LightGCN	Why LightGCN removes it
Weight matrices W₁, W₂	Yes	No	ID embeddings don't need projection; adds overfitting
Nonlinear activation	LeakyReLU	None	Ranking is linear; nonlinearity hurts gradient flow
Interaction term e_i ∘ e_u	Yes	No	Redundant; information already in embeddings
Self-connection	Yes (via W₁e_u)	No	Layer combination achieves same effect
Layer aggregation	Concatenation	Uniform mean	Mean is simpler, comparable performance

LightGCN outperforms NGCF by ~17% on Gowalla. But this doesn't mean NGCF's components are universally wrong — just wrong for pure collaborative filtering with ID embeddings. If you have side information (item text, user demographics), the weight matrices make sense: they project the rich features into embedding space. NGCF's architecture would be appropriate for a setting where nodes have meaningful content features.

Historical context: NGCF was published first (SIGIR 2019). LightGCN (SIGIR 2020) was explicitly written as a critique of NGCF's architecture. The LightGCN paper includes an empirical ablation showing each NGCF component reduces performance. This kind of systematic critique-via-ablation is rare in ML papers and made LightGCN particularly influential. The lesson: just because a component is standard in one domain (features + nonlinearity in node classification) doesn't mean it transfers to another (collaborative filtering).

Parameter Count and Complexity Comparison

Layers L 3

LightGCN was designed as a critique of NGCF. What is its core argument?

Each component NGCF adds — weight matrices, nonlinearity, interaction term — hurts recommendation performance because they're designed for feature-rich tasks, not ID-embedding collaborative filtering NGCF is computationally too slow for large-scale deployment, so a simpler model is needed for practical use The BPR training objective in NGCF is flawed; LightGCN proposes a better loss function

Chapter 7: Connections

NGCF's lasting contribution isn't the specific architecture — LightGCN showed that simpler works better in the pure CF setting. It's the framework: model the interaction graph as a GNN, propagate collaborative signals through message passing, concatenate multi-layer representations. This framework is the foundation of every subsequent GNN-based recommender system.

Paper	Key innovation over NGCF	Year
LightGCN	Remove W, σ — simpler is better for CF	2020
SGL (Self-Supervised GCL)	Add contrastive loss for sparse data	2021
SimGCL	Uniform noise augmentation for contrastive	2022
NCL	Neighborhood-enriched contrastive learning	2022
HMLET	Hybrid propagation (linear + nonlinear)	2022

The open question that followed NGCF/LightGCN: what about nodes with few interactions? Sparse users and long-tail items are the hardest recommendation cases — and the 2021-2022 work adds contrastive self-supervised objectives (SGL, SimGCL, NCL) to provide extra training signal for sparse nodes, addressing the limitation that pure BPR on sparse data doesn't generalize well to cold-start or long-tail scenarios.

NGCF's timing was right: Even though LightGCN outperforms it, NGCF was the paper that opened the GNN-for-recommendation research area. It demonstrated convincingly that GNNs on interaction graphs outperform spectral and other GCN methods. The design decisions that LightGCN later removed were reasonable hypotheses at the time — and falsifying hypotheses is progress.

Related lessons

LightGCN — The simplified successor
GraphSAGE — Inductive GNN
LINE — Large-scale network embedding

Key papers

Wang et al., SIGIR 2019 (NGCF)
He et al., SIGIR 2020 (LightGCN)
Rendle et al., UAI 2009 (BPR)
Wu et al., SIGIR 2021 (SGL)

"We develop a new framework NGCF for CF that explicitly encodes the collaborative signal in the form of high-order connectivities."
— Wang et al. (2019)

NGCF: Neural GraphCollaborative Filtering