Recommender Systems

The engines behind Netflix, Amazon, YouTube, Spotify — choosing what to show you from billions of items. Matrix factorization, embeddings, the two-tower retrieval model, the retrieval-then-ranking funnel, and DLRM.

Prerequisites: an embedding is a learned vector for an item, and a dot product measures similarity. That’s it.

Chapter 0: The Empty Matrix

Netflix has hundreds of millions of users and tens of thousands of titles. Amazon lists hundreds of millions of products. The core task of a recommender system: for each user, pick the handful of items they’re most likely to want, out of everything available. Get it right and engagement, sales, and satisfaction soar; get it wrong and users leave. It is, quietly, one of the most commercially important applications of machine learning on earth.

The defining difficulty is sparsity. Imagine the giant matrix of users×items, where each cell is “did user U like item I?” The overwhelming majority of cells are empty — any one user has interacted with a vanishing fraction of all items. A typical user has rated maybe 0.01% of the catalog. So the real task is to fill in the missing cells: predict the preference a user would have for items they’ve never seen, from the sparse pattern of what everyone has done. This lesson builds the machinery, from matrix factorization to the two-tower models powering today’s feeds.
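
To make the sparsity concrete, here is a minimal sketch that builds a toy interaction set and measures how full the matrix is. The sizes (1,000 users, 5,000 items, 5 interactions per user) are assumptions chosen for illustration, not figures from any real service.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 1_000, 5_000      # toy sizes, assumed for illustration
interactions_per_user = 5            # assumed: each user touched 5 distinct items

# Store only the observed cells as (user, item) pairs instead of a dense matrix.
observed = set()
for u in range(n_users):
    items = rng.choice(n_items, size=interactions_per_user, replace=False)
    observed.update((u, int(i)) for i in items)

# Density = filled cells / all cells. Everything else is a blank to predict.
density = len(observed) / (n_users * n_items)
print(f"filled cells: {len(observed)}  density: {density:.4%}")
```

Even in this small toy, only one cell in a thousand is filled. Growing the catalog while each user's activity stays fixed pushes the density lower still.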

The trap: “just recommend the most popular items.” Popularity is a baseline, but it’s the opposite of personalization — everyone gets the same bland feed, and you never surface the niche item a specific user would love. The whole value is in personalization: predicting what this user wants, which means learning from the sparse, individual pattern of interactions, not just the crowd average.

The user×item matrix is mostly empty

Rows are users, columns are items, filled cells are observed interactions. Drag the catalog size: the bigger it gets, the emptier the matrix — that sea of blanks is what we must predict.

What is the defining challenge of recommendation?

Items are too large to store
Sparsity: the user×item matrix is almost entirely empty, so we must predict the missing preferences
Users change too fast

Chapter 1: Collaborative Filtering

The foundational idea is collaborative filtering: recommend based on the collective behavior of many users. The intuition is “users like you also liked…” If you and I have rated many movies similarly, and you loved one I haven’t seen, that’s probably a good recommendation for me. Crucially, this uses only the interaction pattern — who liked what — not any content features of the items. The wisdom is in the crowd’s collective taste.

Early approaches were memory-based: find users (or items) most similar to you by comparing rating vectors, then borrow their preferences. “Find my 50 nearest-neighbor users; recommend what they liked.” Simple and interpretable, but it scales poorly and handles sparsity badly (with so few overlapping ratings, similarity is noisy). The breakthrough was model-based collaborative filtering — learning a compact model of the whole interaction matrix — and its most famous form is matrix factorization, which we build next. Either way, the signal is the same: patterns in who interacts with what, no content needed.
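
The memory-based idea can be sketched in a few lines. This is one simple variant under stated assumptions: a toy rating matrix where 0 means "not rated", cosine similarity between full rating vectors, and a similarity-weighted average over the nearest neighbors. The function names `cosine_sim` and `predict` are made up for this example.

```python
import numpy as np

# Toy ratings: rows are users, columns are items, 0 means "not rated" (assumed encoding).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 5, 1],
    [1, 1, 1, 5],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def predict(R, user, item, k=2):
    """Similarity-weighted average over the k most similar users who rated `item`."""
    sims = [(cosine_sim(R[user], R[v]), v)
            for v in range(len(R)) if v != user and R[v, item] > 0]
    top = sorted(sims, reverse=True)[:k]
    total = sum(s for s, _ in top)
    if total == 0:
        return 0.0
    return sum(s * R[v, item] for s, v in top) / total

# User 0 has not rated item 2; user 1 (similar taste) gave it 5, user 2 (different taste) gave it 1.
print(predict(R, user=0, item=2))
```

The prediction lands closer to the similar user's rating than to the dissimilar user's. With only a handful of overlapping ratings per pair, those similarity weights become noisy, which is the weakness the text describes.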

What signal does collaborative filtering use?

The text descriptions of items
The pattern of who interacted with what (“users like you also liked…”), not item content
Only each user’s own demographics

Chapter 2: Matrix Factorization

Here’s the elegant model that defined the Netflix Prize era. Take the sparse user×item matrix and factorize it into two smaller matrices: a user matrix (each user a short vector) and an item matrix (each item a short vector). The predicted preference of a user for an item is the dot product of their two vectors. Train the vectors so that, on the observed cells, the dot products match the known ratings. Then for any unobserved cell, the dot product is your prediction.

The short vectors are called latent factors, and the magic is what they learn. Even though you never tell it what the dimensions mean, the model discovers them: a dimension might come to represent “how action-y vs. romantic,” another “mainstream vs. indie.” A user’s vector says how much they like each latent trait; an item’s vector says how much it has each trait; the dot product lines them up. With, say, 50-dimensional vectors, you compress a matrix of billions of cells into a few million numbers — and crucially, you can now fill in any missing cell. That compression-and-completion is the whole trick.
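
As a rough sketch of how such vectors might be trained, the code below fits user and item vectors by stochastic gradient descent on the observed cells only, with a small L2 penalty. The toy ratings, learning rate, and regularization strength are all assumptions for illustration; production systems often use other solvers, such as alternating least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed cells only: (user, item, rating). Toy data, assumed for illustration.
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 1.0), (2, 2, 5.0), (3, 0, 5.0), (3, 2, 1.0)]
n_users, n_items, k = 4, 3, 2                 # k = number of latent factors

P = 0.1 * rng.standard_normal((n_users, k))   # user vectors
Q = 0.1 * rng.standard_normal((n_items, k))   # item vectors
lr, reg = 0.05, 0.01                          # assumed hyperparameters

def rmse():
    """Root-mean-square error of the dot products on the observed cells."""
    return float(np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in ratings])))

rmse_before = rmse()
for epoch in range(500):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                 # how far the dot product is from the rating
        pu = P[u].copy()                      # keep the old user vector for the item update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])
rmse_after = rmse()

print(f"RMSE on observed cells: {rmse_before:.3f} -> {rmse_after:.3f}")
print("prediction for unobserved cell (user 3, item 1):", round(float(P[3] @ Q[1]), 2))
```

Only the eight observed cells ever enter the loss, yet every one of the twelve cells has a prediction afterwards, because each is just a dot product of two learned vectors.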

Worked example by hand

A user’s 2-D latent vector is [0.9, 0.1] (loves action, barely cares for romance). Movie A is [1.0, 0.0] (pure action); Movie B is [0.0, 1.0] (pure romance). Predicted preference for A = 0.9×1.0 + 0.1×0.0 = 0.9. For B = 0.9×0.0 + 0.1×1.0 = 0.1. The model correctly predicts this user will love A and not B, purely from the geometry of the vectors, which were learned from everyone’s ratings. No one labeled “action” or “romance”; the factors emerged.
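
The same arithmetic, written as code so it can be checked directly. The variable names are illustrative only.

```python
import numpy as np

user = np.array([0.9, 0.1])      # [action, romance] affinity
movie_a = np.array([1.0, 0.0])   # pure action
movie_b = np.array([0.0, 1.0])   # pure romance

# Predicted preference is the dot product of the user and item vectors.
score_a = float(user @ movie_a)
score_b = float(user @ movie_b)
print(score_a, score_b)
```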

Factorize the matrix into user × item vectors

The big sparse matrix (left) is approximated by a thin user matrix times a thin item matrix (right). Each user and item becomes a short latent vector; their dot product predicts any cell. Drag the number of latent factors.

In matrix factorization, how is a user’s preference for an item predicted?

By the item’s overall popularity
The dot product of the user’s latent vector and the item’s latent vector
By counting shared words in descriptions

Chapter 3: Embeddings — users & items in one space

Matrix factorization is really about embeddings — the same idea you’ve seen for words and images, now for users and items. Every user is a vector; every item is a vector; and they live in the same space, arranged so that a user sits near the items they’ll like. Recommendation becomes geometry: to recommend for a user, find the items nearest their vector (highest dot product). Two items with similar audiences end up near each other; two users with similar tastes end up near each other.

This embedding view is liberating. The vectors don’t have to come from factorizing a matrix — they can be computed by a neural network from rich features (a user’s history, age, device; an item’s genre, text, image). That lets you handle cold start (a brand-new item with no interactions still gets a vector from its features) and fuse many signals. Once both users and items are vectors in a shared space with dot-product affinity, the entire modern recsys toolkit follows: build better towers to produce the embeddings, and use fast nearest-neighbor search to retrieve at scale. That’s the next two chapters.
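
A small sketch of the cold-start point, under loud assumptions: the item encoder here is a single random linear layer with a tanh, standing in for a trained network, and the feature vectors are random numbers. The point is only that an item with zero interactions still gets a vector, and is then scored by dot product like any other item.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, dim = 6, 4
W_item = rng.standard_normal((n_features, dim))   # stand-in for a trained item encoder

def embed_item(features):
    """Map content features to an embedding; no interaction history is needed."""
    return np.tanh(features @ W_item)

catalog = embed_item(rng.random((100, n_features)))   # (100, dim) existing items

new_item = embed_item(rng.random(n_features))         # brand-new item, zero clicks
catalog = np.vstack([catalog, new_item])              # it joins the same space

def recommend(user_vec, item_vecs, k=5):
    """Indices of the k items with the highest dot product, best first."""
    scores = item_vecs @ user_vec
    return np.argsort(-scores)[:k]

user_vec = rng.standard_normal(dim)
top = recommend(user_vec, catalog)
print(top)
```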

Users and items in a shared embedding space

Items (dots) and a user (star) in one space. The items nearest the user — highest dot product — are the recommendations (highlighted). Drag the user around and watch the recommended items change.

In the embedding view of recommendation, how do you recommend for a user?

Pick random items
Find the items nearest the user’s vector (highest dot product) in the shared embedding space
Sort items alphabetically

Chapter 4: The Two-Tower Model

The workhorse of modern recommendation is the two-tower (dual-encoder) architecture. One neural network — the user tower — takes all the user’s features (history, context, profile) and produces a user embedding. A separate network — the item tower — takes an item’s features and produces an item embedding. The predicted affinity is the dot product of the two embeddings. Train it so that items a user actually engaged with score higher than items they didn’t (often with a contrastive / in-batch-negatives loss).
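
One common form of the in-batch-negatives objective is a softmax cross-entropy over the batch's score matrix. The sketch below assumes that row i of the user batch engaged with row i of the item batch, so every other item in the batch serves as a negative for that user. The function name is made up for this example, and real systems usually add refinements such as temperature scaling or corrections for popular items.

```python
import numpy as np

def in_batch_softmax_loss(user_emb, item_emb):
    """Row i of user_emb engaged with row i of item_emb; other rows act as negatives."""
    logits = user_emb @ item_emb.T                         # (B, B) all-pairs scores
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))             # positives sit on the diagonal

rng = np.random.default_rng(0)
items = rng.standard_normal((8, 16))
aligned_users = items + 0.1 * rng.standard_normal((8, 16))   # users close to their own item
random_users = rng.standard_normal((8, 16))                  # users unrelated to any item

print(in_batch_softmax_loss(aligned_users, items))   # lower: positives outscore negatives
print(in_batch_softmax_loss(random_users, items))    # higher: positives are not favored
```

Minimizing this loss pushes each user embedding toward the item it engaged with and away from the other items in the batch, which is the "engaged items score higher" goal stated above.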

Why two separate towers, never mixing until the final dot product? Because that separation is what makes it scale. Since item embeddings don’t depend on the user, you can precompute every item’s embedding once and store them. Then, at serving time, you compute just the user embedding and find its nearest item embeddings with approximate nearest-neighbor search (ANN) — retrieving the top items from billions in milliseconds, without scoring each one. A model that mixed user and item early (a “cross” network) would be more accurate per pair but couldn’t precompute — you’d have to run it for every user×item pair, which is hopeless at scale. The two-tower’s late interaction is the deliberate price for retrievability.
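
The offline/online split can be sketched as follows. Both towers here are single random layers standing in for trained networks, and the search is brute force; in a real system an ANN index would replace the full scan. All names and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
W_user = rng.standard_normal((5, dim))    # stand-in weights for a trained user tower
W_item = rng.standard_normal((3, dim))    # stand-in weights for a trained item tower

def user_tower(user_features):
    return np.tanh(user_features @ W_user)

def item_tower(item_features):
    return np.tanh(item_features @ W_item)

# Offline: embed the whole catalog once and store the result.
item_features = rng.random((10_000, 3))
item_index = item_tower(item_features)            # (10_000, dim), reused for every user

def serve(user_features, k=10):
    """Online: one user-tower pass, then a top-k search over the stored item vectors."""
    u = user_tower(user_features)
    scores = item_index @ u                       # brute force; an ANN index would replace this
    top = np.argpartition(-scores, k)[:k]         # unordered top-k
    return top[np.argsort(-scores[top])]          # sorted best first

recs = serve(rng.random(5))
print(recs)
```

Note that `item_tower` never runs at serving time. That is only possible because nothing inside it depends on the user.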

user features (history, context)
↓ user tower
user embedding (computed at serving time)
· dot product ·
item embedding (precomputed for all items → ANN index)

Two towers meet at a dot product

User features go up one tower, item features up the other; they only meet at the final dot product. Because item embeddings don’t depend on the user, they’re precomputed — toggle to see the serving-time shortcut.

Why does the two-tower model keep user and item towers separate until the final dot product?

To save parameters
So item embeddings (independent of the user) can be precomputed and retrieved by fast nearest-neighbor search from billions of items
Because users and items can’t share a space

Chapter 5: Retrieve, Then Rank

Production recommenders use a two-stage funnel, because no single model can both scan billions of items and score each one precisely. Stage 1 — retrieval (candidate generation): a cheap model (the two-tower, via ANN) narrows billions of items down to a few hundred plausible candidates per user, fast. Stage 2 — ranking: a heavy, accurate model scores just those few hundred candidates in detail — using rich features and full user×item interactions — to produce the final ordered list you see.

The logic is pure computational economy: spend a tiny amount per item across billions (retrieval), then spend a lot per item across only hundreds (ranking). Retrieval optimizes recall (don’t miss good items); ranking optimizes precision (get the order right, often by predicting click-through or watch-time). Some systems add a third re-ranking stage for diversity, freshness, and business rules. This funnel, cheap-and-wide then expensive-and-narrow, is the standard architecture of large-scale feed, search, and ad systems: the compute budget is divided so that the expensive model only ever sees items that have already survived a cheap filter.
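
A toy version of the funnel, to show where the cost goes. The retrieval stage is a single dot product per item; the "heavy" ranker is a hypothetical stand-in that adds a few cross terms and side features, not a real ranking model. Catalog size, candidate count, and all weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 50_000, 16
item_vecs = rng.standard_normal((n_items, dim))
item_side = rng.random((n_items, 4))          # extra per-item features only the ranker reads

def retrieve(user_vec, n_candidates=300):
    """Stage 1: one cheap dot product per item, keep the best few hundred."""
    scores = item_vecs @ user_vec
    return np.argpartition(-scores, n_candidates)[:n_candidates]

def rank(user_vec, candidates, w_side):
    """Stage 2: a costlier score per candidate (hypothetical stand-in for a heavy model)."""
    base = item_vecs[candidates] @ user_vec
    cross = (item_vecs[candidates] * user_vec) ** 2      # toy user-by-item cross features
    score = base + 0.1 * cross.sum(axis=1) + item_side[candidates] @ w_side
    return candidates[np.argsort(-score)]                # best first

user_vec = rng.standard_normal(dim)
candidates = retrieve(user_vec)                          # 50,000 -> 300
final = rank(user_vec, candidates, w_side=rng.standard_normal(4))[:10]   # 300 -> 10
print(final)
```

The ranker touches 300 rows instead of 50,000, so it can afford to do more work per row. Anything retrieval misses can never be recovered by ranking, which is why the first stage is tuned for recall.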

The retrieval → ranking funnel

Billions of items → retrieval (two-tower + ANN, cheap) narrows to hundreds → ranking (heavy model, precise) orders them → the few you see. Drag to watch the funnel narrow; note the cost-per-item flips as the count shrinks.

Why do production recommenders split into retrieval then ranking?

To use two teams
No single model can scan billions AND score each precisely; retrieval cheaply narrows to hundreds, then ranking scores those precisely
Ranking is optional

Chapter 6: DLRM & Feature Interactions

The ranking stage is where the heavy models live, and Meta’s DLRM (Deep Learning Recommendation Model) is the canonical example. Its inputs are a mix of categorical features (user ID, item ID, country, device), each mapped through a big embedding table to a vector, and continuous features (age, price), which pass through a bottom MLP to produce one more vector of the same width. The key step is to model feature interactions explicitly: compute the pairwise dot products between all of these vectors, so that combinations such as “this user with this category” or “this category on this device” get their own signal, which often matters more than any feature alone. Those interaction terms are concatenated with the dense vector, and a final top MLP predicts the score (e.g. probability of a click).
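
A minimal forward pass in the DLRM style, as a sketch rather than Meta's implementation. The table sizes, layer widths, and random weights are assumptions; the structure (embedding lookups, a bottom MLP for dense features, pairwise dot products, a top MLP with a sigmoid) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
table_sizes = {"user_id": 1000, "item_id": 500, "device": 5}   # toy cardinalities
tables = {name: 0.1 * rng.standard_normal((n, dim)) for name, n in table_sizes.items()}

def mlp(x, weights):
    """ReLU on hidden layers, linear output."""
    for W in weights[:-1]:
        x = np.maximum(x @ W, 0.0)
    return x @ weights[-1]

n_dense = 2                                   # e.g. age, price
bottom = [rng.standard_normal((n_dense, 8)), rng.standard_normal((8, dim))]
n_vecs = len(tables) + 1                      # one vector per table plus the dense vector
n_pairs = n_vecs * (n_vecs - 1) // 2
top = [rng.standard_normal((dim + n_pairs, 8)), rng.standard_normal((8, 1))]

def dlrm_forward(categorical, dense):
    dense_vec = mlp(dense, bottom)                                  # (dim,)
    vecs = np.stack([dense_vec] + [tables[k][categorical[k]] for k in tables])
    gram = vecs @ vecs.T                                            # all pairwise dot products
    iu = np.triu_indices(n_vecs, k=1)                               # keep each pair once
    features = np.concatenate([dense_vec, gram[iu]])
    logit = mlp(features, top)[0]
    return float(1.0 / (1.0 + np.exp(-logit)))                      # predicted click probability

p = dlrm_forward({"user_id": 42, "item_id": 7, "device": 1}, np.array([0.3, 0.5]))
print(p)
```

With four vectors there are six pairwise interaction terms. The count grows quadratically with the number of features, but each term is a single cheap dot product.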

Two things make DLRM distinctive. First, those embedding tables are enormous — with billions of unique IDs, the tables can reach hundreds of gigabytes, far larger than the rest of the model. In production recsys, the embedding tables, not the MLPs, are the memory and infrastructure bottleneck (whole systems exist just to shard them across machines). Second, the explicit feature crosses capture the combinatorial signals (user×item×context) that drive click-through prediction. DLRM and its kin power the ranking that decides your feed, your ads, your recommendations — trained on staggering volumes of click data.
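
The size claim is easy to check with arithmetic. The cardinalities, embedding width, and MLP shape below are hypothetical values chosen for illustration; real systems vary widely.

```python
def table_bytes(n_ids, dim, bytes_per_value=4):
    """Memory for one embedding table: one `dim`-wide float32 vector per ID."""
    return n_ids * dim * bytes_per_value

# Hypothetical sizes chosen for illustration; real cardinalities vary by system.
tables = {"user_id": 1_000_000_000, "item_id": 100_000_000, "device": 1_000}
dim = 64
total = sum(table_bytes(n, dim) for n in tables.values())

mlp_params = 512 * 256 + 256 * 1     # a small top MLP, for comparison
print(f"embedding tables: {total / 1e9:.1f} GB")
print(f"MLP weights:      {mlp_params * 4 / 1e6:.2f} MB")
```

Under these assumptions the tables need roughly 280 GB while the MLP needs about half a megabyte, which is why sharding the tables dominates the infrastructure work.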

DLRM: embeddings + interactions + MLP

Categorical features hit big embedding tables; continuous features hit an MLP; all the feature vectors are crossed (pairwise dot products); a top MLP predicts the click probability. Drag to see the feature-interaction crosses light up.

What is the main infrastructure bottleneck in a production DLRM-style ranker?

The final MLP’s depth
The huge embedding tables (billions of IDs → hundreds of GB), far larger than the rest of the model
The number of continuous features

Chapter 7: Recommending, Live (showcase)

Put it together: a user lands in the embedding space, retrieval pulls the nearest candidate items via nearest-neighbor search, and ranking reorders them precisely. Move the user and watch their recommendations update; flip on a “diversity” re-rank and see the list spread out. This is a miniature of what runs every time a feed loads.

Embedding space → retrieve → rank

The user (star) sits among items (dots). Retrieve grabs the nearest candidates (ring); the ranked shortlist appears at the side. Move the user to change recommendations; toggle diversity to re-rank for variety instead of pure nearest-neighbor. Watch the funnel pick what you’d see.

Notice the tension: pure relevance clusters very similar items (a row of near-identical thrillers); diversity re-ranking spreads the picks to keep the feed interesting and avoid boring the user. Real systems blend relevance with diversity, freshness, and business goals in that final re-rank — the last, human-judgment-laden step before pixels hit your screen.
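
One well-known way to trade relevance against redundancy is maximal marginal relevance (MMR). The text does not name a specific method, so treat this as a swapped-in illustration: each pick maximizes relevance minus a penalty for similarity to items already chosen. The item vectors and relevance scores are made up.

```python
import numpy as np

def mmr_rerank(relevance, item_vecs, k, lam=0.7):
    """Greedy pick: lam * relevance minus (1 - lam) * max similarity to items already picked."""
    unit = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    sim = unit @ unit.T                                 # cosine similarity between items
    picked, remaining = [], list(range(len(relevance)))
    while remaining and len(picked) < k:
        def value(i):
            redundancy = max(sim[i, j] for j in picked) if picked else 0.0
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=value)
        picked.append(best)
        remaining.remove(best)
    return picked

# Three near-identical "thrillers" (items 0-2) and one slightly less relevant "comedy" (item 3).
vecs = np.array([[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.0, 1.0]])
rel = np.array([0.95, 0.94, 0.93, 0.80])

print(mmr_rerank(rel, vecs, k=2, lam=1.0))   # pure relevance: two thrillers
print(mmr_rerank(rel, vecs, k=2, lam=0.5))   # with diversity: a thriller, then the comedy
```

The parameter `lam` is the knob behind a "diversity" toggle: at 1.0 the list is pure relevance, and lowering it spreads the picks out.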

Chapter 8: Hard Problems & Modern Directions

Cold start: a brand-new user or item has no interaction history. Mitigated by leaning on content features (an item’s text/genre gives it an embedding before anyone clicks) and by exploration.
Exploration vs. exploitation: always recommending the predicted-best items means never learning about the rest. Systems must explore (show uncertain items to gather data) — bandit algorithms balance this.
Feedback loops & filter bubbles: the model trains on what it showed, which shapes what users click, which trains the model — a self-reinforcing loop that can narrow content and amplify bias. A serious, under-appreciated risk.
Implicit feedback: most signal is implicit (clicks, watch-time) not explicit ratings — noisier (a click isn’t a like) and missing-not-at-random.
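
The exploration idea can be sketched with epsilon-greedy, one of the simplest bandit policies. This is an illustration under assumptions, not what any particular system runs: the click rates are invented, clicks are simulated, and the policy explores uniformly at random with probability `epsilon`.

```python
import numpy as np

def epsilon_greedy(true_ctr, n_rounds=5000, epsilon=0.1, seed=0):
    """Show the best-looking item most of the time; a random one with probability epsilon."""
    rng = np.random.default_rng(seed)
    n = len(true_ctr)
    shows = np.zeros(n)
    clicks = np.zeros(n)
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            item = int(rng.integers(n))                    # explore: pick any item
        else:
            est = clicks / np.maximum(shows, 1)            # estimated click rate so far
            item = int(np.argmax(est))                     # exploit: pick the current best
        shows[item] += 1
        clicks[item] += float(rng.random() < true_ctr[item])   # simulated user click
    return shows, clicks

true_ctr = np.array([0.05, 0.10, 0.30])    # hypothetical click rates, unknown to the policy
shows, clicks = epsilon_greedy(true_ctr)
print("times shown:", shows)
print("estimated CTR:", clicks / np.maximum(shows, 1))
```

With `epsilon=0`, the policy can lock onto whichever item happens to get the first click and never learn that another item is better. The small exploration budget is what lets it discover the strongest item.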

Modern directions push beyond the static two-tower. Sequential / session-based recommenders use transformers to model your recent sequence of actions and predict the next item (SASRec, BERT4Rec), capturing intent that changes within a session. Graph-based recommenders treat interactions as a bipartite graph and run a GNN over it; PinSage at Pinterest does this on the graph of pins and boards. It is the message-passing idea from the GNN lesson, applied to recommendation. And LLMs are entering the space, both for understanding item content and for generating recommendations directly. But the backbone of embeddings, two-tower retrieval, and a ranking funnel remains the foundation underneath it all.

The feedback loop

The model recommends → the user clicks (mostly) what was shown → that data trains the model → it recommends similarly. Drag the loop strength: too much exploitation and the content narrows into a filter bubble; exploration keeps it open.

What is the “feedback loop” risk in recommenders?

The model trains too slowly
The model trains on what it showed, shaping what users click, which trains it again: a self-reinforcing loop that can narrow content (filter bubbles)
Users give too many ratings

Chapter 9: Cheat Sheet & Connections

problem: sparse user×item matrix; predict the missing preferences (personalization)
↓ collaborative filtering → matrix factorization
embeddings: users & items as vectors in one space; dot product = affinity
↓ two-tower (precompute item embeddings)
retrieval: ANN nearest-neighbor; billions → hundreds, cheap (recall)
↓ ranking (DLRM: embeddings + interactions)
ranking + re-rank: heavy model orders hundreds precisely (precision); diversity/freshness

Method	Role
Collaborative filtering	recommend from interaction patterns (“users like you”)
Matrix factorization	user×item → latent vectors; dot product fills missing cells
Two-tower	retrieval at scale via precomputed item embeddings + ANN
DLRM	ranking: big embedding tables + feature interactions + MLP
SASRec / BERT4Rec	sequential: transformer over recent actions → next item
PinSage	graph: GNN over a bipartite interaction graph (pins and boards at Pinterest)

Keep exploring

→ Embedding Layers — the learned vectors at the core
→ Contrastive Learning — the in-batch-negatives training of two-tower
→ Similarity Metrics / Vector Databases — the ANN retrieval engine
→ Graph Neural Networks — graph-based recommenders (PinSage)

“What I cannot create, I do not understand.” You just rebuilt the recommender: turn the sparse user×item matrix into embeddings (matrix factorization), produce them with two separate towers so item vectors can be precomputed, retrieve the nearest candidates from billions by approximate nearest-neighbor, and rank the survivors with a heavy interaction model. Cheap-and-wide, then expensive-and-narrow — that’s how a feed is built.