Day In The Life — Web-Scale AI

Search & Recommendations
Research Scientist

Staff-level interview prep: information retrieval, dense embeddings, transformer ranking, recommendation systems, distributed training, and serving a web-scale index.

Prerequisites: Python + Linear Algebra basics + ML fundamentals. That's it.
13
Chapters
13+
Simulations
5
Interview Dimensions

Chapter 0: The Web-Scale AI Scientist's World

A user types "best noise-canceling headphones under $200 for commuting" into a search bar. In under 200 milliseconds, your system must understand the intent (product discovery, not troubleshooting), match it against a web index of 50 billion documents, rank the results by relevance and personalization, inject sponsored results without destroying trust, and return a page that feels magically tailored. If any stage takes too long, the user hits "back" and you lose them forever.

This is not keyword matching. This is a real-time intelligence pipeline that fuses information retrieval, recommendation systems, and transformer-based understanding into a single sub-200ms response. And you are the research scientist who designs, trains, and serves the models that make it work at web scale.

It is 9:00 AM. You badge into the office at Parallel, a $2B web infrastructure company building AI agents with programmatic web access. On your first monitor, the nightly training job for the cross-encoder reranker finished — but the NDCG@10 on the held-out test set dropped 0.8% compared to the previous checkpoint. On your second monitor, the serving team reports that p99 latency for ANN retrieval spiked from 12ms to 45ms after yesterday's index update added 200M new documents. On your third monitor, a Slack thread where the recommendations team is proposing to unify the search embedding model with the recommendation embedding model — one model to rule both explicit queries and implicit browsing signals.

Before lunch, you will diagnose the NDCG regression (a data pipeline bug duplicated 3% of training pairs, biasing the model toward popular documents), implement a quantization experiment to shrink the ANN index by 4x without exceeding the latency budget, and draft a design doc for the unified embedding model (arguing that search and recs should share the first 8 transformer layers but have separate projection heads for their different similarity metrics).

This is the daily reality of a Research Scientist — Web-Scale Search & Recommendations. You sit at the intersection of three domains that are rapidly converging:

DomainWhat it ownsYour daily intersection
Information RetrievalQuery understanding, document indexing, BM25, inverted indexesYou design the scoring functions that decide which documents are relevant
Recommendation SystemsCollaborative filtering, content-based, user modeling, cold startYou build the neural recommenders that personalize results
Transformer ModelsBERT ranking, dense retrieval, cross-encoders, generative retrievalYou train and serve the models that understand both queries and documents
Systems EngineeringDistributed training, ANN indexing, quantization, caching, latency budgetsYou make billion-parameter models serve 50K QPS at p99 < 200ms
Data InfrastructureCrawling, deduplication, feature stores, click logging, ETLYou build the data pipelines that feed everything above
Three domains, one pipeline. The core thesis of this role — and this lesson — is that search, recommendations, and transformers are converging into a single unified system. A modern search engine IS a recommendation system (personalized ranking). A modern recommendation system IS a search engine (retrieval over a content catalog). And both run on transformers. Every chapter prepares you to design, build, debug, and defend this converged system in an interview.

The Numbers That Matter

Staff-level interviewers expect you to have these numbers at your fingertips. Not because precision matters, but because they reveal whether you have actually built systems at this scale or are just talking theoretically.

MetricTypical valueWhy it matters
Web index size50-100 billion documentsDetermines sharding strategy
Query volume50K-500K QPS (peak)Determines fleet size
Latency budget200ms total, p99Every architecture decision optimizes for this
BM25 retrieval10-15ms for top-1000Bottlenecked by slowest shard in scatter-gather
ANN retrieval5-15ms for top-1000Depends on index type (HNSW ~5ms, IVF ~15ms)
Cross-encoder0.5ms per (q,d) pair100 candidates = 50ms with batching on GPU
Embedding dimension128-768Higher = more accurate, more storage
NDCG@10 improvement bar+0.5-1% for a big launchTiny-sounding but impacts millions of queries
Click-through rate30-50% for position 1Falls off exponentially by position
A/B test duration7-14 days minimumCaptures day-of-week and novelty effects

The System You Build

The diagram below traces a single query from arrival to results page. Every box is a system you own or co-own.

1. Query Understanding
Parse the user query. Identify intent (navigational, informational, transactional). Expand with synonyms, spell-correct, classify into verticals. Produce a query embedding.
2. Candidate Retrieval
Two parallel paths: BM25 over inverted index (lexical match) and ANN search over dense embeddings (semantic match). Union the top-k from each. Budget: <20ms for both.
3. Feature Enrichment
For each candidate, fetch features: document quality signals, click-through rate, freshness, user-document affinity. From the feature store. Budget: <5ms (pre-computed, cached).
4. Neural Reranking
Cross-encoder scores each (query, document) pair. This is the most expensive step but the most accurate. Top-100 candidates reranked to top-10. Budget: <50ms with batching.
5. Result Assembly
Blend organic results with sponsored results, apply diversity rules (no more than 2 from same domain), personalize order based on user history. Return final page. Budget: <10ms.
Search Pipeline Overview

Watch a query flow through the full pipeline. Latency counters show where time is spent. Click Inject Latency to simulate an ANN index spike.

Interview Dimensions

Staff-level interviews at companies building web-scale search and recommendations test you across five dimensions. Each chapter maps to one or more:

DimensionWhat they askChapters
CONCEPT"Explain BM25 from first principles. Why does IDF matter?"1, 2, 3, 4, 7
DESIGN"Design a search system that handles 50K QPS with <200ms p99"0, 5, 6, 9, 11
CODE"Implement a two-tower model for candidate retrieval"1, 2, 4, 7, 12
DEBUG"NDCG dropped 2% after a model update. Walk me through your investigation."5, 8, 9, 11
FRONTIER"How will generative retrieval change search architecture?"3, 10, 12

A Typical Week

To make this concrete, here is what a week looks like in this role:

Monday: Review weekend training runs. The new bi-encoder checkpoint trained on 2B additional pairs shows +1.2% NDCG on head queries but -0.3% on tail queries. Investigate: the new training data is biased toward popular queries (head). Solution: rebalance sampling to 60% tail / 40% head in next training run.

Tuesday: Design review for the unified embedding model. Present your architecture (shared BERT backbone, separate projection heads for search and recommendations). Debate with the recommendations team: they want to add user history features into the encoder, which would break the bi-encoder assumption (documents cannot be pre-encoded if the encoding depends on the user). Compromise: keep the document encoder user-independent, add user signals in a post-hoc fusion layer.

Wednesday: Production incident. ANN p99 spiked from 12ms to 60ms after the nightly index rebuild added 200M new vectors. Root cause: the IVF centroids are stale (trained on the old distribution). Quick fix: increase nprobe from 32 to 48 (trades latency headroom for recall). Long-term fix: schedule centroid retraining every 2 weeks instead of monthly.

Thursday: Write code for a new negative mining strategy. Instead of BM25 hard negatives, you are mining from the model's own ANN index (ANCE-style). Build the pipeline: (1) encode all training queries with current model, (2) search ANN index for each, (3) filter out positives, (4) take top-10 as hard negatives. Run initial experiments.

Friday: Analyze the 2-week A/B test results for the new cross-encoder reranker. +1.8% NDCG@10, +0.4% CTR, -0.6% reformulation rate. All statistically significant (p < 0.01). Write the launch decision document with production rollout plan.

A user searches "python" on your search engine. The top result is about the snake, but analytics show 85% of users clicking "python" want the programming language. Where in the pipeline should you fix this?

Chapter 1: Information Retrieval Foundations

Before neural networks, before embeddings, before transformers — there was a surprisingly elegant mathematical framework for finding relevant documents. It has survived 60 years because it works. Every modern search system still uses it as a first-pass retrieval stage, and understanding it deeply is the foundation for everything that follows.

The core problem: you have a query (a few words) and a corpus (billions of documents). You need to score every document against the query and return the top-k most relevant. You have milliseconds, not seconds. The insight that makes this tractable is the inverted index — instead of scanning every document, you build a lookup table that maps each word to the list of documents containing it.

CONCEPT: TF-IDF from First Principles

Imagine you search for "transformer architecture attention." Which documents are most relevant? A naive approach would count how many times each query word appears in each document. But a document about "attention deficit disorder" that mentions "attention" 50 times is not what you want. We need two insights:

Term Frequency (TF) measures how important a word is within a single document. If "transformer" appears 10 times in a paper about transformer architectures, that is meaningful. But raw counts are misleading — a 10,000-word document naturally has more occurrences than a 100-word abstract. So we normalize:

TF(t, d) = count(t, d) / |d|

where count(t, d) is how many times term t appears in document d, and |d| is the total number of terms in d.

Inverse Document Frequency (IDF) measures how rare a word is across the entire corpus. The word "the" appears in nearly every document — it carries almost zero information about relevance. The word "transformer" appears in far fewer documents — it is highly discriminative. IDF captures this:

IDF(t) = log(N / DF(t))

where N is total documents in the corpus and DF(t) is how many documents contain term t. The log compresses the range — a term appearing in 1 of 1,000,000 documents gets IDF ≈ 14, while a term in 100,000 of 1,000,000 gets IDF ≈ 2.3.

The final TF-IDF score for a query-document pair is:

score(q, d) = ∑t ∈ q TF(t, d) · IDF(t)
IDF is the soul of retrieval. Without IDF, the word "the" dominates every score. IDF is why a search for "rare genetic mutation" actually finds genetics papers — "rare," "genetic," and "mutation" all have high IDF, so documents containing all three score astronomically higher than documents with just common words. This is the single most important concept in classical IR.

CONCEPT: BM25 — TF-IDF's Smarter Cousin

TF-IDF has a problem: if "transformer" appears 100 times vs. 10 times, the raw TF score is 10x higher. But is the document really 10x more relevant? Probably not. BM25 (Best Match 25) fixes this with saturation — after enough occurrences, additional ones contribute diminishing returns:

BM25(q, d) = ∑t ∈ q IDF(t) · TF(t,d) · (k1 + 1) / (TF(t,d) + k1 · (1 - b + b · |d|/avgdl))

Two parameters control behavior: k1 (typically 1.2) controls saturation speed — higher means more occurrences still matter. b (typically 0.75) controls document length normalization — 0 means ignore length, 1 means fully penalize long documents. The term avgdl is the average document length in the corpus.

CONCEPT: The Inverted Index

Even with BM25, scoring every document in a 50-billion-document corpus is impossible in real time. The inverted index avoids this entirely. Instead of asking "which words does document X contain?" (a forward index), we ask "which documents contain word Y?" (an inverted index).

For a query with 3 terms, you look up 3 posting lists and intersect them. Instead of scanning 50 billion documents, you scan perhaps 50 million postings. Combined with skip pointers and block-max WAND (Weighted AND) algorithms, you can find the top-1000 BM25 results from a web-scale index in under 10ms.

The inverted index is also the key to index updates. When a new document is crawled, you compute its terms and append to the relevant posting lists. When a document is deleted, you add it to a deletion bitmap. Periodically, you compact the index (merge segments, remove deleted entries). This is exactly how Elasticsearch/Lucene works under the hood: immutable index segments + a merge policy.

At web scale, the inverted index for 50B documents consumes approximately 5-20 TB of storage (depending on compression and the number of indexed fields). This is sharded across 32-128 machines, with each shard holding a contiguous range of document IDs. The coordinator routes queries to all shards in parallel and merges their top-k results using a priority queue.

CODE: BM25 from Scratch

python
import math
from collections import Counter

class BM25:
    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.N = len(docs)
        self.avgdl = sum(len(d) for d in docs) / self.N
        self.tf = [Counter(d) for d in docs]  # term freq per doc
        self.dl = [len(d) for d in docs]       # doc lengths
        self.df = Counter()                      # doc freq per term
        for d in docs:
            for t in set(d):
                self.df[t] += 1

    def idf(self, term):
        df = self.df.get(term, 0)
        return math.log((1 + self.N - df + 0.5) / (df + 0.5))

    def score(self, query, doc_idx):
        s = 0.0
        dl = self.dl[doc_idx]
        for t in query:
            tf = self.tf[doc_idx].get(t, 0)
            num = tf * (self.k1 + 1)
            denom = tf + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            s += self.idf(t) * num / denom
        return s

    def rank(self, query, top_k=10):
        scores = [(i, self.score(query, i)) for i in range(self.N)]
        return sorted(scores, key=lambda x: -x[1])[:top_k]

DEBUG: When BM25 Fails

Symptom: User searches "ML model deployment best practices" but top results are all about "model trains" and "deployment of troops." Root cause: BM25 is purely lexical — it cannot understand that "ML" contextualizes "model" to mean machine learning. Fix: Add query expansion (expand "ML" to "machine learning"), or use a learned query classifier to restrict to a vertical. Long-term fix: dense retrieval (Chapter 2).

Symptom: Extremely long documents always rank first. Root cause: b parameter too low (under 0.5), so length normalization is weak. Long documents accumulate more term matches. Fix: Increase b toward 0.75-0.85. Or: normalize by passage, not document.

DESIGN: Inverted Index at Scale

At web scale (50B docs), a single inverted index does not fit in memory. You shard it: partition documents across N machines (by document ID hash or by term range). A query hits all shards in parallel, each returns its local top-k, and a coordinator merges the results. This is called scatter-gather. The latency is dominated by the slowest shard (tail latency), so you over-replicate hot shards and use hedged requests (send the query to 2 replicas, take whichever responds first).

WORKED EXAMPLE: Hand-Computing BM25

Let us compute BM25 by hand for a tiny corpus. This is exactly the kind of calculation interviewers ask you to walk through on a whiteboard.

Corpus (3 documents):

DocTextLength
D1"transformer attention mechanism"3
D2"attention deficit disorder treatment attention"5
D3"transformer architecture design transformer"4

Query: "transformer attention"

Step 1: Compute IDF (N = 3 documents):

IDF("transformer") = log((3 - 2 + 0.5) / (2 + 0.5)) = log(0.6) = -0.51 ← negative! Appears in most docs IDF("attention") = log((3 - 2 + 0.5) / (2 + 0.5)) = log(0.6) = -0.51 ← same frequency

Wait — negative IDF? This happens with the Robertson-Sparck Jones IDF when a term appears in more than half the documents. BM25 implementations either clip to zero or use the corrected formula: IDF = log(1 + (N - DF + 0.5) / (DF + 0.5)). With the corrected formula:

IDF("transformer") = log(1 + (3 - 2 + 0.5) / (2 + 0.5)) = log(1.6) = 0.47 IDF("attention") = log(1 + (3 - 2 + 0.5) / (2 + 0.5)) = log(1.6) = 0.47

Step 2: Compute BM25 per document (k1=1.2, b=0.75, avgdl=4.0):

D1: TF("transformer")=1, TF("attention")=1, |D1|=3 score("transformer") = 0.47 * (1 * 2.2) / (1 + 1.2*(1 - 0.75 + 0.75*3/4)) = 0.47 * 2.2 / 1.975 = 0.523 score("attention") = 0.47 * 2.2 / 1.975 = 0.523 BM25(D1) = 0.523 + 0.523 = 1.046 D2: TF("transformer")=0, TF("attention")=2, |D2|=5 score("transformer") = 0 (term not present) score("attention") = 0.47 * (2 * 2.2) / (2 + 1.2*(1 - 0.75 + 0.75*5/4)) = 0.47 * 4.4 / 3.375 = 0.613 BM25(D2) = 0 + 0.613 = 0.613 D3: TF("transformer")=2, TF("attention")=0, |D3|=4 score("transformer") = 0.47 * (2 * 2.2) / (2 + 1.2*(1 - 0.75 + 0.75*4/4)) = 0.47 * 4.4 / 3.2 = 0.646 score("attention") = 0 (term not present) BM25(D3) = 0.646 + 0 = 0.646

Ranking: D1 (1.046) > D3 (0.646) > D2 (0.613). D1 wins because it contains both query terms. This is the power of multi-term queries: matching all terms is far more valuable than matching one term many times.

Whiteboard tip: Interviewers love asking you to trace BM25 by hand. Memorize the formula, know the corrected IDF, and always note that multi-term matching dominates single-term frequency. If you can do this calculation confidently in 5 minutes, you pass the "do they actually understand IR?" filter.

DESIGN: Block-Max WAND for Top-k

Even with an inverted index, naively scoring every document in the posting lists is too slow for web-scale queries. Block-Max WAND prunes the search: for each block of postings, pre-compute the maximum possible BM25 contribution. If the maximum contribution from remaining terms cannot exceed the current k-th highest score, skip the entire block. This reduces the number of full score computations by 10-100x.

The algorithm maintains a threshold (the score of the current k-th result). For each candidate document, it computes an upper bound on the BM25 score using block-level maxima. Only if the upper bound exceeds the threshold does it compute the exact score. In practice, this means a query over a 10B-document index touches ~0.001% of the postings.

FRONTIER: BM25 is Not Dead (2024-2025)

Despite the rise of neural retrieval, BM25 remains in production at Google, Bing, and every major search engine as the first-stage retriever. Recent work shows that hybrid retrieval (BM25 + dense) consistently outperforms either alone. The SPLADE family of models (2024) learns sparse representations that are compatible with inverted indexes but capture semantic similarity — bridging the gap between lexical and neural retrieval. Anserini + Pyserini provide research-grade BM25 implementations used in TREC competitions.

SPLADE deserves special attention. It uses a BERT encoder to predict term weights for every token in the vocabulary — including tokens NOT present in the original document. This means the document "repairing a dripping tap" gets a non-zero weight for "faucet" and "plumbing" even though those words do not appear. The resulting sparse representation is stored in a standard inverted index, so you get semantic matching with BM25 efficiency. SPLADE++ (2024) achieves within 1-2% of dense retrieval on MS MARCO while being 10x faster to serve.

BM25 Scoring Visualizer

Adjust k1 (saturation) and b (length normalization) to see how BM25 scores change. The curve shows score vs. term frequency for a fixed IDF.

k1 1.2
b 0.75
You have two documents about "neural networks." Doc A is 500 words and mentions the exact query term 5 times. Doc B is 5000 words and mentions the query term 15 times. With BM25 (k1=1.2, b=0.75, avgdl=1000), which scores higher and why?

Chapter 2: Dense Retrieval & Embeddings

BM25 finds documents that share words with your query. But what about synonyms? What about paraphrases? If you search "how to fix a leaky faucet" and a document says "repairing a dripping tap," BM25 scores it at zero — no shared terms. Dense retrieval solves this by representing both queries and documents as vectors in a shared embedding space, where semantic similarity corresponds to geometric proximity.

CONCEPT: Bi-Encoders

A bi-encoder uses two separate encoder networks (often sharing weights) to independently embed the query and the document into fixed-size vectors. At retrieval time, you compute the dot product or cosine similarity between the query vector and every document vector. Because document vectors are pre-computed and stored in an index, retrieval is just a nearest-neighbor search — no neural network inference per document at query time.

score(q, d) = embedq(q)T · embedd(d)

The key insight: the query encoder and document encoder are trained jointly but run independently. You encode all 50 billion documents offline (takes days on a GPU cluster), store the vectors in a vector database, and at query time you only run the query encoder (one forward pass, ~5ms) followed by approximate nearest neighbor search (~10ms).

CONCEPT: Cross-Encoders

A cross-encoder takes the query and document together as a single input, separated by a [SEP] token. This allows full attention between query tokens and document tokens — "leaky" in the query can directly attend to "dripping" in the document. Cross-encoders are dramatically more accurate than bi-encoders but catastrophically slower: you must run the model once per (query, document) pair. At 100ms per pair, scoring 1000 candidates takes 100 seconds. Unusable for first-stage retrieval, perfect for reranking the top-100.

score(q, d) = MLP(BERT([CLS] q [SEP] d [SEP]))

The MLP is typically a single linear layer that maps the 768-d [CLS] hidden state to a scalar relevance score. Some implementations use a two-layer MLP with ReLU activation. The model is fine-tuned end-to-end on (query, document, relevance_label) triples, where labels come from human judgments or debiased click data.

Cross-encoders achieve 5-15% higher NDCG than bi-encoders on benchmarks like MS MARCO and BEIR. The accuracy gap is largest on queries requiring deep semantic understanding ("What president was born in the same state as the inventor of the telephone?" requires connecting multiple facts). It is smallest on keyword-heavy queries where lexical matching suffices.

CONCEPT: Contrastive Training

How do you train a bi-encoder to put semantically similar pairs close together and dissimilar pairs far apart? Contrastive learning. Given a batch of (query, positive document) pairs, you treat every other document in the batch as a negative. The loss pushes positive pairs together and negative pairs apart:

L = -log( exp(sim(q, d+) / τ) / ∑d ∈ batch exp(sim(q, d) / τ) )

This is the InfoNCE loss (also called NT-Xent). The temperature τ controls how sharp the distribution is — lower τ makes the model more discriminative but harder to train. The batch size matters enormously: with a batch of 1024, each query has 1023 negatives, giving a much richer gradient signal than a batch of 32.

Why does temperature matter so much? At τ = 1.0, the softmax is "smooth" — many negatives contribute to the gradient, but the signal per negative is weak. At τ = 0.01, the softmax is "sharp" — only the hardest negatives contribute, giving a strong but potentially noisy gradient. In practice, τ = 0.05-0.1 works best for search embeddings. Lower values (τ = 0.01-0.03) work better for recommendation embeddings where the positive pairs are less semantically distinct.

Research shows that doubling the batch size from 128 to 256 improves retrieval recall by 2-4%, and going from 256 to 1024 adds another 3-5%. Beyond 2048, returns diminish. This is why large-batch contrastive training requires multi-GPU setups even for modest-sized encoder models.

The connection to metric learning is deep: InfoNCE with τ → 0 converges to triplet loss with the hardest negative. But unlike triplet loss (which only considers one negative per step), InfoNCE considers all negatives simultaneously, giving a much richer gradient. This is why InfoNCE replaced triplet loss as the standard for embedding training starting around 2019.

Hard negatives are the secret weapon. Random negatives are too easy — a document about cooking is obviously irrelevant to a query about programming. The model learns nothing from easy negatives. Hard negatives are documents that are superficially similar but actually irrelevant — like a Python snake article for a Python programming query. Mining hard negatives from BM25 top-100 (the "hard negative mining" trick) can boost retrieval quality by 5-15%.

CONCEPT: FAISS and Approximate Nearest Neighbors

With 50 billion document vectors (768 dimensions each), exact nearest neighbor search is impossible — it requires 50B dot products per query. Approximate Nearest Neighbor (ANN) algorithms trade a small accuracy loss for massive speedup. The three dominant approaches:

MethodHow it worksSpeedRecall@100
IVF (Inverted File)Cluster vectors into C centroids. At query time, probe only the P nearest clusters10-50ms95-99%
HNSW (Hierarchical NSW)Build a navigable small-world graph. Greedy search from top layer down1-10ms98-99.5%
PQ (Product Quantization)Compress 768-dim vectors to 32-64 bytes. Approximate distances in compressed space5-20ms90-97%

FAISS (Facebook AI Similarity Search) combines these: IVF for coarse partitioning, PQ for compression, and HNSW for the coarse quantizer. The compound index IVF4096,PQ64 means: 4096 clusters, each vector compressed to 64 bytes. This can search 1 billion vectors in under 5ms on a single machine.

CODE: Training a Bi-Encoder

python
import torch
import torch.nn.functional as F

class BiEncoder(torch.nn.Module):
    def __init__(self, encoder, dim=768):
        super().__init__()
        self.encoder = encoder       # e.g. BERT-base
        self.proj = torch.nn.Linear(dim, 128)  # compress to 128-d

    def encode(self, input_ids, attention_mask):
        out = self.encoder(input_ids, attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token
        return F.normalize(self.proj(cls), dim=-1)

    def forward(self, q_ids, q_mask, d_ids, d_mask, tau=0.05):
        q_emb = self.encode(q_ids, q_mask)     # (B, 128)
        d_emb = self.encode(d_ids, d_mask)     # (B, 128)
        sim = q_emb @ d_emb.T / tau            # (B, B)
        labels = torch.arange(len(q_emb), device=sim.device)
        return F.cross_entropy(sim, labels)   # InfoNCE

DEBUG: When Dense Retrieval Fails

Symptom: Dense retrieval misses exact keyword matches that BM25 catches. User searches for error code "ERR_SSL_PROTOCOL_ERROR" and dense retrieval returns generic SSL articles, not the specific error page. Root cause: The embedding model was trained on natural language, not error codes. It maps all SSL-related content to a similar region of embedding space. Fix: Hybrid retrieval — take the union of BM25 top-100 (catches exact matches) and dense top-100 (catches semantic matches). This is why every production system uses both.

Symptom: ANN recall drops from 99% to 85% after adding 500M new vectors. Root cause: The IVF centroids were trained on the old distribution. New documents cluster differently. Fix: Retrain the IVF centroids periodically, or use HNSW which does not require retraining (but uses more memory).

Symptom: The bi-encoder retrieves the right documents for English queries but fails for multilingual queries. Root cause: The model was trained primarily on English (query, document) pairs. The embedding space has a large "English cluster" and small, poorly calibrated clusters for other languages. Fix: Fine-tune on multilingual training data using mBERT or XLM-R as the base encoder. Use cross-lingual distillation: train a multilingual student model to match the embeddings of a strong English teacher model, with parallel sentence pairs providing the alignment signal.

DESIGN: Hybrid Retrieval Fusion Strategies

When you combine BM25 and dense retrieval results, the fusion strategy matters. Three approaches, ranked by complexity:

StrategyHow it worksProsCons
Union + rerankTake union of top-K from each, rerank with cross-encoderSimple, reranker fixes errorsWastes reranker budget on duplicates
Reciprocal Rank Fusion (RRF)Score each doc by 1/(k+rank) summed across systemsScore-agnostic, no tuning neededIgnores magnitude of score differences
Learned fusion (CombMNZ)Normalize scores to [0,1], weighted sum with learned weightsOptimal given enough training dataRequires tuning, fragile to score drift

In practice, RRF (with k=60) is the default choice because it is robust, requires no calibration, and performs within 1-2% NDCG of learned fusion. The key insight: RRF does not care about the absolute scores from each system (BM25 scores might be 5-15, cosine similarities 0.3-0.9); it only cares about the ranking order. This makes it immune to score miscalibration.

DESIGN: Vector Index Architecture

For a 10-billion-vector index at 128 dimensions with PQ64 compression: each vector takes 64 bytes. Total: 640 GB. You shard across 16 machines (40 GB each). Each machine loads its shard into memory. A query hits all 16 in parallel; each returns its local top-100; the coordinator merges to global top-100 in ~1ms. Add 2x replication for fault tolerance and tail latency hedging. Total fleet: 32 machines. This is the scale architecture question interviewers love.

WORKED EXAMPLE: Embedding Arithmetic

Let us trace the full data flow from text to ANN result, with concrete numbers. This is what interviewers mean when they ask "walk me through dense retrieval end to end."

Input query: "best noise-canceling headphones" # 4 tokens (after tokenization: maybe 6 subwords) Tokenizer output: [101, 2190, 6994, 1011, 18542, 17854, 102] # BERT token IDs BERT output: (1, 7, 768) # 7 tokens x 768-d hidden states [CLS] pooling: (1, 768) # take first token Projection: (1, 128) # Linear(768, 128) reduces dimensionality L2 normalize: (1, 128) # ||q|| = 1, so dot product = cosine sim

Memory math for the index: 1 billion documents at 128 dimensions, FP32 = 1B × 128 × 4 bytes = 512 GB. With PQ64 compression: 1B × 64 bytes = 64 GB. That fits on a single machine with 128 GB RAM, or 2 shards with comfortable headroom.

Query latency math: Encode query with BERT-base (~5ms on GPU). HNSW search over 1B vectors (~3ms with ef_search=64). Total: ~8ms. Compare to BM25: ~12ms. Dense is actually faster per-query, but the index build is enormously more expensive (1B BERT forward passes vs. tokenization + posting list construction).

Key number to memorize: 1 billion 128-d FP32 vectors = 512 GB. With PQ64 = 64 GB. With scalar quantization (INT8) = 128 GB. These numbers come up in every system design interview about vector search.

DESIGN: Negative Mining Strategies

The quality of your bi-encoder hinges on the quality of your negatives during training. Here is the hierarchy, from weakest to strongest:

StrategySourceNDCG impactCompute cost
Random negativesRandom documents from corpusBaselineFree
In-batch negativesOther (query, doc) pairs in same batch+5-8%Free (already computed)
BM25 hard negativesTop-100 BM25 results that are not labeled relevant+8-15%One BM25 run per query
Model hard negatives (ANCE)Top-100 from the model's own ANN index, refreshed periodically+12-20%Rebuild ANN index every N steps
Cross-encoder distillationUse cross-encoder scores to mine hardest negatives+15-25%Cross-encoder inference on candidates

The ANCE (Approximate Nearest Neighbor Negative Contrastive Estimation) approach is particularly clever: periodically rebuild the ANN index using the current model, then use the model's own nearest neighbors as hard negatives. The model is forced to distinguish documents that it currently thinks are similar but are not relevant. This creates a curriculum where negatives get harder as the model improves.

FRONTIER: Matryoshka Embeddings & Multi-Vector (2024-2025)

Matryoshka Representation Learning (MRL) trains a single embedding model that produces useful embeddings at multiple dimensionalities — you can truncate a 768-d embedding to 128-d, 64-d, or even 32-d with graceful quality degradation. This lets you use 32-d for cheap first-pass retrieval and 768-d for reranking, from the same model. Nomic Embed and Jina Embeddings v3 (2024) both support MRL. Multi-vector approaches like ColBERTv2 store one vector per token, enabling richer similarity computation at moderate cost.

The MRL training trick is elegant: during training, apply the InfoNCE loss not just on the full 768-d embedding, but simultaneously on truncations at 512, 256, 128, 64, and 32 dimensions. Each truncation gets its own loss term. The model learns to pack the most important information into the first dimensions, with diminishing information in later dimensions — like a progressive JPEG for embeddings.

Embedding Space Explorer

Queries and documents are projected into 2D embedding space. Drag the query to see which documents are nearest. Toggle between cosine and dot-product similarity.

You're building a search system with 1 billion documents. You want sub-10ms retrieval with >95% recall. Which combination is most appropriate?

Chapter 3: Transformer Architectures for Search

Transformers revolutionized search not by replacing the pipeline, but by upgrading every stage of it. Query understanding, document encoding, retrieval, and reranking all benefit from self-attention. But the specific architecture you choose at each stage involves hard tradeoffs between accuracy, latency, and memory. This chapter maps those tradeoffs.

CONCEPT: BERT for Reranking

The simplest and most impactful application of transformers in search: take the top-100 results from BM25 or dense retrieval, and rerank them with a BERT cross-encoder. The cross-encoder sees the full (query, document) pair, so it understands context that bi-encoders miss.

Why is this so effective? Because attention is bidirectional. When BERT processes "[CLS] best noise canceling headphones [SEP] These premium over-ear headphones feature industry-leading ANC technology [SEP]", the word "headphones" in the query attends to "headphones" in the document, but also "noise canceling" attends to "ANC" — BERT learned that these are synonyms during pretraining on billions of tokens. A bi-encoder cannot do this because query and document are encoded independently.

The cost: BERT-base has 110M parameters. A single forward pass on a (query, 512-token document) pair takes ~10ms on a V100 GPU. Reranking 100 candidates: 1 second. Reranking 1000: 10 seconds. This is why cross-encoders are strictly a reranking stage, never a first-pass retrieval stage.

CONCEPT: ColBERT — Late Interaction

ColBERT (Contextualized Late Interaction over BERT) is a beautiful compromise between bi-encoders and cross-encoders. Instead of compressing each document to a single vector, ColBERT stores one vector per token. At query time, each query token computes its maximum similarity with any document token, and these maxima are summed:

score(q, d) = ∑i ∈ q maxj ∈ d Eq(i)T · Ed(j)

This is called MaxSim. It captures token-level interactions (like a cross-encoder) but keeps document encoding offline (like a bi-encoder). The tradeoff: you store N vectors per document instead of 1, so the index is N times larger. For a 200-token average document, that is 200x more storage. ColBERTv2 mitigates this with residual compression, reducing per-token storage from 128 bytes to ~2 bytes.

CODE: ColBERT MaxSim Implementation

python
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

class ColBERT(torch.nn.Module):
    def __init__(self, model_name="bert-base-uncased", dim=128):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.proj = torch.nn.Linear(768, dim)  # compress per-token embs

    def encode(self, input_ids, attention_mask):
        """Returns per-token embeddings: (batch, seq_len, dim)"""
        out = self.bert(input_ids, attention_mask).last_hidden_state
        embs = self.proj(out)
        return F.normalize(embs, dim=-1) * attention_mask.unsqueeze(-1)

    def maxsim(self, q_embs, d_embs):
        """ColBERT's MaxSim: for each query token, find the max
        similarity with any document token, then sum.
        q_embs: (batch, q_len, dim)
        d_embs: (batch, d_len, dim)
        Returns: (batch,) scalar scores
        """
        # (batch, q_len, d_len) — similarity matrix per pair
        sim = torch.bmm(q_embs, d_embs.transpose(1, 2))
        # Max over document tokens for each query token
        max_sim = sim.max(dim=-1).values  # (batch, q_len)
        # Sum over query tokens
        return max_sim.sum(dim=-1)  # (batch,)

Storage math for ColBERT: A 200-token document at 128 dimensions = 200 × 128 × 4 bytes = 102,400 bytes per document. For 1 billion documents: 102 TB. This is 160x more than a single-vector bi-encoder (640 GB). ColBERTv2's residual compression reduces this to ~2 bytes per token-dimension, bringing the total to ~51 GB — comparable to a bi-encoder with PQ compression. This is why ColBERTv2, not original ColBERT, is used in production.

CONCEPT: Efficient Attention for Long Documents

Standard BERT has a 512-token limit. Web pages are often 5,000-50,000 tokens. You have three options:

StrategyHow it worksQualityCost
Passage splittingSplit document into 512-token passages, score each, take maxGoodMultiplies inference by N passages
Longformer/BigBirdSparse attention (local + global tokens), O(n) instead of O(n²)GoodCustom architecture, harder to fine-tune
Hierarchical encodingEncode passages with BERT, then aggregate with a second transformerBestTwo-stage inference

In practice, passage splitting with max-passage scoring dominates production systems because it is simple, parallelizable, and uses standard BERT checkpoints without modification.

CONCEPT: Query Expansion and Understanding

Before any retrieval happens, the query must be understood and potentially expanded. This is the most underappreciated stage of the pipeline:

Spell correction: Users misspell queries 10-15% of the time. "transphormer atention" should match documents about "transformer attention." Spell correction uses an edit-distance model trained on query logs (the most common correction for "transphormer" in your logs is "transformer").

Query expansion: Add semantically related terms. "heart attack" → "heart attack myocardial infarction cardiac arrest." This improves BM25 recall by matching documents that use medical terminology instead of colloquial terms. Modern query expansion uses a neural model (T5 or GPT-2) fine-tuned to generate relevant expansion terms.

Intent classification: Classify the query into intents: navigational ("facebook login"), informational ("how does BM25 work"), transactional ("buy sony headphones"). Each intent routes to a different ranking configuration. Navigational queries use URL-match features heavily; informational queries emphasize content quality; transactional queries boost product pages.

Query understanding is cheap and high-impact. Spell correction alone improves CTR by 2-5% for affected queries (10-15% of all traffic). Intent classification enables per-vertical ranking that improves NDCG by 3-8%. Total cost: a few small models (spell corrector, intent classifier, query expander) running in parallel at <5ms latency. This is the highest ROI investment in any search system.

CODE: Cross-Encoder Reranker

python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class CrossEncoderReranker:
    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-12-v2"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()

    def rerank(self, query: str, docs: list[str], top_k=10):
        pairs = [(query, doc) for doc in docs]
        inputs = self.tokenizer(
            pairs, padding=True, truncation=True,
            max_length=512, return_tensors="pt"
        )
        with torch.no_grad():
            scores = self.model(**inputs).logits.squeeze(-1)
        ranked = scores.argsort(descending=True)[:top_k]
        return [(docs[i], scores[i].item()) for i in ranked]

DEBUG: Cross-Encoder Pathologies

Symptom: Cross-encoder reranker assigns high scores to documents that are topically related but do not actually answer the query. User searches "when was Python 3.12 released" and the reranker favors a long Python history article over the short release announcement. Root cause: The model was fine-tuned on relevance labels that conflate "topically related" with "answers the query." Fix: Fine-tune with graded relevance labels (0=irrelevant, 1=related, 2=answers the query) and train the model to distinguish grades, not just relevant/irrelevant.

Symptom: Cross-encoder latency varies wildly (p50=30ms, p99=200ms) even with constant batch size. Root cause: Variable document lengths. A 50-token document and a 512-token document take very different amounts of time to process, and BERT's attention is O(n²). When a batch contains a mix of long and short documents, padding wastes compute, and the batch time is dominated by the longest document. Fix: Sort candidates by length and batch similar-length documents together (called "dynamic batching" or "bucketed batching"). This reduces padding waste by 30-50% and stabilizes latency.

Symptom: The cross-encoder gives suspiciously high scores to very short documents (1-2 sentences). Root cause: Short documents have less chance of containing irrelevant information, so the model learns a shortcut: short = relevant. This is a form of spurious correlation. Fix: Add document length as a feature during training (so the model can learn the true length-relevance relationship), and include negative examples that are short but irrelevant.

WORKED EXAMPLE: Passage Splitting for Long Documents

A 5,000-token web page cannot fit into a 512-token BERT model. Here is exactly how passage splitting works in production:

Document: 5,000 tokens total Passage size: 512 tokens Stride (overlap): 256 tokens ← 50% overlap prevents context loss at boundaries Passages generated: P1: tokens [0, 512) ← covers beginning P2: tokens [256, 768) ← overlaps P1 by 256 tokens P3: tokens [512, 1024) ... P18: tokens [4352, 4864) P19: tokens [4488, 5000) ← last passage may be shorter Total passages: 19 (approximately 2× the number of non-overlapping passages) Scoring options: MaxP: score = max(score(q, P_i) for all i) ← most common, works well FirstP: score = score(q, P1) ← cheapest, works for well-structured docs SumP: score = sum(score(q, P_i)) / n_passages ← for long documents where relevance is spread

MaxP is the standard because relevant information in a document is often concentrated in one section (the introduction, or a specific paragraph). A document about "attention mechanisms in transformers" might have 18 passages about other transformer components, but the one passage about attention will score very high — and MaxP captures it.

Passage splitting is also how RAG chunking works. When you build a RAG system, you split documents into passages (chunks) and index each passage separately. The query retrieves the most relevant passage, not the whole document. Chunk size (128-512 tokens), overlap (0-50%), and whether to include metadata (title, section headers) in each chunk are the key hyperparameters that most affect RAG quality.

DESIGN: Multi-Stage Ranking Architecture

The production architecture at scale uses 3-4 stages, each trading coverage for accuracy:

Stage 1: Retrieval (10ms)
BM25 + Dense ANN in parallel. ~1000 candidates from 50B documents. Cost per query: 2 index lookups.
Stage 2: Lightweight Scoring (5ms)
Feature-based LTR model (gradient-boosted trees). 1000 → 200 candidates. Cost: 1000 feature lookups + 1000 tree traversals.
Stage 3: Neural Reranking (40ms)
Cross-encoder or ColBERT. 200 → 20 candidates. Cost: 200 transformer forward passes (batched on GPU).
Stage 4: Blending & Personalization (5ms)
Combine neural scores with business rules, diversity, freshness, personalization. 20 → 10 results.

WORKED EXAMPLE: Cross-Encoder vs Bi-Encoder — Tracing the Attention

This example shows exactly why cross-encoders are more accurate and why that accuracy costs so much more:

BI-ENCODER: Query tokens: [CLS] best ANC headphones [SEP] Doc tokens: [CLS] premium noise cancel ##ling ear ##phones [SEP] Query encoding: self-attention among query tokens only → 768-d vector Doc encoding: self-attention among doc tokens only → 768-d vector Score: dot(q_vec, d_vec) = single number Problem: "ANC" never attends to "noise cancel" — they're in separate encoders CROSS-ENCODER: Joint tokens: [CLS] best ANC headphones [SEP] premium noise cancel ##ling ear ##phones [SEP] Encoding: FULL self-attention across ALL tokens Score: MLP([CLS] hidden state) = single number Advantage: "ANC" directly attends to "noise cancelling" — learns they're synonyms

The latency difference is stark. For 100 candidate documents:

Bi-encoder: 1 query encoding (5ms) + ANN lookup (3ms) = 8ms total Cross-encoder: 100 joint encodings × 10ms each = 1000ms total (125x slower!) With batching: 100 pairs batched on GPU = ~100ms (12x slower, acceptable for reranking)
The batch size is the key to making cross-encoders practical. Without batching, 100 candidates take 1 second. With batching on an A100, those same 100 candidates take 40-100ms. The batch size is limited by GPU memory: a BERT-base cross-encoder with 512-token inputs can batch ~64 pairs on a 16GB GPU. This is why the reranking stage has a hard cap on candidates.

FRONTIER: Generative Retrieval (2024-2025)

Differentiable Search Index (DSI) replaces the entire retrieve-then-rerank pipeline with a single transformer that directly generates document identifiers given a query. Instead of embedding documents into a vector space and doing ANN search, the model memorizes the entire corpus and "retrieves" by generating a structured document ID token-by-token. Google's GENRET and Meta's SEAL (2024) demonstrate this at moderate scale (millions of documents). The open question: can this scale to billions? Current evidence says not yet, but the trajectory is clear.

GritLM (Generative Representational Instruction-Tuned Language Model, 2024) unifies embedding and generation in a single model. It can produce high-quality embeddings (competitive with specialized bi-encoders) AND generate text (competitive with instruction-tuned LLMs) from the same weights. This hints at a future where the retrieval model and the generation model are the same model, eliminating the RAG pipeline entirely.

Multi-Stage Ranking Pipeline

Watch candidates get filtered through each ranking stage. Numbers show candidates remaining at each stage. Click stages to see scoring details.

ColBERT's MaxSim operation computes token-level interactions between query and document. Compared to a bi-encoder, what is ColBERT's main disadvantage?

Chapter 4: Recommendation Systems

Search requires a query. Recommendations do not. When you open YouTube and see a personalized feed, no explicit query was issued — the system inferred what you want from your history, demographics, and the behavior of similar users. Yet under the hood, recommendation and search share the same fundamental structure: score items against a context, return the top-k. The "context" is just different — an explicit query string vs. an implicit user profile.

CONCEPT: Collaborative Filtering from First Principles

The oldest and most intuitive recommendation idea: if users A and B both liked items 1, 2, and 3, and user A also liked item 4, then user B will probably like item 4 too. This is collaborative filtering (CF) — filtering items based on the collective behavior of users.

Mathematically, you have a user-item interaction matrix R (users × items). Most entries are missing (a user has interacted with a tiny fraction of items). CF fills in the missing entries. The most successful approach: matrix factorization. Decompose R ≈ U · VT, where U is (users × k) and V is (items × k). Each user and item gets a k-dimensional embedding. The predicted rating for user u on item i is:

ui = uuT · vi + bu + bi + μ

where bu and bi are user and item biases, and μ is the global mean. This is exactly what won the Netflix Prize in 2009, and the core idea (embed users and items into a shared space, predict via inner product) remains the foundation of every modern neural recommender.

The key to understanding collaborative filtering is the sparsity problem. A typical user-item matrix has 99.9% missing entries. A user on Netflix has rated maybe 200 of 10,000 movies. The matrix factorization must generalize from the 0.1% of observed entries to predict the 99.9% of unobserved entries. This works because the latent space is low-dimensional (k=50-200), acting as a strong regularizer — the model is forced to find simple patterns (like "action lover" or "indie film buff") rather than memorizing individual entries.

The training procedure: minimize the squared error between observed ratings and predictions, with L2 regularization on the embeddings to prevent overfitting:

L = ∑(u,i) ∈ observed (rui - uuT · vi - bu - bi - μ)² + λ(||uu||² + ||vi||²)

Optimization: SGD (iterate over observed entries, update user and item embeddings) or ALS (Alternating Least Squares — fix user embeddings and solve for items, then fix items and solve for users; this is parallelizable and popular in Spark MLlib).

CONCEPT: Content-Based Filtering

Collaborative filtering has a fatal flaw: the cold-start problem. A new item with zero interactions cannot be recommended (no column in the matrix). A new user with no history gets no personalization. Content-based filtering solves this by using item features (text, images, categories) instead of interaction signals. You build an item profile from its features and a user profile from the features of items they liked.

The modern implementation: use a pretrained encoder (BERT for text, CLIP for images) to embed item content. The user profile is the mean (or attention-weighted combination) of their liked items' embeddings. Recommendation is nearest-neighbor search in this embedding space — exactly the same infrastructure as dense retrieval.

Content-based filtering has another advantage: explainability. You can tell the user "We recommended this because it is similar to [item you liked]." With collaborative filtering, you can only say "Users similar to you liked this" — less satisfying and less actionable for the user.

The hybrid approach combines both: use collaborative filtering for users with rich history (many interactions), fall back to content-based for new users or new items. The transition is gradual — as a new user accumulates more interactions, the system progressively weights collaborative filtering more heavily. A common implementation: the ranking model has features from both CF (embedding similarity) and CB (content overlap), and the LTR model learns the optimal weighting automatically.

CONCEPT: Two-Tower Models

The two-tower model (also called dual-encoder) is the dominant architecture in industrial recommendation. One tower encodes the user (history, demographics, context), the other encodes the item (features, metadata). Both produce fixed-size vectors, and the score is their inner product:

score(u, i) = fθ(user_features)T · gφ(item_features)

This is architecturally identical to a bi-encoder for search. The user tower = query encoder. The item tower = document encoder. The training is the same (contrastive loss with in-batch negatives). The serving is the same (pre-compute item embeddings, ANN search at query time). This is why search and recommendations are converging.

The power of this architecture: at serving time, the item tower runs once offline (pre-compute all item embeddings), and the user tower runs once per request (embed the user's current context). Recommendation becomes an ANN lookup: find the K items whose embeddings are closest to the user embedding. This scales to billions of items because ANN search is sublinear.

Two-tower = bi-encoder for recommendations. If you understand one, you understand both. The only difference is what goes into each tower. Search: (query text, document text). Recommendations: (user history + context, item features). The math is identical, the infra is shared, and the training tricks (hard negatives, temperature scaling, in-batch negatives) transfer directly.

CODE: Two-Tower Recommender

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    def __init__(self, n_users, n_items, n_feats, dim=64):
        super().__init__()
        # User tower: embed user ID + context features
        self.user_emb = nn.Embedding(n_users, dim)
        self.user_mlp = nn.Sequential(
            nn.Linear(dim + n_feats, 128), nn.ReLU(),
            nn.Linear(128, dim)
        )
        # Item tower: embed item ID + content features
        self.item_emb = nn.Embedding(n_items, dim)
        self.item_mlp = nn.Sequential(
            nn.Linear(dim + n_feats, 128), nn.ReLU(),
            nn.Linear(128, dim)
        )

    def forward(self, user_ids, user_feats, item_ids, item_feats, tau=0.1):
        u = self.user_mlp(torch.cat([self.user_emb(user_ids), user_feats], -1))
        v = self.item_mlp(torch.cat([self.item_emb(item_ids), item_feats], -1))
        u = F.normalize(u, dim=-1)
        v = F.normalize(v, dim=-1)
        sim = u @ v.T / tau             # (B, B)
        labels = torch.arange(len(u), device=sim.device)
        return F.cross_entropy(sim, labels)  # same InfoNCE as search!

WORKED EXAMPLE: Matrix Factorization by Hand

Let us trace one step of SGD for matrix factorization. This builds intuition for what the embedding vectors actually represent.

User-Item matrix (5=loved, 1=hated, 0=unknown): Item1 Item2 Item3 Item4 User1: [5, 3, 0, 1 ] User2: [4, 0, 0, 1 ] User3: [1, 1, 0, 5 ] Init k=2 embeddings (random): u₁ = [0.5, 0.3] v₁ = [0.4, 0.6] u₂ = [0.6, 0.2] v₂ = [0.3, 0.5] u₃ = [0.1, 0.7] v₃ = [0.5, 0.1] (unknown — will predict) v₄ = [0.2, 0.8] Predict r₁₃ (User1, Item3): r̂₁₃ = u₁ᵀ · v₃ = 0.5×0.5 + 0.3×0.1 = 0.28 ← predicted rating SGD update (assume true r₁₃=4, lr=0.01): error = 4 - 0.28 = 3.72 u₁ ← u₁ + lr × error × v₃ = [0.5 + 0.01×3.72×0.5, 0.3 + 0.01×3.72×0.1] = [0.519, 0.304] v₃ ← v₃ + lr × error × u₁ = [0.5 + 0.01×3.72×0.5, 0.1 + 0.01×3.72×0.3] = [0.519, 0.111]

After thousands of SGD steps, the user and item embeddings converge so that uiT · vj ≈ rij for all observed ratings. The magic: unobserved ratings (the zeros) also get predicted values, giving us recommendations.

What do the embedding dimensions mean? After training, each dimension captures a latent factor. In movie recommendations, dimension 1 might capture "action vs. romance" and dimension 2 might capture "mainstream vs. indie." You cannot choose what the dimensions mean — the model discovers factors that best explain the observed ratings.

DEBUG: Popularity Bias

Symptom: Your recommender keeps suggesting the same 100 popular items to everyone. Long-tail items never surface. Root cause: Popular items have orders of magnitude more positive interactions, so their embeddings are better trained and closer to most user embeddings. Fix: (1) Popularity-weighted sampling during training — downsample popular items, upsample rare ones. (2) Add an explicit diversity objective during serving (Maximal Marginal Relevance, MMR). (3) Calibration: adjust scores by item popularity to decouple quality from exposure.

Symptom: New users get terrible recommendations. Root cause: Cold start. The user tower has no history to work with. Fix: Fall back to content-based recommendations for the first N interactions. Use demographic features in the user tower. Show an explicit preference quiz ("pick 5 topics you like").

DESIGN: Recommendation Serving Architecture

At scale, recommendation serving mirrors search serving: pre-compute item embeddings, serve from ANN index, add a lightweight reranking stage for personalization and business rules. The difference: recommendation must handle real-time user signals (they just clicked on X, immediately update what we show). This requires a feature store with sub-millisecond reads and a near-real-time feature pipeline that updates user features within seconds of an interaction.

CONCEPT: Hybrid Recommenders — The Production Reality

No production recommender uses a single approach. They all use hybrid architectures that combine multiple signals:

Stage 1: Candidate Generation (10ms)
Multiple sources in parallel: (a) Two-tower ANN recall from user embedding, (b) Item-to-item ANN from recently interacted items, (c) Popularity-based candidates for diversity, (d) Content-based candidates for cold-start items. Union all sources → ~500 candidates.
Stage 2: Feature Enrichment (5ms)
For each candidate, fetch: user-item interaction history, item popularity metrics, freshness, category affinity scores, social signals (what similar users liked).
Stage 3: Ranking (20ms)
A deep ranking model (e.g., DCN-V2 or DLRM) scores each candidate using all features. This model predicts multiple objectives: P(click), P(like), P(purchase), P(share).
Stage 4: Re-ranking (5ms)
Apply business rules: diversity (no more than 3 from same category), freshness boost, ad insertion, regulatory compliance. Multi-objective optimization: balance engagement with long-term user satisfaction.

CODE: Deep Cross Network (DCN-V2) Ranking Model

python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """A single cross layer: captures feature interactions."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, x0, x):
        # x0 is the original input, x is current layer output
        return x0 * self.W(x) + self.b + x  # element-wise cross

class DCNv2(nn.Module):
    """Deep & Cross Network V2 for ranking."""
    def __init__(self, input_dim, cross_layers=3, deep_dims=[256, 128]):
        super().__init__()
        # Cross network: explicit feature interactions
        self.crosses = nn.ModuleList([CrossLayer(input_dim) for _ in range(cross_layers)])
        # Deep network: implicit interactions via MLP
        layers = []
        prev = input_dim
        for d in deep_dims:
            layers += [nn.Linear(prev, d), nn.ReLU(), nn.BatchNorm1d(d)]
            prev = d
        self.deep = nn.Sequential(*layers)
        # Combine and predict
        self.out = nn.Linear(input_dim + deep_dims[-1], 1)

    def forward(self, x):
        # Cross tower
        x_cross = x
        for cross in self.crosses:
            x_cross = cross(x, x_cross)
        # Deep tower
        x_deep = self.deep(x)
        # Combine
        combined = torch.cat([x_cross, x_deep], dim=-1)
        return torch.sigmoid(self.out(combined))  # P(click)
Why DCN-V2 instead of a simple MLP? The cross layers explicitly model feature interactions (e.g., "user is in category X AND item is in category Y" → boost). An MLP can learn the same interactions implicitly, but needs exponentially more parameters. DCN-V2 is O(d) per cross layer vs. O(d²) for an equivalent MLP interaction. At feature dimensions of 500-1000 (common in production), this saves significant compute and improves convergence.

CONCEPT: Multi-Objective Ranking in Recommendations

Real-world recommendation systems optimize multiple objectives simultaneously. You do not just want clicks — you want clicks that lead to satisfaction, purchases, retention, and responsible content exposure:

ObjectiveWhy it mattersProxy metric
EngagementUsers interact with the contentP(click), P(like), P(comment)
SatisfactionUsers get value from what they consumeDwell time, session length, return rate
RevenuePlatform generates incomeP(purchase), ad click-through
ResponsibilityAvoid harmful, misleading, or low-quality contentP(report), P(hide), quality score
DiversityPrevent filter bubbles, surface new interestsTopic diversity, novelty score

The standard approach: train separate models for each objective, then combine with a weighted scalarization:

score(u, i) = w1 · P(click) + w2 · P(satisfy) + w3 · P(purchase) - w4 · P(harmful)

The weights are set by the product team (not ML) and encode business priorities. During elections, w4 (responsibility) might increase 5x. During holiday shopping season, w3 (revenue) gets a boost. These weights are the lever between ML and product strategy.

FRONTIER: Foundation Models for Recommendations (2024-2025)

LLM-based recommendations are an active frontier. Approaches like P5 and InstructRec frame recommendation as text generation: "Based on the user's history [item1, item2, item3], recommend the next item." These can handle cold start (they understand item descriptions) and transfer across domains. UniRec (2024) unifies retrieval, ranking, and generation in a single model. The tradeoff: LLM inference is 100-1000x slower than a two-tower dot product, making them impractical for first-stage retrieval but promising for reranking or explanation generation.

Sequence-based recommenders (SASRec, BERT4Rec) model the user's interaction history as a sequence and predict the next item using transformer self-attention. The user's last 50 interactions become a "sentence" and the next item is the "next word." This captures temporal dynamics (what the user is interested in RIGHT NOW, not just historically) and achieves 5-15% improvement over static two-tower models. The cost: sequence encoding at query time (cannot be pre-computed), adding ~10ms per user. Production systems use distilled 2-4 layer transformers for this.

Collaborative Filtering Visualizer

A user-item matrix where colors indicate ratings. Watch matrix factorization decompose it into user and item embeddings. Missing entries (gray) get predicted values.

A two-tower recommendation model and a bi-encoder search model both use InfoNCE loss with in-batch negatives. What is the fundamental architectural similarity?

Chapter 5: Large-Scale Training

Training a BERT-base model on a single GPU takes about 4 days on Wikipedia alone. Training a search ranking model on 10 billion query-document pairs? Months on a single GPU. Web-scale AI demands distributed training — spreading the work across hundreds or thousands of GPUs. But distributing training is not just "run the same code on more machines." It introduces synchronization bottlenecks, communication overhead, memory constraints, and subtle numerical issues that can silently degrade your model.

CONCEPT: Why Web-Scale Training is Different

Training a search model is fundamentally different from training a language model. Language models train on text-only data with a simple next-token prediction objective. Search models train on (query, document, label) triples with contrastive objectives, where the quality of negatives matters as much as the quality of positives, and where training data has complex biases (position bias, selection bias, popularity bias) that must be corrected.

The key constraint at web scale: your training data is so large that you cannot afford to iterate over it many times. A 10B-pair training set at 100ms/batch takes 7 days for one epoch on 128 GPUs. Training for 10 epochs would take 70 days — unacceptable for a team shipping monthly model updates. This means every training decision must be right the first time: learning rate schedule, negative mining strategy, data sampling, and early stopping criteria.

CONCEPT: Data Parallelism

The simplest form of distributed training: every GPU has a full copy of the model. Each GPU processes a different mini-batch. After the forward-backward pass, all GPUs average their gradients (the all-reduce operation) and apply the same update. This is equivalent to training with a larger batch size.

gglobal = (1/N) ∑i=1N gi     (average gradients from N GPUs)

The bottleneck: all-reduce communication. For a 110M-parameter model (BERT-base), each gradient is 440 MB (110M params × 4 bytes). With 8 GPUs, you must send 440 MB across the network after every step. The ring all-reduce algorithm distributes this: each GPU sends and receives (2 × 440 / 8) = 110 MB per step, overlapped with computation. With a fast interconnect (NVLink: 600 GB/s), the overhead is ~1ms per step. With Ethernet (25 Gbps): ~140ms. This is why GPU clusters use NVLink intra-node and InfiniBand inter-node.

CONCEPT: Model Parallelism

What if the model does not fit on a single GPU? A 7B-parameter model in FP16 takes 14 GB — fits on an A100 (80 GB). A 70B model takes 140 GB — does not fit. Model parallelism splits the model across GPUs:

TypeHow it splitsWhen to use
Tensor parallelism (TP)Split weight matrices within a layer across GPUs. E.g., a 4096×4096 matrix split across 4 GPUs: each holds 4096×1024Within a node (requires fast interconnect: all-to-all every layer)
Pipeline parallelism (PP)Assign different layers to different GPUs. GPU 0 runs layers 1-6, GPU 1 runs layers 7-12, etc.Across nodes (only sends activations between stages, less comm)
Sequence parallelism (SP)Split the sequence dimension. Each GPU processes a chunk of the sequenceFor very long sequences (needed with 100K+ context)

CONCEPT: Mixed Precision Training

Mixed precision uses FP16 for forward/backward passes and FP32 for weight updates. This halves memory usage and doubles throughput on GPUs with tensor cores. The key trick: keep a master copy of weights in FP32. After computing FP16 gradients, cast them to FP32, update the FP32 master weights, then cast back to FP16 for the next forward pass. This prevents the accumulation of small gradient updates that FP16 precision would round to zero. Loss scaling multiplies the loss by a large constant (e.g., 1024) before the backward pass to prevent FP16 gradient underflow, then divides after.

WORKED EXAMPLE: Memory Budget for Training

Let us compute the exact GPU memory required to train a BERT-base model. This calculation comes up in system design interviews when they ask "how many GPUs do you need?"

Model: BERT-base, 110M parameters FP32 Training (standard): Model weights: 110M × 4 bytes = 440 MB Gradients: 110M × 4 bytes = 440 MB Adam optimizer: 110M × 8 bytes = 880 MB (two momentum states per param) Activations: ~1-4 GB (depends on batch size × seq_len) Total: ~2.8-5.8 GB ← fits on one 8GB GPU at small batch FP16 Mixed Precision: FP16 weights: 110M × 2 bytes = 220 MB FP16 gradients: 110M × 2 bytes = 220 MB FP32 master copy: 110M × 4 bytes = 440 MB FP32 Adam states: 110M × 8 bytes = 880 MB FP16 activations: ~0.5-2 GB Total: ~2.3-3.8 GB ← 35% less memory, can increase batch size Scaling to 7B parameters (e.g., Llama-7B): FP16 Mixed Precision: FP16 weights: 7B × 2 bytes = 14 GB FP32 master: 7B × 4 bytes = 28 GB FP32 Adam: 7B × 8 bytes = 56 GB Activations: ~20-40 GB Total: ~118-138 GB ← needs 2× A100 80GB with model parallelism
The optimizer dominates memory. AdamW stores two extra FP32 values per parameter (first and second moment estimates). For a 7B model, that is 56 GB just for the optimizer. This is why DeepSpeed ZeRO-3 shards optimizer states across GPUs — it is the biggest memory win.

CODE: Distributed Training Setup

python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.cuda.amp import autocast, GradScaler

def train(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = BiEncoder(encoder=BertModel.from_pretrained("bert-base"))
    model = model.cuda(rank)
    model = DDP(model, device_ids=[rank])  # wrap for data parallelism

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    scaler = GradScaler()  # for mixed precision

    for batch in dataloader:
        optimizer.zero_grad()
        with autocast(dtype=torch.float16):  # FP16 forward
            loss = model(**batch)
        scaler.scale(loss).backward()     # FP16 backward + loss scaling
        scaler.step(optimizer)            # FP32 update
        scaler.update()

DEBUG: Training Instabilities at Scale

Symptom: Loss spikes or diverges when scaling from 8 to 64 GPUs. Root cause: The effective batch size jumped 8x (from 256 to 2048). Large batches require a learning rate warmup — without it, the initial gradients are too noisy and the model overshoots. Fix: Apply the linear scaling rule: when you multiply batch size by K, multiply learning rate by K. But only after a warmup period (typically 1-10% of total steps) where the LR linearly increases from 0 to the target.

Symptom: Training loss decreases but validation metrics plateau. Root cause: With very large batch sizes (>8192), the model converges to sharper minima that generalize worse. Fix: Use LAMB (Layerwise Adaptive Moments for Batch training) optimizer, which adapts the learning rate per layer. Or: add gradient noise. Or: reduce effective batch size with gradient accumulation instead of pure data parallelism.

Symptom: FP16 training suddenly produces NaN losses after 10K steps (was fine before). Root cause: Gradient explosion in FP16 — a gradient value exceeded FP16 max (65504) and became Inf, which propagated through the network as NaN. This often happens when the learning rate is too high or when the loss landscape has sharp cliffs (common in contrastive learning). Fix: (1) Reduce learning rate. (2) Enable gradient clipping (max_norm=1.0). (3) Use BF16 instead of FP16 — BF16 has the same exponent range as FP32 (max value 3.4e38 vs FP16's 65504), preventing overflow at the cost of slightly lower precision.

Symptom: Model performance differs between single-GPU and multi-GPU training. Root cause: Batch normalization statistics computed per-GPU diverge from global statistics. With 32 samples per GPU, the batch mean/variance is noisy. Fix: Use SyncBatchNorm which computes statistics across all GPUs. Or replace BatchNorm with LayerNorm (which is batch-size independent) — this is why modern transformers use LayerNorm exclusively.

DESIGN: Training Infrastructure for Web Scale

Training a search model on 10B pairs with a batch size of 4096, using 128 A100 GPUs: ~50,000 steps, ~6 hours wall-clock time. The cluster needs: (1) a fast distributed filesystem for training data (HDFS or cloud object storage with aggressive prefetching), (2) NVLink within nodes + InfiniBand between nodes, (3) checkpointing every 1000 steps to distributed storage, (4) automatic failure recovery (when a GPU dies mid-training, restart from the last checkpoint on a spare node).

WORKED EXAMPLE: Training Throughput Calculation

Model: BERT-base bi-encoder (110M params) Dataset: 10 billion (query, document) pairs Batch size: 4096 (across all GPUs) GPUs: 128 × A100 80GB (16 nodes × 8 GPUs) Batch per GPU: 4096 / 128 = 32 Forward + backward per step: Compute per GPU: ~80ms (32 pairs × BERT forward + backward) All-reduce: ~5ms (110M × 4 bytes = 440MB, ring all-reduce over NVLink+IB) Total per step: ~85ms Steps to see all data once: 10B / 4096 = 2,441,406 steps per epoch Training 3 epochs: 7,324,218 steps × 85ms/step = 622,558 seconds ≈ 7.2 days Reality check: DPR (Facebook) reported 40 hours on 8 GPUs for 21M pairs. Scaling to 10B pairs on 128 GPUs: 10B/21M × 40h / (128/8) = 1,190 hours / 16 = 74 hours ≈ 3 days. Our estimate of 7 days is conservative (includes overhead).
GPU hours matter for cost. 128 A100s × 7 days = 21,504 GPU-hours. At $2/GPU-hour (cloud spot pricing), that is $43,000 per training run. At 10 experiments to find the best hyperparameters: $430,000. This is why training efficiency research (GaLore, FSDP, mixed precision) directly translates to dollars saved.

CONCEPT: Gradient Accumulation — When You Cannot Use More GPUs

Sometimes you need a large effective batch size but cannot afford more GPUs. Gradient accumulation simulates a larger batch by accumulating gradients over multiple micro-batches before taking an optimizer step:

python
# Simulate batch_size=2048 on a single GPU with batch_size=32
accumulation_steps = 2048 // 32  # = 64 micro-batches
optimizer.zero_grad()

for i, batch in enumerate(dataloader):
    with autocast(dtype=torch.float16):
        loss = model(**batch) / accumulation_steps  # normalize loss
    scaler.scale(loss).backward()       # accumulate gradients

    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)          # update weights
        scaler.update()
        optimizer.zero_grad()           # reset for next accumulation

The trade-off: gradient accumulation gives you the same gradients as a large batch, but N times slower (you process N micro-batches sequentially instead of in parallel). It also does not help with in-batch negatives for contrastive learning — each micro-batch only has 32 negatives, not 2048. For contrastive training, you need actual data parallelism.

FRONTIER: Training Efficiency (2024-2025)

DeepSpeed ZeRO-3 shards optimizer states, gradients, AND parameters across GPUs, enabling training of models that are 8x larger than GPU memory allows. FSDP (Fully Sharded Data Parallel) in PyTorch provides similar functionality natively. Ring Attention (2024) enables training on sequences of arbitrary length by distributing the sequence across GPUs in a ring topology. GaLore (2024) reduces optimizer memory by projecting gradients into a low-rank subspace, enabling full-parameter fine-tuning of 7B models on a single 24GB GPU.

Cross-batch negatives (2024) solve the in-batch negative limitation: store embeddings from previous micro-batches in a queue and use them as negatives. This gives you 10,000+ negatives even with a micro-batch of 32, eliminating the need for large actual batch sizes in contrastive training. The memory overhead is trivial (a queue of embeddings) and the quality improvement is substantial (5-10% recall gain over standard in-batch negatives).

Distributed Training Simulator

Adjust the number of GPUs and batch size to see how training throughput scales. Watch for the communication bottleneck as GPU count increases.

GPUs 8
Batch/GPU 32
You scale training from 8 GPUs to 64 GPUs (8x) and the loss diverges immediately. What is the most likely cause and fix?

Chapter 6: Serving a Web Index

Training the best model in the world is worthless if you cannot serve it within a 200ms latency budget at 50,000 queries per second. Serving is where theory meets physics — memory bandwidth, network latency, cache hit rates, and tail latency dominate your engineering decisions. This chapter is about making web-scale AI actually work in production.

CONCEPT: Latency Budgets

A search result page must load in under 200ms for the user to perceive it as "instant." Here is how that budget is typically allocated:

StageBudgetWhat happens
Network roundtrip20-50msUser → edge → datacenter → edge → user
Query parsing + expansion5msSpell correct, synonym expand, intent classify
BM25 retrieval10-15msInverted index lookup across shards
ANN retrieval10-15msDense embedding search across shards
Feature enrichment5msFetch precomputed features from feature store
Neural reranking30-50msCross-encoder on top-200 candidates (GPU batch)
Result assembly5-10msBlend, diversify, personalize, serialize
Total85-150msLeaves 50-115ms margin for p99 variance

The critical insight: retrieval and reranking run in series, but BM25 and ANN retrieval run in parallel. If either takes too long, the whole pipeline misses the budget. This is why p99 latency (not p50) is the metric that matters — one slow shard delays everything.

CONCEPT: Quantization for Serving

Vector quantization compresses embedding vectors to reduce memory and speed up distance computation. Product Quantization (PQ) divides a 128-d vector into 16 sub-vectors of 8 dimensions each, then quantizes each sub-vector to its nearest centroid from a codebook of 256 entries. The compressed representation: 16 bytes (one byte per sub-vector). This is a 32x compression from the original 512 bytes (128 × 4 bytes for FP32).

Model quantization compresses the neural reranker. INT8 quantization reduces model size by 4x and doubles throughput on GPUs with INT8 tensor cores (A100, H100). The accuracy loss is typically <0.5% NDCG. For even more aggressive compression, GPTQ and AWQ (2024) achieve 4-bit quantization with <1% quality loss.

WORKED EXAMPLE: Quantization Math

Let us trace Product Quantization step by step. This is a favorite interview question: "explain PQ to me like I'm implementing it."

Original vector: 128 dimensions, FP32 → 512 bytes per vector Step 1: Split into sub-vectors 128-d vector → 16 sub-vectors of 8 dimensions each Step 2: Train codebooks (offline, on a sample of vectors) For each sub-vector position, run k-means with K=256 centroids Each centroid is an 8-d vector Codebook size: 16 positions × 256 centroids × 8 dims × 4 bytes = 131 KB (tiny!) Step 3: Compress each vector For each sub-vector, find its nearest centroid → store the centroid ID (1 byte, since K=256) Compressed vector: 16 bytes (one byte per sub-vector position) Compression ratio: 512 / 16 = 32× ← this is why PQ enables billion-scale search Step 4: Approximate distance computation Query sub-vector to centroid distance → precompute 16 × 256 = 4096 distances Sum 16 centroid distances per candidate → 16 additions per candidate For 1M candidates: 16M additions ≈ 0.5ms on CPU (vs 128M multiplications for exact)

The quality loss from PQ depends on K and the number of sub-vectors (m). Typical configurations:

ConfigBytes/vectorCompressionRecall@100 loss
PQ16 (m=16, K=256)1632×3-8%
PQ32 (m=32, K=256)3216×1-4%
PQ64 (m=64, K=256)640.5-2%
OPQ64 (optimized rotation + PQ64)640.3-1%
OPQ is the production choice. Optimized Product Quantization applies a learned rotation matrix to the vectors before splitting into sub-vectors. This rotation decorrelates the dimensions, making each sub-vector more independently quantizable. The 0.3-1% recall loss is nearly invisible in practice.

CONCEPT: Caching Strategies

Search queries follow a power law: the top 1% of queries account for 30-40% of traffic. Caching these "head queries" eliminates repeated computation. Multi-level caching:

Result cache: Cache the final rendered result page. Hit rate: 15-25% for web search, 40-60% for e-commerce (fewer unique queries). TTL: 5-60 minutes depending on freshness requirements.

Embedding cache: Cache query embeddings. An embedding takes ~5ms to compute; a cache hit takes ~0.1ms. Hit rate: 30-50% (many queries are near-duplicates that map to similar embeddings).

Feature cache: Cache document features. A feature store lookup takes 1-2ms; a cache hit takes ~0.05ms. Hit rate: 80-95% (popular documents appear in many results).

CONCEPT: Index Sharding Strategies

How you partition the index across machines affects both throughput and tail latency:

Document-based sharding: Hash each document ID to a shard. Every query hits ALL shards (scatter-gather). Pros: even load distribution, simple to scale horizontally. Cons: every query has fan-out = shard count, amplifying tail latency.

Term-based sharding: Assign posting lists by term. A query for "transformer attention" hits only the shards containing those terms (2-3 shards, not 64). Pros: lower fan-out, less tail latency. Cons: hot terms ("the", "is") create load imbalance; skewed traffic patterns.

Tiered sharding: The production approach at Google/Bing. Divide the index into tiers by document quality (PageRank). Tier 0 (top 1% of documents by quality) fits on fewer machines and is always searched. Tier 1 (next 10%) is searched only if Tier 0 does not return enough high-quality results. Tier 2 (everything else) is only searched for unusual queries. This reduces average fan-out by 3-5× while maintaining coverage.

Tiered sharding is the secret to Google's speed. When you search for "weather" or "facebook," the answer comes from Tier 0 (a tiny, heavily cached shard). The full 50B-document index is only searched for rare or complex queries. This is why common queries feel instantaneous (sub-50ms) while rare queries take longer (100-200ms).

DESIGN: Tail Latency Mitigation

At 50K QPS with 64-shard scatter-gather, tail latency dominates the user experience. Here is the complete toolkit for taming it:

TechniqueHow it worksLatency reductionCost
Hedged requestsSend to 2 replicas, take fastestp99 → ~p502× read traffic
Request deadlinesIf a shard does not respond in 30ms, skip it (return partial results)Hard cap on tailOccasional incomplete results
Canary queriesBefore routing traffic, probe each shard's health with a synthetic queryAvoids known-slow shardsSmall overhead per shard
Load-aware routingRoute to the replica with the shortest queue, not round-robin30-50% p99 reductionRequires load reporting
Over-provisioningKeep 20% spare capacity for handling load spikesPrevents saturation-induced latency20% more machines

In practice, all five techniques are used simultaneously. The most impactful single technique is hedged requests, which converts tail latency from a function of the slowest shard to a function of the second-slowest shard — a dramatic improvement when slowness is caused by transient issues (GC pauses, page cache misses, network congestion).

CODE: ANN Index with Quantization

python
import faiss
import numpy as np

def build_index(embeddings: np.ndarray, nlist=4096, m=64, nbits=8):
    """Build an IVF-PQ index for billion-scale search.
    embeddings: (N, dim) float32 array
    nlist: number of IVF clusters
    m: number of PQ sub-quantizers (compressed size = m bytes)
    nbits: bits per sub-quantizer (256 centroids = 8 bits)
    """
    dim = embeddings.shape[1]

    # Define the compound index: IVF for partitioning + PQ for compression
    quantizer = faiss.IndexFlatL2(dim)        # coarse quantizer
    index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, nbits)

    index.train(embeddings[:min(1_000_000, len(embeddings))])  # train on sample
    index.add(embeddings)                     # add all vectors
    index.nprobe = 64                         # search 64 of 4096 clusters

    return index  # ~m bytes per vector (vs 4*dim for full float32)

def search(index, query_emb: np.ndarray, top_k=100):
    distances, indices = index.search(query_emb, top_k)
    return indices[0], distances[0]  # top-k doc IDs and distances

DEBUG: Latency Spikes

Symptom: p99 latency spikes from 150ms to 800ms every hour for 5 minutes. Root cause: Garbage collection pauses on JVM-based services (Elasticsearch, Lucene). Fix: Tune GC settings (use G1GC with max pause target of 50ms), or migrate hot paths to non-GC languages (C++/Rust for the ANN index, which is what FAISS does).

Symptom: ANN latency doubles after an index rebuild. Root cause: The new index is loaded into cold memory; the OS page cache has not warmed up yet. Fix: Pre-warm the index by issuing synthetic queries before routing live traffic to the new shard. Or use a stale-while-reindex pattern: keep serving from the old index while the new one warms up.

Symptom: p50 latency is 80ms (good) but p99 is 450ms (terrible). The gap is 5.6x. Root cause: Fan-out amplification. With 64 BM25 shards, the p99 of the request is approximately the p99 of the slowest shard. If each shard has p99 = 30ms, the probability that ALL 64 shards respond within 30ms is (0.99)^64 = 0.52. So 48% of requests wait for a slow shard. Fix: Hedged requests. Send each shard request to 2 replicas; take whichever responds first. This converts each shard's distribution from p99 to approximately p50 at the cost of 2x the read traffic. Google's "The Tail at Scale" paper (Dean & Barroso, 2013) is the canonical reference.

CODE: Hedged Request Implementation

python
import asyncio
from typing import Any

async def hedged_request(
    replicas: list[str],   # list of replica endpoints
    query: dict,
    timeout_ms: float = 50,  # hedge after this timeout
) -> Any:
    """Send to primary replica. If no response in timeout_ms,
    send to secondary. Return whichever responds first."""

    async def fetch(endpoint):
        return await http_client.post(endpoint, json=query)

    # Start primary request
    primary = asyncio.create_task(fetch(replicas[0]))

    # Wait up to timeout_ms
    done, pending = await asyncio.wait(
        [primary], timeout=timeout_ms / 1000
    )

    if done:
        return primary.result()  # primary was fast enough

    # Primary is slow — hedge with secondary
    secondary = asyncio.create_task(fetch(replicas[1]))
    done, pending = await asyncio.wait(
        [primary, secondary], return_when=asyncio.FIRST_COMPLETED
    )

    result = done.pop().result()
    for p in pending:
        p.cancel()  # cancel the slower one
    return result

DESIGN: Serving Architecture at Scale

50K QPS across a 10-billion-document index. Each query hits: (1) BM25 scatter-gather across 64 shards, (2) ANN scatter-gather across 16 shards, (3) feature store batch read for 200 candidates, (4) GPU reranking on a pool of 8 A100s. Total machines: ~200 (with replication). Total cost: ~$500K/month cloud. The biggest cost driver is GPU reranking; the biggest reliability risk is tail latency on the scatter-gather fan-out.

WORKED EXAMPLE: Cost Estimation

An interviewer asks: "Estimate the infrastructure cost for a 10B-document search system at 50K QPS." Walk through it methodically:

BM25 Serving (CPU-based): Index size: 10B docs × ~500 bytes/doc avg posting = 5 TB total Sharded across 64 machines (80 GB/shard, fits in RAM) 2× replication = 128 machines c5.4xlarge (16 vCPU, 32GB RAM): $0.68/hr × 128 = $87/hr Monthly: $87 × 730 = $63,500 ANN Serving (CPU or GPU): Index: 10B × 64 bytes (PQ64) = 640 GB 16 shards × 2 replicas = 32 machines (40 GB/shard) r5.4xlarge (16 vCPU, 128GB RAM): $1.01/hr × 32 = $32/hr Monthly: $32 × 730 = $23,600 GPU Reranking: 50K QPS, 100 candidates/query, batch 64, ~2ms/batch on A100 QPS per GPU: 64 / 0.002 = 32K candidates/sec → 320 queries/sec GPUs needed: 50K / 320 = 156 → use 8 machines with 20 GPUs p4d.24xlarge (8× A100): $32.77/hr × 8 = $262/hr Monthly: $262 × 730 = $191,300 Feature Store (Redis cluster): ~200 candidates × 50K QPS = 10M feature reads/sec r6g.4xlarge (16 vCPU, 128GB): can do ~500K reads/sec Need ~20 nodes: $1.01/hr × 20 = $20/hr Monthly: $20 × 730 = $14,600 Total monthly: ~$293,000 GPU reranking is 65% of total cost — this is the lever to optimize.
The cost optimization playbook: (1) Reduce reranking candidates (100→50 saves 50% GPU cost, loses ~0.5% NDCG). (2) Distill the cross-encoder into a cheaper model (MiniLM reduces 12 layers to 6, ~2× faster at ~1% quality loss). (3) Use result caching for head queries (20% cache hit rate = 20% less compute). (4) Use spot instances for the ANN/BM25 fleet (3× cheaper, acceptable for stateless serving with fast failover).

CONCEPT: Index Update Strategies

A web index must be continuously updated as new pages are crawled and old pages change. But rebuilding the entire index (50B documents) takes hours. Three strategies for keeping the index fresh:

Batch rebuild: Rebuild the full index daily. Simple, but 24-hour staleness. Good enough for most web search (news vertical is the exception).

Incremental update: Add new documents to the existing index without full rebuild. For inverted indexes, append new postings. For HNSW vector indexes, insert new nodes (HNSW supports incremental insertion). For IVF-PQ, assign new vectors to the nearest existing centroid. Quality degrades slowly as the centroid distribution drifts; retrain centroids weekly.

Lambda architecture: Maintain two indexes: a large "batch" index (rebuilt daily) and a small "real-time" index (updated within minutes). At query time, search both and merge results. This gives sub-hour freshness with the quality of a full rebuild. The real-time index is small (<1% of documents) and fits on a single machine.

FRONTIER: GPU-Native Indexes (2024-2025)

cuVS (NVIDIA's CUDA Vector Search, 2024) runs IVF-PQ and CAGRA (a GPU-native graph index) entirely on GPU, achieving 10-100x speedup over CPU FAISS for billion-scale indexes. Milvus 2.4 and Weaviate integrate GPU-accelerated ANN. The emerging pattern: keep the ANN index on GPU alongside the reranking model, eliminating CPU-GPU data transfer. This collapses the retrieval + reranking latency from 60ms to 15ms.

Scalar quantization (SQ8/SQ4) is emerging as a simpler alternative to Product Quantization. Instead of the complex PQ codebook, just round each FP32 dimension to INT8 (4x compression) or INT4 (8x compression). Quality loss is comparable to PQ at similar compression ratios, but SQ is dramatically simpler to implement and faster to encode. Combined with GPU-native indexes, SQ4 + CAGRA can search 1 billion vectors in under 2ms.

Latency Budget Allocator

Adjust the latency budget for each pipeline stage. The bar chart shows cumulative latency vs. the 200ms deadline. Red = over budget.

Rerank candidates 100
ANN nprobe 64
Your search system processes 50K QPS and uses scatter-gather across 64 BM25 shards. What determines the tail latency of each query?

Chapter 7: Learning to Rank

So far we have scoring functions (BM25, bi-encoder similarity, cross-encoder relevance) that each produce a number for a (query, document) pair. But search results are a list, not independent scores. A result at position 1 matters far more than a result at position 10. Learning to Rank (LTR) is the discipline of directly optimizing the ordering of results, not just individual relevance scores.

CONCEPT: Three Paradigms

Pointwise: Treat ranking as regression. Predict the relevance score for each (query, document) pair independently, then sort by predicted score. Simple, but ignores the fact that what matters is the relative ordering, not the absolute score. A model that predicts relevance 0.8 for all documents produces a random ranking — useless even though each prediction is "close."

Pairwise: For each pair of documents (di, dj) under the same query, predict which one is more relevant. The loss penalizes incorrect orderings. The most famous pairwise method is RankNet, which uses a cross-entropy loss on the probability that di should rank above dj:

L = -∑(i,j) yij log(σ(si - sj)) + (1 - yij) log(1 - σ(si - sj))

where σ is the sigmoid function and yij = 1 if di should rank above dj.

Listwise: Directly optimize a list-level metric like NDCG. The challenge: NDCG involves sorting, which is not differentiable. LambdaMART solves this elegantly — it computes the gradient of the pairwise loss weighted by the NDCG gain from swapping each pair. Documents whose swap would most improve NDCG get the strongest gradient signal. LambdaMART (implemented in XGBoost/LightGBM) is the most widely deployed LTR algorithm in production search.

LambdaMART's genius: making NDCG differentiable. It does not optimize NDCG directly (impossible — sorting is not differentiable). Instead, for every pair of documents, it asks: "if I swapped these two, how much would NDCG change?" This ΔNDCG becomes the weight on the pairwise gradient. The result: a gradient-boosted tree that effectively optimizes NDCG without ever computing its gradient.

CONCEPT: NDCG

Normalized Discounted Cumulative Gain (NDCG) is the standard metric for ranking quality. It measures how well the ranking puts the most relevant documents at the top:

DCG@k = ∑i=1k (2reli - 1) / log2(i + 1)
NDCG@k = DCG@k / IDCG@k

where IDCG is the DCG of the ideal ranking (perfectly sorted by relevance). The log denominator discounts positions: position 1 gets full credit, position 10 gets ~30% credit. This captures user behavior — people look at the first few results carefully and barely glance at results on page 2.

CODE: LambdaMART with LightGBM

python
import lightgbm as lgb
import numpy as np

# Features: BM25, bi-encoder sim, doc freshness, click rate, etc.
# Labels: relevance grades (0=irrelevant, 1=fair, 2=good, 3=perfect)
# Groups: number of docs per query (for listwise optimization)

train_data = lgb.Dataset(
    X_train, label=y_train, group=group_train,
    feature_name=["bm25", "dense_sim", "freshness",
                  "click_rate", "doc_length", "title_match"]
)

params = {
    "objective": "lambdarank",      # listwise LTR
    "metric": "ndcg",
    "ndcg_eval_at": [1, 5, 10],
    "num_leaves": 255,
    "learning_rate": 0.05,
    "min_data_in_leaf": 50,
    "lambdarank_truncation_level": 20,  # only optimize top-20
}

model = lgb.train(params, train_data, num_boost_round=1000,
                  valid_sets=[val_data], callbacks=[lgb.early_stopping(50)])

WORKED EXAMPLE: NDCG by Hand

Let us compute NDCG@5 for a specific ranking. This is a whiteboard favorite.

Ranking: [Doc A, Doc B, Doc C, Doc D, Doc E] Relevance: [3, 0, 2, 1, 3 ] (0=irrelevant, 3=perfect) DCG@5 = (2^3-1)/log₂(2) + (2^0-1)/log₂(3) + (2^2-1)/log₂(4) + (2^1-1)/log₂(5) + (2^3-1)/log₂(6) = 7/1 + 0/1.58 + 3/2 + 1/2.32 + 7/2.58 = 7.0 + 0 + 1.5 + 0.43 + 2.71 = 11.64 Ideal ranking: [3, 3, 2, 1, 0] (sorted by relevance) IDCG@5 = 7/1 + 7/1.58 + 3/2 + 1/2.32 + 0/2.58 = 7.0 + 4.43 + 1.5 + 0.43 + 0 = 13.36 NDCG@5 = 11.64 / 13.36 = 0.871

Notice: the ranking puts a relevant doc (rel=3) at position 5, wasting it. If we swapped positions 2 and 5, NDCG would jump because position 2 has a higher discount factor. This is exactly the insight that LambdaMART exploits — it computes the NDCG delta for every possible swap and focuses the gradient on swaps that matter most.

DEBUG: Feature Leakage in LTR

Symptom: LTR model achieves NDCG@10 of 0.95 on test set but only 0.72 in production. Root cause: Feature leakage — the training data includes click-through rate, which is a lagging indicator of relevance. A document shown at position 1 gets high CTR regardless of relevance (position bias). The model learns to rank already-high-position documents higher, creating a feedback loop. Fix: Use position-debiased CTR (divide by position-specific baseline CTR), or train on randomized traffic where positions are shuffled.

DESIGN: Feature Engineering for LTR

A production LTR model typically uses 100-500 features. They fall into categories:

CategoryExample featuresRelative importance
Query-Doc MatchBM25, dense similarity, title match, URL match, exact phrase matchHighest (40-50% of model gain)
Document QualityPageRank, spam score, freshness (days since update), content length, readability scoreHigh (20-30%)
User ContextSearch history, geographic location, device type, time of day, languageMedium (10-15%)
Interaction SignalsHistorical CTR, average dwell time, long-click rate, reformulation rateMedium (10-15%)
Cross FeaturesQuery length × doc length, query category × doc category, user expertise × doc difficultyLow but valuable (5-10%)

The most important features are almost always BM25 and dense similarity — the rest provide 5-10% incremental improvement. But that incremental improvement matters enormously at web scale: 5% NDCG improvement translates to millions more satisfied users per day.

CODE: Position-Debiased Click Model

python
import numpy as np
from collections import defaultdict

class PositionBasedModel:
    """Estimate position-debiased relevance from click logs."""
    def __init__(self, max_pos=10):
        self.exam_prob = np.ones(max_pos)    # P(examine | position)
        self.rel_counts = defaultdict(lambda: [0, 0])  # [clicks, impressions]

    def fit(self, click_log, iterations=20):
        """EM algorithm: alternate between estimating examination
        and relevance probabilities."""
        for _ in range(iterations):
            # E-step: estimate P(relevant | click, pos)
            pos_clicks = np.zeros(len(self.exam_prob))
            pos_imps = np.zeros(len(self.exam_prob))
            for query, doc, pos, clicked in click_log:
                if pos >= len(self.exam_prob): continue
                pos_imps[pos] += 1
                if clicked:
                    pos_clicks[pos] += 1
                    self.rel_counts[(query, doc)][0] += 1
                self.rel_counts[(query, doc)][1] += 1
            # M-step: update examination probabilities
            self.exam_prob = np.clip(pos_clicks / (pos_imps + 1e-8), 0.01, 1.0)

    def debiased_relevance(self, query, doc, pos):
        """P(relevant | doc) = P(click) / P(examine | pos)"""
        clicks, imps = self.rel_counts.get((query, doc), [0, 1])
        raw_ctr = clicks / max(1, imps)
        return raw_ctr / max(0.01, self.exam_prob[min(pos, len(self.exam_prob)-1)])

CONCEPT: Feature Interaction in LTR

One of the most powerful aspects of tree-based LTR is automatic feature interaction discovery. A decision tree naturally captures interactions like "if BM25 > 5 AND freshness < 7 days, then boost score." These interactions are hard to specify manually but critical for ranking quality.

Here are the top feature interactions typically learned by LambdaMART, in order of importance:

InteractionWhat it capturesExample
BM25 × title_matchExact query-title alignment signals navigational intent"wikipedia" + title contains "Wikipedia" → rank #1
freshness × query_typeNews queries need recent docs; evergreen queries do not"election results 2024" needs docs from today
dense_sim × doc_qualitySemantically similar but low-quality docs should be demotedSpam farm copies high-quality content
CTR × position_seenHigh CTR at low positions is a stronger signal than high CTR at position 1A doc clicked 30% of times shown at position 8 is excellent
doc_length × query_lengthLong queries (specific) match better with long docs (detailed)"how to fix React useState infinite loop" → StackOverflow answer

FRONTIER: Neural LTR (2024-2025)

Traditional LTR uses gradient-boosted trees (LambdaMART). The frontier: transformer-based LTR that takes the full list of candidates as input and produces a permutation. Models like SetRank and PiRank use self-attention across candidates to capture inter-document dependencies (e.g., two documents from the same source should not both appear at top). ListT5 (2024) uses a seq2seq model to directly generate the ranked list as a sequence of document IDs, optimized with REINFORCE.

The most practical frontier advancement is distillation from cross-encoders into LambdaMART. You run a cross-encoder on your training queries to generate soft relevance labels (much richer than binary click/no-click), then train LambdaMART on these soft labels. This gives you cross-encoder quality at tree-inference speed — the best of both worlds. Google's 2024 paper on "teacher-student ranking" showed this approach closes 60-70% of the gap between LambdaMART and cross-encoder reranking while adding zero serving latency.

NDCG Calculator

Drag documents to reorder them. Relevance grades are shown (3=perfect, 0=irrelevant). NDCG updates in real-time as you reorder. Try to maximize NDCG@5.

LambdaMART weights its pairwise gradients by the NDCG change from swapping each pair. Why is this weighting crucial?

Chapter 8: Relevance & Evaluation

You deployed a new ranking model. NDCG@10 improved 3% on your offline test set. But is it actually better for users? Offline metrics can lie — they are computed on a static dataset that may not represent real user behavior. Online evaluation (A/B testing) is the ground truth, but it is expensive and slow. This chapter covers the full evaluation stack: offline metrics, online experiments, and the dangerous gap between them.

CONCEPT: Offline Metrics

MRR (Mean Reciprocal Rank): For queries with a single correct answer (navigational queries), MRR measures 1/rank of the first correct result, averaged across queries. If the correct result is at position 3, the reciprocal rank is 1/3. MRR is the right metric when users want ONE thing.

MRR = (1/|Q|) ∑q ∈ Q 1/rankq

MAP (Mean Average Precision): For queries with multiple correct results, MAP averages the precision at each relevant position. If relevant documents are at positions 1, 3, and 7, the average precision is (1/1 + 2/3 + 3/7) / 3 = 0.72. MAP penalizes relevant results that appear after irrelevant ones.

NDCG@k: As covered in Chapter 7, NDCG handles graded relevance (not just binary relevant/irrelevant). It is the standard metric for web search because relevance is naturally graded — a perfect answer is better than a good answer, which is better than a tangentially related document.

CONCEPT: Online Evaluation — A/B Testing

Offline metrics tell you how the model performs on historical data. Online A/B testing tells you how it performs on real users. You split live traffic 50/50 between the old model (control) and the new model (treatment). After 1-2 weeks (depending on traffic volume), you compare key metrics:

MetricWhat it capturesTypical improvement bar
CTRUsers click something — basic engagement+0.5%
Dwell timeUsers spend time on clicked results — result quality+1%
Reformulation rateUsers rephrase their query — indicates dissatisfaction-0.5%
Session success rateUsers find what they need (no further searches)+0.3%
Long-click rateUsers click and stay >30 seconds — strongest quality signal+0.2%
Offline gains do not guarantee online gains. A model with +5% NDCG offline might show 0% or even negative online improvement. Common reasons: (1) the offline test set is not representative, (2) the model is better at easy queries but worse at hard queries (which matter more to users), (3) the model reduces diversity, leading to user fatigue. Always A/B test before declaring victory.

CONCEPT: Interleaving

Interleaving is a more sensitive online evaluation method than A/B testing. Instead of showing each user results from only one model, you merge results from both models into a single list (using balanced interleaving or team-draft interleaving). You then measure which model's results get more clicks. Interleaving detects preferences with 10-100x less traffic than A/B testing because every user is exposed to both models.

WORKED EXAMPLE: A/B Test Power Analysis

Before launching an A/B test, you must compute the required sample size. If you run the test too short, you will not detect real improvements (Type II error). If you declare a winner too early, you might be fooled by noise (Type I error).

Goal: detect a 1% absolute improvement in CTR (from 35% to 36%) Significance level (α): 0.05 (5% chance of false positive) Power (1-β): 0.80 (80% chance of detecting a real 1% difference) n = (Z_{α/2} + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)² n = (1.96 + 0.84)² × (0.35×0.65 + 0.36×0.64) / (0.01)² n = 7.84 × (0.2275 + 0.2304) / 0.0001 n = 7.84 × 0.4579 / 0.0001 n ≈ 35,900 per group At 50K queries per day, 50/50 split: Per group: 25K/day Required duration: 35,900 / 25,000 ≈ 1.5 days In practice: run for at least 7 days to capture day-of-week effects.
The most common A/B testing mistake: "We saw a 2% improvement after 24 hours, let's ship it." Without sufficient sample size, a 2% observed difference could easily be noise. Always compute the required sample size BEFORE starting the test, and do not peek at results until you reach it (or use sequential testing methods like always-valid p-values).

CONCEPT: Click Models

Clicks are the cheapest signal of relevance, but they are biased. Users click position 1 far more than position 5, regardless of relevance (position bias). Users click attractive snippets even for irrelevant documents (attraction bias). Click models decompose observed clicks into examination probability (depends on position) and relevance probability (depends on document quality):

P(click | position, doc) = P(examine | position) · P(relevant | doc)

The Position-Based Model (PBM) assumes examination depends only on position. The Cascade Model assumes users scan from top to bottom and stop after the first relevant click. The Dynamic Bayesian Network (DBN) is the most realistic: it models the user as scanning top-to-bottom, clicking if they examine a result AND find it relevant, and potentially continuing past clicks if the clicked result was not fully satisfying. These models let you extract unbiased relevance labels from biased click data — essential for training LTR models.

In practice, the choice of click model matters less than having one at all. A naive approach (raw CTR as relevance) introduces severe position bias. Any click model (even the simple PBM) removes 70-80% of this bias. The more sophisticated models (DBN) remove 90%+ but are harder to fit and require more data.

WORKED EXAMPLE: Position Bias in Practice

Let us see how position bias distorts raw click data and why debiasing is critical:

Position 1: 10,000 impressions, 4,500 clicks → raw CTR = 45% Position 5: 10,000 impressions, 800 clicks → raw CTR = 8% Position 10: 10,000 impressions, 200 clicks → raw CTR = 2% But how much of this is the document's quality vs. the position? Examination probability (from PBM on randomized data): P(examine | pos=1) = 0.95 ← users almost always look at pos 1 P(examine | pos=5) = 0.35 ← only 35% of users scroll this far P(examine | pos=10) = 0.10 ← 10% reach page bottom Debiased relevance = raw_CTR / P(examine): Position 1: 0.45 / 0.95 = 0.47 ← truly relevant Position 5: 0.08 / 0.35 = 0.23 ← much more relevant than raw CTR suggested! Position 10: 0.02 / 0.10 = 0.20 ← also more relevant than raw 2% suggested

Without debiasing, documents that happen to be shown at position 5 get a CTR of 8% and are deemed low-quality. But after debiasing, their true relevance is 23% — they are actually decent results that suffer from low exposure. Training an LTR model on raw clicks creates a feedback loop: documents at low positions get low labels → model ranks them lower → they get even less exposure → their CTR drops further. Debiasing breaks this loop.

CODE: Computing NDCG

python
import numpy as np

def dcg(relevances, k):
    relevances = np.array(relevances)[:k]
    gains = 2 ** relevances - 1
    discounts = np.log2(np.arange(2, len(relevances) + 2))
    return np.sum(gains / discounts)

def ndcg(relevances, k=10):
    actual = dcg(relevances, k)
    ideal = dcg(sorted(relevances, reverse=True), k)
    return actual / ideal if ideal > 0 else 0.0

# Example: document relevance grades in the order they appear
ranking = [3, 0, 2, 1, 0, 3, 0, 0, 1, 2]
print(f"NDCG@5: {ndcg(ranking, 5):.3f}")   # ~0.82
print(f"NDCG@10: {ndcg(ranking, 10):.3f}") # ~0.79

DEBUG: Metric Disagreements

Symptom: NDCG@10 improved 3% offline, but online A/B test shows -0.5% session success rate. Root cause: The new model is better at ranking navigational queries (where there is one right answer) but worse at informational queries (where users want diverse perspectives). NDCG does not capture diversity — it only measures relevance. Fix: Add α-NDCG (diversity-aware NDCG) to offline metrics. Segment online results by query type to identify where the regression happens.

CODE: Interleaving Implementation

python
import random

def team_draft_interleave(ranking_a: list, ranking_b: list, k=10) -> tuple:
    """Interleave two rankings using Team Draft method.
    Returns (interleaved_list, team_assignments).
    team_assignments[i] = 'A' or 'B' indicating which model contributed doc at position i.
    """
    result, teams = [], []
    seen = set()
    ptr_a, ptr_b = 0, 0

    while len(result) < k:
        # Randomly choose which team picks next
        team_a_turn = random.random() < 0.5

        if team_a_turn:
            while ptr_a < len(ranking_a) and ranking_a[ptr_a] in seen:
                ptr_a += 1
            if ptr_a < len(ranking_a):
                result.append(ranking_a[ptr_a])
                teams.append('A')
                seen.add(ranking_a[ptr_a])
                ptr_a += 1
        else:
            while ptr_b < len(ranking_b) and ranking_b[ptr_b] in seen:
                ptr_b += 1
            if ptr_b < len(ranking_b):
                result.append(ranking_b[ptr_b])
                teams.append('B')
                seen.add(ranking_b[ptr_b])
                ptr_b += 1

    return result, teams

# After collecting clicks, count wins for each team:
# If user clicked positions where team_assignments='A', team A wins.
# Aggregate across thousands of queries → statistical test.

DESIGN: Evaluation Infrastructure

A production evaluation system needs: (1) an offline eval pipeline that runs every model checkpoint against 5-10 eval sets (each targeting different query types), (2) a canary deployment system that routes 1% of traffic to the new model and monitors key metrics for anomalies, (3) a full A/B testing framework with proper statistical power analysis (minimum detectable effect, required sample size), (4) interleaving for rapid comparison of close models.

DESIGN: Eval Set Construction

The quality of your offline evaluation depends entirely on the quality of your eval sets. A good eval framework maintains 5-10 eval sets, each capturing a different failure mode:

Eval SetSizeWhat it testsHow labels are collected
Head queries5K queriesQuality on the most common queriesHuman judgments (3 raters per pair)
Tail queries10K queriesLong-tail query coverageClick-based labels (debiased)
Freshness queries1K queriesCurrent events, new contentHuman judgments (weekly refresh)
Adversarial queries500 queriesAmbiguous, tricky, or adversarial inputsExpert annotation
Vertical queries2K/verticalQuality per vertical (images, video, news)Vertical-specific human judgments
International queries5K/languageNon-English query qualityNative-speaker human judgments

CODE: Maximal Marginal Relevance (MMR)

MMR balances relevance with diversity by penalizing candidates that are too similar to already-selected results:

python
import numpy as np

def mmr_rerank(query_emb, doc_embs, doc_scores, k=10, lam=0.7):
    """Maximal Marginal Relevance: balance relevance vs diversity.
    query_emb: (dim,) query embedding
    doc_embs:  (N, dim) document embeddings
    doc_scores: (N,) relevance scores from reranker
    lam: 1.0 = pure relevance, 0.0 = pure diversity
    Returns: indices of selected documents
    """
    selected = []
    remaining = list(range(len(doc_scores)))

    for _ in range(k):
        best_idx, best_score = -1, float('-inf')
        for idx in remaining:
            # Relevance term
            relevance = doc_scores[idx]
            # Diversity term: max similarity to already selected
            if selected:
                sims = [np.dot(doc_embs[idx], doc_embs[s])
                        for s in selected]
                max_sim = max(sims)
            else:
                max_sim = 0
            # MMR score
            score = lam * relevance - (1 - lam) * max_sim
            if score > best_score:
                best_score, best_idx = score, idx
        selected.append(best_idx)
        remaining.remove(best_idx)

    return selected
λ = 0.7 is the production sweet spot. At λ = 1.0, you get pure relevance ranking (no diversity). At λ = 0.5, you aggressively diversify and may push relevant results down. Empirically, λ = 0.7 gives the best user satisfaction: enough diversity to avoid "all the same" complaints, enough relevance to keep the best results on top. This is one of those parameters that every search company has independently converged on.

WORKED EXAMPLE: Online vs. Offline Metric Disagreement

This scenario comes up in every search interview and catches unprepared candidates:

Offline eval: Model A: NDCG@10 = 0.72 (baseline) Model B: NDCG@10 = 0.75 (+4.2% improvement) Online A/B test (2 weeks, 500K queries per arm): Model A: CTR = 38.2%, Reformulation rate = 12.1% Model B: CTR = 38.0%, Reformulation rate = 12.3% Result: B is WORSE online despite being BETTER offline. What happened?

Diagnosis steps:

1. Segment by query type. Model B improved NDCG on informational queries (Wikipedia-style) by +8% but degraded transactional queries (buy/download) by -3%. Informational queries are more common in the test set, inflating offline metrics. But transactional queries drive CTR (users click to buy), so the online metric drops.

2. Check the eval set representativeness. The offline eval set was built from human judgments collected 6 months ago. The query distribution has shifted (more product searches due to holiday season). The eval set no longer represents current traffic.

3. Check for diversity regression. Model B's higher NDCG comes from putting the single best result higher (NDCG rewards this). But it also reduces result diversity, showing 3 results from the same domain. Users scan once, see no variety, and reformulate. Lower diversity → higher reformulation rate → worse user experience.

4. Resolution: Add transactional query eval set. Add diversity metrics (α-NDCG) to offline eval. Add domain diversity constraint to result assembly. Re-run A/B test.

FRONTIER: LLM-as-Judge for Search (2024-2025)

Human relevance judgments are expensive ($0.50-2.00 per judgment). LLM-as-Judge uses GPT-4 or Claude to assess relevance at 1000x the speed and 100x lower cost. TREC 2024 showed that GPT-4 judgments correlate r=0.85 with human judgments for graded relevance. The catch: LLMs have their own biases (verbosity bias, anchoring to position 1). Calibration with a small human-annotated set is essential.

The production setup: use LLM-as-Judge for rapid offline eval (run on every model checkpoint, ~$50 per eval), but always validate with human judgments before launching (reserve $5-10K budget for human eval on promising models). This gives you fast iteration cycles with a human-quality safety net.

Evaluation Metrics Dashboard

Compare two ranking models across multiple metrics. The radar chart shows relative strengths. Hover over metrics to see details.

Interleaving detects ranking preferences with 10-100x less traffic than standard A/B testing. Why?

Chapter 9: Data Pipelines at Scale

Models are only as good as the data they train on. At web scale, the data challenges are staggering: billions of web pages that change daily, click logs generating terabytes per hour, feature stores that must serve sub-millisecond lookups, and data quality issues that silently degrade model performance. This chapter covers the infrastructure that turns raw web data into training signals.

CONCEPT: Web Crawling

A web crawler discovers and downloads web pages. At scale, this is a distributed system problem: how to crawl billions of URLs efficiently, politely (respecting robots.txt and rate limits), and freshly (re-crawling changed pages). The URL frontier is a priority queue of URLs to crawl, ordered by (1) importance (PageRank), (2) change frequency, and (3) time since last crawl.

Key metrics: coverage (what fraction of the web you have indexed), freshness (how recently each page was crawled), and politeness (requests per second per domain — typically 1 req/sec per domain to avoid overwhelming servers).

CODE: URL Frontier with Priority Scheduling

python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class CrawlTask:
    priority: float          # lower = more urgent
    url: str = field(compare=False)
    domain: str = field(compare=False)
    last_crawled: float = field(compare=False, default=0)

class URLFrontier:
    def __init__(self, politeness_delay=1.0):
        self.heap = []                     # priority queue
        self.domain_last_access = {}       # domain → timestamp
        self.politeness = politeness_delay  # seconds between same-domain requests

    def add(self, url, domain, pagerank=0.0, change_freq=1.0):
        # Priority: lower pagerank (more important pages first)
        # + staleness bonus (older crawls get priority)
        priority = -pagerank + 1.0 / max(0.01, change_freq)
        heapq.heappush(self.heap, CrawlTask(priority, url, domain))

    def next_url(self):
        """Return next URL respecting politeness constraints."""
        now = time.time()
        skipped = []
        while self.heap:
            task = heapq.heappop(self.heap)
            last = self.domain_last_access.get(task.domain, 0)
            if now - last >= self.politeness:
                self.domain_last_access[task.domain] = now
                for s in skipped:  # re-add skipped tasks
                    heapq.heappush(self.heap, s)
                return task.url
            skipped.append(task)  # domain too recent, skip
        # All domains on cooldown
        for s in skipped:
            heapq.heappush(self.heap, s)
        return None

CONCEPT: Deduplication

The web is full of duplicates. Exact copies (same URL, different mirrors), near-duplicates (same article, different ads), and boilerplate pages (product listings that differ only in one field). Training on duplicates wastes compute and biases the model toward duplicated content.

Exact deduplication: Hash each document (SHA-256). Same hash = exact duplicate. O(1) per document with a hash table. Simple, fast, catches only exact copies.

Near-duplicate detection: MinHash + LSH (Locality-Sensitive Hashing). Represent each document as a set of shingles (n-gram character sequences). Compute k min-hash signatures. Documents with high Jaccard similarity (many matching min-hashes) are near-duplicates. Can process billions of documents with a single pass over the data.

WORKED EXAMPLE: MinHash Similarity Estimation

Let us trace MinHash dedup on two documents to see why it works.

Doc A: "the quick brown fox jumps" Doc B: "the quick brown fox leaps over" 5-char shingles: A: {"the q", "he qu", "e qui", "quic", "uick ", "ick b", "ck br", "k bro", "brow", "rown", "own f", "wn fo", "n fox", " fox ", "fox j", "ox ju", "x jum", " jump", "jumps"} B: {"the q", "he qu", "e qui", "quic", "uick ", "ick b", "ck br", "k bro", "brow", "rown", "own f", "wn fo", "n fox", " fox ", "fox l", "ox le", "x lea", " leap", "leaps", "eaps ", "aps o", "ps ov", "s ove", " over"} Jaccard similarity: |A ∩ B| / |A ∪ B| = 14 / 29 ≈ 0.48 This means ~48% of their shingles overlap. Threshold for near-duplicate: typically 0.8+. These docs are similar but not near-duplicates (different verbs).

The beauty of MinHash: the probability that two documents' MinHash values agree equals their Jaccard similarity. With 128 hash functions, the estimated Jaccard is accurate to ±0.09 with 95% confidence. LSH (Locality-Sensitive Hashing) then buckets documents by bands of hash values, so you only compare documents that share at least one band — reducing the all-pairs comparison from O(n²) to O(n).

CONCEPT: Feature Stores

A feature store bridges the gap between batch feature computation (offline, slow, cheap) and online feature serving (real-time, fast, expensive). Features like "CTR for this document on similar queries" are computed daily in a batch job (Spark/Flink), written to a feature store (Redis, DynamoDB, Feast), and read at sub-millisecond latency during serving.

The key abstraction: features are keyed by (entity_id, feature_name, timestamp). The feature store handles: (1) point-in-time correctness (for training, only use features available at the time of the interaction — no future leakage), (2) online/offline consistency (the same feature definition is used for training and serving), (3) versioning (rollback to previous feature values if a computation is buggy).

Point-in-time correctness is the most common source of training/serving skew. If your training pipeline computes "average CTR over all time" but your serving pipeline computes "CTR over last 7 days," the feature values will differ between training and serving. The model learned to interpret the training distribution, but receives different values at serving time. This causes silent quality degradation. The fix: define every feature once (as code), compute it identically for both training and serving, and use timestamp-aware joins during training data construction.

Production feature stores (Feast, Tecton, Hopsworks) provide exactly this abstraction. The online store (Redis, DynamoDB) serves features at p99 < 5ms. The offline store (BigQuery, Hive, Delta Lake) serves historical features for training. A materialization job keeps them in sync. The cost: non-trivial infrastructure complexity, but the alternative (ad-hoc feature computation that inevitably diverges between training and serving) is worse.

CODE: MinHash Deduplication

python
import hashlib
from collections import defaultdict

def shingles(text, k=5):
    """Generate character k-shingles from text."""
    return {text[i:i+k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=128):
    """Compute MinHash signature: num_hashes minimum hash values."""
    sig = []
    for i in range(num_hashes):
        min_h = float('inf')
        for s in shingle_set:
            h = int(hashlib.md5(f"{i}_{s}".encode()).hexdigest(), 16)
            min_h = min(min_h, h)
        sig.append(min_h)
    return tuple(sig)

def lsh_buckets(signature, bands=16):
    """Split signature into bands for LSH bucketing."""
    rows = len(signature) // bands
    buckets = []
    for b in range(bands):
        band = signature[b*rows : (b+1)*rows]
        bucket = hashlib.md5(str(band).encode()).hexdigest()
        buckets.append((b, bucket))
    return buckets  # docs sharing a bucket in any band are candidates

DEBUG: Data Quality Issues

Symptom: Model quality degrades over 3 months despite no model changes. Root cause: A data pipeline change silently dropped non-English documents from the training set. The model gradually became English-centric as older training data aged out. Fix: Monitor data distribution metrics (language distribution, document length distribution, source diversity) and alert on significant shifts. Add data validation checks (Great Expectations, Deequ) to the pipeline.

Symptom: Click-through rate features in the feature store are stale by 24 hours. Root cause: The batch job that computes CTR runs once daily at 3 AM. New content published during the day has no CTR feature and falls back to a default value, getting systematically under-ranked. Fix: Add a real-time click counter (using Kafka + Flink) that updates CTR estimates within minutes. Use the batch-computed CTR as a prior and the real-time counter as a Bayesian update.

Symptom: Your embedding model suddenly produces embeddings that cluster into two distinct regions instead of a smooth distribution. Root cause: A new batch of training data was accidentally duplicated 5x, creating a strong bias toward those documents' topics. The model overfit to the duplicated topics during the last training run. Fix: (1) Add deduplication to the training data pipeline (not just the document pipeline). (2) Add a training data validation check that alerts if any (query, doc) pair appears more than N times. (3) Monitor embedding space statistics (average cosine similarity, cluster count via k-means) after each training run.

CONCEPT: Online Feature Engineering with Streaming

The highest-impact features for ranking are real-time signals that change within minutes or hours. Building a streaming feature pipeline requires different engineering from batch pipelines:

FeatureUpdate latencyInfrastructure
Trending score5 minutesFlink/Spark Streaming counting queries per topic over sliding windows
Real-time CTR1 minuteKafka consumer updating Redis counters with Bayesian smoothing
User last-clickedSecondsDirect write to user session store on click event
Breaking news flagMinutesAnomaly detector on query volume spikes per topic
Document freshnessHoursCrawler metadata propagated through the index update pipeline

The architecture: Kafka as the event bus, Flink for stream processing, Redis for real-time feature serving, with a reconciliation job that periodically validates streaming features against batch-computed ground truth. The key risk: streaming features can be noisy (a click spike could be a bot attack, not genuine interest). Always smooth with batch baselines.

DESIGN: End-to-End Data Architecture

The data flows: (1) Crawler writes raw HTML to object storage (S3/GCS). (2) A parsing job extracts text, metadata, links. Deduplicates using MinHash. (3) An embedding job runs documents through the bi-encoder, writes vectors to the ANN index. (4) A feature computation job computes document quality signals, writes to the feature store. (5) Click logs stream through Kafka into a data lake (Delta Lake/Iceberg). (6) A training data pipeline joins click logs with document features to produce (query, document, label) training triples. (7) Model training reads from the data lake. The entire pipeline runs daily with ~6-hour end-to-end latency.

CODE: Data Quality Monitoring

python
import numpy as np
from dataclasses import dataclass

@dataclass
class DataQualityReport:
    total_docs: int
    null_rate: dict       # feature_name → fraction of nulls
    distribution_drift: dict  # feature_name → KL divergence from baseline
    duplicate_rate: float
    alerts: list

def check_training_data(current_batch, baseline_stats):
    """Run data quality checks before training.
    Catches silent pipeline bugs that degrade model quality."""
    alerts = []
    report = DataQualityReport(
        total_docs=len(current_batch),
        null_rate={}, distribution_drift={},
        duplicate_rate=0.0, alerts=[]
    )

    # Check 1: Volume anomaly (±30% from expected)
    expected = baseline_stats["expected_volume"]
    if abs(len(current_batch) - expected) / expected > 0.3:
        alerts.append(f"Volume anomaly: {len(current_batch)} vs expected {expected}")

    # Check 2: Null rate per feature
    for feature in baseline_stats["features"]:
        nulls = sum(1 for row in current_batch if row.get(feature) is None)
        null_rate = nulls / max(1, len(current_batch))
        report.null_rate[feature] = null_rate
        if null_rate > baseline_stats["max_null_rate"].get(feature, 0.01):
            alerts.append(f"High null rate for {feature}: {null_rate:.2%}")

    # Check 3: Distribution drift (simplified KL divergence)
    # In production, use Great Expectations or custom Flink monitoring
    report.alerts = alerts
    return report

DESIGN: Training Data Pipeline Architecture

Building (query, document, relevance_label) training triples at scale is its own engineering challenge. Here is the production pipeline:

1. Click Log Collection
Kafka ingests click events: (query, doc_id, position, clicked, dwell_time, timestamp, user_id). Volume: ~1B events/day for a large search engine.
2. Session Extraction
Group clicks by (user, session). A session ends after 30 minutes of inactivity. Extract: queries issued, documents clicked, reformulations, final click (likely the answer).
3. Position Debiasing
Apply the Position-Based Model (PBM) to estimate P(relevant | doc) from P(click | position, doc). This corrects for the fact that position 1 gets 5× more clicks than position 5 regardless of quality.
4. Label Aggregation
Aggregate across sessions: a (query, doc) pair seen 1000 times with 400 debiased clicks gets relevance label 0.4. Bin into grades: [0-0.1]=irrelevant, [0.1-0.3]=fair, [0.3-0.6]=good, [0.6-1.0]=excellent.
5. Feature Joining
For each (query, doc) pair, join with document features (freshness, PageRank, length) and query features (popularity, category). Time-travel join: only use features available at the time of the click (no future leakage).

CONCEPT: ETL for ML — Training Data as a Product

The most underappreciated part of web-scale search is treating your training data as a product with its own SLOs:

SLOTargetHow to measure
FreshnessTraining data <24 hours staleMax timestamp in training set vs. current time
Coverage>95% of active documents in training setFraction of served documents that appear in training data
Label quality<5% noisy labelsAgreement rate between click-based labels and human judgments
Feature completeness<1% null rate for critical featuresAutomated null-rate monitoring per feature
Distribution stability<0.05 KL divergence from baselineDaily distribution checks per feature
When the data pipeline breaks silently, the model degrades slowly. Unlike a crash (which triggers an alert), a silent data issue (e.g., 10% of documents missing a feature) degrades model quality by 1-2% per week. By the time someone notices in A/B test results, you have lost weeks of model quality. Automated data quality monitoring with hard alerts is non-negotiable at web scale.

FRONTIER: Synthetic Data for Search (2024-2025)

InPars and Promptagator use LLMs to generate synthetic (query, document) pairs for training retrieval models. Given a document, GPT-4 generates plausible queries that the document would answer. This lets you bootstrap training data for new domains (medical, legal, code) where labeled data is scarce. UDAPDR (2024) combines synthetic data generation with domain adaptation, achieving near-human-labeled quality for specialized verticals.

The synthetic data pipeline: (1) Sample 100K diverse documents from the target domain. (2) For each document, prompt GPT-4: "Generate 5 natural search queries that this document would be the best answer for." (3) Use the generated (query, document) pairs as positive training examples. (4) Mine hard negatives from BM25. (5) Fine-tune your bi-encoder on this data. Cost: ~$500 for 500K synthetic pairs (at GPT-4 pricing). Quality: within 3-5% of human-labeled data on most benchmarks. Time: 24 hours end-to-end vs. 4-6 weeks for human annotation.

Data Pipeline Monitor

Watch data flow through the pipeline. Each stage shows throughput and health. Inject a data quality issue to see how it propagates downstream.

Your LTR model's offline NDCG has been slowly degrading over 3 months. No model changes were made. What is the most likely cause?

Chapter 10: The Convergence

For decades, search and recommendations were separate disciplines with separate teams, separate models, and separate infrastructure. Search people cared about queries and documents. Recommendations people cared about users and items. But the boundaries are dissolving. This chapter explains why, and what the unified future looks like.

CONCEPT: Why They Are Converging

Three forces are driving the merger:

1. Shared infrastructure. Both search and recommendations use: embedding models (bi-encoders), vector indexes (FAISS/HNSW), feature stores (Redis), LTR models (LambdaMART), and A/B testing frameworks. Running separate infrastructure for each is wasteful. Unifying saves 30-50% in infrastructure costs.

2. Shared embeddings. A user searching "wireless headphones" and a user who browsed three headphone product pages express the same intent through different channels. The embedding model that encodes "wireless headphones" and the embedding model that encodes a user's browsing history can (and should) produce vectors in the same space. Then both explicit search and implicit recommendations become nearest-neighbor lookups in one index.

3. Generative models blur the line. When a user asks ChatGPT "what are the best wireless headphones?", is that search or a recommendation? It is both. The model retrieves information (search), personalizes based on context (recommendation), and generates a coherent response (language modeling). The three capabilities are inseparable.

CONCEPT: Retrieval-Augmented Generation (RAG)

RAG combines retrieval (search) with generation (LLM). Given a query, first retrieve relevant documents using dense retrieval, then condition the LLM on the retrieved documents to generate an answer. This is the architecture behind every production LLM chatbot that needs to answer questions about up-to-date or private information.

1. Embed Query
Run query through bi-encoder: "best headphones" → [0.23, -0.15, ...] (128-d vector)
2. Retrieve
ANN search in product index. Return top-10 product descriptions with scores.
3. Generate
Feed query + retrieved documents to LLM. Generate a personalized answer with citations.

RAG is where search meets generation. The retriever is a search system. The generator is a recommendation system (it selects and synthesizes information for this specific user). And the whole thing runs on transformers.

WORKED EXAMPLE: RAG Data Flow

Let us trace a complete RAG request with concrete tensor shapes and timing.

User query: "What are the latest iPhone 16 features?" Step 1: Embed query (5ms) Tokenize: [101, 2054, 2024, 1996, 6745, ..., 102] → (1, 12) token IDs BERT forward: (1, 12, 768) hidden states [CLS] + projection: (1, 128) query embedding L2 normalize: (1, 128) unit vector Step 2: ANN retrieval (3ms) Search FAISS IVF4096_PQ64 index with nprobe=32 Returns: top-5 document IDs + distances [doc_42891, doc_15023, doc_8847, doc_91204, doc_3327] Step 3: Fetch documents (2ms) doc_42891: "iPhone 16 features 48MP camera, A18 chip..." (512 tokens) doc_15023: "Apple announces iPhone 16 lineup with..." (487 tokens) ... Total context: ~2000 tokens Step 4: LLM generation (800ms for ~200 output tokens) Prompt: "Given the following context:\n{context}\n\nAnswer: {query}" Input: ~2100 tokens (system + context + query) Output: ~200 tokens at 250 tokens/sec Total latency: 5 + 3 + 2 + 800 = 810ms Bottleneck is overwhelmingly the LLM generation step. This is why RAG latency optimization focuses on the LLM, not retrieval.

CONCEPT: Generative Retrieval

Generative retrieval goes further: instead of retrieving documents and then generating, the model directly generates the answer from its parametric memory, augmented by retrieval when needed. The Differentiable Search Index (DSI) memorizes document IDs during training and generates them given a query. GENRE generates entity names (Wikipedia titles) instead of opaque IDs, making the output interpretable.

The radical idea: the entire inverted index, vector index, and reranking pipeline could be replaced by a single very large language model that has "memorized" the web. We are not there yet — current generative retrieval works at millions of documents, not billions — but the trajectory is clear.

The practical implication for your career: if you understand both the traditional pipeline (BM25, inverted index, feature engineering, LambdaMART) AND the neural approaches (dense retrieval, transformers, generative retrieval), you can navigate the transition. Companies need people who can maintain the existing system while building the next one. That is exactly the staff-level skill set this lesson prepares you for.

DESIGN: When to Use Which Convergence Pattern

Use caseBest approach (2025)Why
Web search (50B docs)Hybrid BM25 + dense, cross-encoder rerankScale demands BM25 first-pass; cross-encoder adds semantic understanding
E-commerce search (10M products)Unified two-tower for search + recsSmall enough for one index; users seamlessly switch between searching and browsing
Enterprise knowledge base (1M docs)RAG with cross-encoder gatingUsers want answers, not links; RAG generates natural language responses
Content feed (recs-heavy)Sequence model (SASRec) + two-tower candidate genTemporal user interest matters more than query matching
Conversational search (agent)Multi-turn RAG with query decompositionUsers iteratively refine; agent maintains context across turns

CODE: RAG Pipeline

python
import faiss
import numpy as np

class RAGPipeline:
    def __init__(self, encoder, index, docs, llm_client):
        self.encoder = encoder      # bi-encoder for queries
        self.index = index          # FAISS ANN index
        self.docs = docs            # list of document texts
        self.llm = llm_client       # LLM API client

    def answer(self, query: str, top_k=5) -> str:
        # 1. Embed query
        q_emb = self.encoder.encode(query)  # (1, 128)

        # 2. Retrieve
        _, indices = self.index.search(q_emb, top_k)
        context = "\n\n".join(self.docs[i] for i in indices[0])

        # 3. Generate
        prompt = f"""Answer based on the context below.
Context: {context}
Question: {query}
Answer:"""
        return self.llm.generate(prompt, max_tokens=500)

DEBUG: RAG Failures

Symptom: RAG system confidently generates wrong answers even when the correct document was retrieved. Root cause: The LLM ignores the retrieved context and relies on its parametric knowledge (which is outdated). This is called "retrieval neglect." Fix: (1) Put retrieved context at the end of the prompt (closer to the generation position), (2) use "Answer ONLY based on the provided context" instruction, (3) fine-tune the LLM specifically for RAG (REPLUG, RAFT).

Symptom: RAG returns irrelevant passages and the LLM hallucinates an answer based on them. Root cause: The retrieval model and the generation model have a "domain gap." The retriever was trained on one distribution (e.g., Wikipedia) but is being used on another (e.g., technical documentation). Fix: (1) Fine-tune the retriever on domain-specific data. (2) Add a relevance threshold — if no retrieved document scores above 0.5, return "I don't know" instead of hallucinating. (3) Use a cross-encoder as a relevance gate between retrieval and generation.

Symptom: RAG works well for factual questions but fails for comparison/synthesis queries ("compare X vs Y"). Root cause: Each retrieval pass returns documents about X or Y, but not documents that compare them. The LLM must synthesize across multiple documents, which is harder. Fix: (1) Decompose the query: separately retrieve for X and Y, then prompt the LLM to compare. (2) Use multi-hop retrieval: retrieve once, read, generate a follow-up query, retrieve again. (3) Increase the number of retrieved documents (k=10 instead of k=5) to increase the chance of getting diverse perspectives.

CODE: RAG with Relevance Gating

python
class GatedRAG:
    """RAG with a cross-encoder relevance gate.
    If no retrieved doc passes the relevance threshold,
    return a 'not enough information' response instead of hallucinating."""

    def __init__(self, retriever, reranker, llm, threshold=0.5):
        self.retriever = retriever
        self.reranker = reranker     # cross-encoder
        self.llm = llm
        self.threshold = threshold

    def answer(self, query: str) -> str:
        # 1. Retrieve candidates
        candidates = self.retriever.search(query, top_k=20)

        # 2. Rerank and filter by relevance
        scored = self.reranker.rerank(query, candidates, top_k=5)
        relevant = [(doc, s) for doc, s in scored if s > self.threshold]

        # 3. Gate: if nothing relevant, don't hallucinate
        if not relevant:
            return "I don't have enough reliable information to answer this."

        # 4. Generate answer from relevant context only
        context = "\n\n".join(doc for doc, _ in relevant)
        return self.llm.generate(
            f"Context:\n{context}\n\nQuestion: {query}\n"
            f"Answer based ONLY on the context above:"
        )

CODE: Unified Embedding for Search + Recs

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedEncoder(nn.Module):
    """One encoder, two heads: search + recommendations."""
    def __init__(self, base_encoder, dim=768, proj_dim=128):
        super().__init__()
        self.encoder = base_encoder  # shared BERT backbone
        # Shared bottom layers, separate projection heads
        self.search_head = nn.Linear(dim, proj_dim)   # for explicit queries
        self.rec_head = nn.Linear(dim, proj_dim)      # for user profiles
        self.item_head = nn.Linear(dim, proj_dim)     # shared item encoding

    def encode_query(self, input_ids, mask):
        h = self.encoder(input_ids, mask).last_hidden_state[:, 0]
        return F.normalize(self.search_head(h), dim=-1)

    def encode_user(self, history_embs):
        # Mean pool over user's last N item embeddings
        h = history_embs.mean(dim=1)  # (B, dim)
        return F.normalize(self.rec_head(h), dim=-1)

    def encode_item(self, input_ids, mask):
        h = self.encoder(input_ids, mask).last_hidden_state[:, 0]
        return F.normalize(self.item_head(h), dim=-1)

    def search_loss(self, q_ids, q_mask, d_ids, d_mask, tau=0.05):
        q = self.encode_query(q_ids, q_mask)
        d = self.encode_item(d_ids, d_mask)
        return F.cross_entropy(q @ d.T / tau, torch.arange(len(q), device=q.device))

    def rec_loss(self, user_hist, item_ids, item_mask, tau=0.1):
        u = self.encode_user(user_hist)
        i = self.encode_item(item_ids, item_mask)
        return F.cross_entropy(u @ i.T / tau, torch.arange(len(u), device=u.device))
Shared backbone, separate heads. The BERT encoder learns general text understanding from both search and recommendation signals. The search head optimizes for query-document relevance. The rec head optimizes for user-item affinity. The item head is shared — documents live in one embedding space regardless of whether they were retrieved by search or recommended by the rec system. This is the architectural insight that makes unification work.

DESIGN: Unified Search + Recs Architecture

The converged architecture: one embedding model serves both search queries and user profiles. One ANN index stores all content. Search queries and user profiles are both embedded into the same space. The only difference is the ranking stage: search applies a relevance-focused reranker, recommendations apply a personalization-focused reranker. Both feed into the same result assembly layer that handles diversity, business rules, and A/B testing.

DEBUG: When Unification Hurts

Symptom: After unifying the search and recommendation embedding models, search quality drops 2% while recommendation quality improves 3%. Root cause: The two tasks have conflicting gradient signals. Search wants "wireless headphones" close to product reviews; recommendations want it close to related products the user has purchased (which may include cables, cases, stands). The shared encoder is compromised. Fix: Share only the bottom 8 of 12 transformer layers; keep the top 4 task-specific. This lets the lower layers learn general text understanding while the upper layers specialize for each task. This "hard parameter sharing with task-specific heads" pattern is standard in multi-task learning.

FRONTIER: The Next Five Years (2024-2029)

Search as a conversation. Users will increasingly interact with search through multi-turn dialogue, not single queries. The system must maintain context across turns (remember that "it" refers to the headphone from the previous query).

Agentic search. AI agents that can browse the web, fill out forms, compare prices, and complete purchases — this is exactly what Parallel is building. The agent IS the search interface. The agent does not return ten blue links; it returns an answer, a comparison table, or a completed action. This requires the agent to have a full search pipeline internally (retrieve, rank, synthesize) plus the ability to take actions (click, fill forms, submit orders).

Personalized foundation models. Instead of a generic retrieval model + personalization layer, train user-specific adapters (LoRA) that bake personalization into the model itself. Early results show 5-10% improvement over two-stage approaches.

Multimodal search and recommendations. Users increasingly search with images (screenshot this product, find something similar), voice (conversational queries), and mixed modality (photo of a dish → find recipe). Models like SigLIP (2024) and Jina CLIP v2 produce embeddings for text and images in the same space, enabling cross-modal retrieval with the same ANN infrastructure.

DESIGN: The Parallel Architecture

How this all comes together at a company like Parallel, which builds AI agents with programmatic web access:

1. Agent receives user intent
"Find me the best noise-canceling headphones under $200 with good reviews for commuting." This is not a simple search query — it has multiple constraints (price, use case, review quality).
2. Agent decomposes into sub-queries
The agent generates structured search queries: "noise canceling headphones", filters: price<$200, check reviews mentioning "commute." Each sub-query hits the search pipeline.
3. Search + Recommendation pipeline
For each sub-query: retrieve candidates (hybrid BM25 + dense), enrich with features (price, rating, review count), rerank with cross-encoder, filter by constraints.
4. Agent synthesizes and acts
The agent compares results across sub-queries, generates a recommendation with reasoning ("Sony WH-1000XM5 ranks highest because..."), and can complete the purchase if authorized.
Convergence Diagram

Three domains converging into one system. Toggle between 2015 (separate) and 2025 (unified) architectures to see how the boundaries dissolve.

A user asks your RAG system "What is the latest iPhone?" The system retrieves a document about iPhone 15 (correct) but generates an answer about iPhone 14 (incorrect). What happened?

Chapter 11: Showcase — Interactive Search Pipeline

Everything from the previous 10 chapters comes together here. You are going to operate a complete search pipeline — from query to ranked results — and tune every stage. Type a query, choose your retrieval model, set the latency budget, and watch the pipeline process your request in real time. Break it. Fix it. Understand how every knob affects quality and performance.

This is your system design interview, live. An interviewer asks: "Design a search system that handles 50K QPS with <200ms p99 and >0.7 NDCG@10." You draw this pipeline on the whiteboard, explain each stage, discuss the tradeoffs of each knob, and defend your choices. This simulation lets you practice exactly that.
Full Search Pipeline Simulator

Configure the pipeline, then click Execute Query to watch it run. Adjust parameters to optimize the quality-latency tradeoff. Try to hit NDCG>0.7 with latency<200ms.

Retrieval
Reranker
Index size (docs) 1B
Rerank candidates 100
Relevance threshold 0.30

How to Read the Simulator

The top section shows two critical metrics: latency (must be <200ms) and NDCG@10 (want >0.7). The waterfall chart below shows each pipeline stage's contribution to total latency. The red dashed line at 200ms is the budget. If any bar extends past it, you are over budget.

The interaction between settings reveals the fundamental tradeoffs of search system design:

More reranking candidates = higher quality but higher latency. Going from 50 to 200 candidates improves NDCG by ~2% but triples reranking time. Going from 200 to 500 gives diminishing returns (<0.5% NDCG) but doubles the time again. The production sweet spot is 100-200 candidates.

Bigger index = higher latency for both BM25 and ANN. A 100B-document index takes 2-3x longer to search than a 1B index. This is where sharding and caching earn their keep — without them, latency grows linearly with index size.

Cross-encoder vs. ColBERT vs. LambdaMART. Cross-encoder gives the best quality but highest latency. ColBERT is ~60% of the latency at ~90% of the quality (token-level interaction without full cross-attention). LambdaMART is nearly free latency-wise (tree traversal is microseconds) but relies on hand-crafted features.

Experiment with these scenarios:

ScenarioSettingsWhat you learn
Speed demonBM25 only, no reranker, 10M docsUltra-fast but low relevance — misses semantic matches
Quality maximizerHybrid, cross-encoder, 500 candidatesBest NDCG but latency blows the budget at scale
Production sweet spotHybrid, cross-encoder, 100 candidates, 1B docsThe standard production configuration — balance of quality and speed
ColBERT middle groundHybrid, ColBERT, 200 candidatesBetter latency than cross-encoder at slightly lower quality

Tuning Guide: How to Hit Both Targets

If you cannot hit NDCG > 0.7 and latency < 200ms simultaneously, here is the decision tree:

Latency too high? (1) Reduce reranking candidates from 200 to 100 (—1% NDCG, —50% rerank latency). (2) Switch from cross-encoder to ColBERT (—2% NDCG, —60% rerank latency). (3) Reduce ANN nprobe (—0.5% recall, —30% ANN latency). (4) Add result caching for head queries (0% quality impact, —20-40% average latency).

NDCG too low? (1) Switch from BM25-only to hybrid retrieval (+5-10% NDCG, +5ms latency). (2) Add cross-encoder reranking (+8-12% NDCG, +40ms latency). (3) Increase reranking candidates from 50 to 200 (+2-3% NDCG, +30ms latency). (4) Fine-tune the bi-encoder on domain-specific data (+3-5% NDCG, no latency impact).

The Pareto frontier for web search: you cannot improve NDCG without increasing latency, and vice versa. The art is finding the point on this frontier that matches your product's requirements. A search engine optimizes for quality (users wait 200ms). An autocomplete system optimizes for speed (must respond in 50ms). An ad auction optimizes for revenue-per-query within a latency budget.

In a system design interview, you must justify your reranking choice. The interviewer asks: "Why not just use the cross-encoder on all 1000 retrieved candidates instead of just the top 100?" What is the best answer?

Chapter 12: Interview Arsenal

This chapter distills everything into the formats you will face in a staff-level interview at Parallel (or any web-scale AI company). You get: a cheat sheet of the most important concepts, system design questions with solution skeletons, coding exercises, and debugging scenarios.

Cheat Sheet: Core Concepts

ConceptOne-sentence definitionWhen to mention it
BM25Lexical scoring with TF saturation and length normalization — the baseline that every system includesAny retrieval question
Bi-encoderIndependent encoding of query and document into shared embedding space, scored by dot productDense retrieval, candidate generation
Cross-encoderJoint encoding of (query, document) pair for maximum accuracy at high latency costReranking stage
ColBERT (MaxSim)Token-level late interaction: more accurate than bi-encoder, faster than cross-encoder, but larger indexWhen asked about the bi-encoder vs cross-encoder tradeoff
InfoNCE lossContrastive loss that pushes positive pairs together and negative pairs apart using in-batch negativesTraining embedding models
Two-tower modelBi-encoder for recommendations: user tower + item tower, inner product scoringRecommendation system design
LambdaMARTGradient-boosted trees with NDCG-aware gradient weighting — production standard for LTRLearning to rank, feature engineering
NDCG@kStandard ranking metric that accounts for graded relevance and position discountAny evaluation question
IVF-PQFAISS compound index: cluster-then-compress for billion-scale ANN searchServing, latency, scale questions
RAGRetrieve relevant documents, then generate answer conditioned on themAny question about LLMs + search

System Design Questions

Q1: "Design a search system for a web index of 50 billion documents."

Skeleton: (1) Two-stage retrieval: BM25 over sharded inverted index + dense ANN over sharded vector index, in parallel. (2) Merge top-1000 candidates. (3) Feature enrichment from Redis feature store. (4) LambdaMART for lightweight ranking (1000 → 200). (5) Cross-encoder reranking on GPU pool (200 → 20). (6) Result assembly with diversity and personalization. Latency budget: 150ms. Key decisions: scatter-gather fanout width, reranking batch size, cache layers.

Q2: "Design a unified search + recommendation system."

Skeleton: One bi-encoder produces embeddings for both queries and user profiles. One ANN index stores all content embeddings. Search: query → embed → ANN → rerank (relevance-focused). Recs: user history → embed → ANN → rerank (personalization-focused). Shared infrastructure: embedding model, vector index, feature store, A/B testing. Separate: reranking models (different objectives), result assembly rules (different diversity constraints).

Q3: "Your search quality has degraded 5% over the last month. Diagnose."

Skeleton: (1) Check offline metrics on recent eval sets — is the model itself worse? (2) Check data pipeline health — are training features stale, missing, or distribution-shifted? (3) Check index freshness — has the crawl coverage dropped? (4) Check serving infrastructure — are latency spikes causing timeouts in the reranking stage, falling back to lower-quality results? (5) Check external factors — has the query distribution shifted (seasonal, news events)?

Coding Exercises

Exercise 1: Implement BM25. Given a corpus and query, compute BM25 scores and return top-k. Handle edge cases: empty queries, missing terms, single-document corpus. Expected: 30-40 lines. (See Chapter 1.)
Exercise 2: Implement NDCG@k. Given a list of relevance grades in ranking order, compute NDCG@k. Handle edge cases: all-zero relevances, k larger than list length. Expected: 15-20 lines. (See Chapter 8.)
Exercise 3: Build a FAISS index. Given a numpy array of embeddings, build an IVF-PQ index, train it, and implement search. Tune nprobe for recall vs. latency tradeoff. Expected: 20-30 lines. (See Chapter 6.)
Exercise 4: Implement position-debiased CTR. Given a click log of (query, document, position, clicked), compute position-debiased relevance estimates using the Position-Based Model. Expected: 25-35 lines. (See Chapter 8.)
Exercise 5: Implement Reciprocal Rank Fusion. Given two ranked lists of document IDs (one from BM25, one from dense retrieval), combine them using RRF with k=60. Return the fused ranking. Expected: 10-15 lines. (See Chapter 12.)
Exercise 6: Implement MMR diversity reranking. Given a set of document embeddings and their relevance scores, select k documents that maximize relevance while minimizing redundancy. Expected: 20-30 lines. (See Chapter 8.)
Exercise 7: Implement a streaming feature counter. Given a stream of (doc_id, event_type) events, maintain a sliding-window count of events per document over the last hour. Expected: 25-35 lines. (See Chapter 9.)

Coding Exercise Solutions: Key Patterns

For each coding exercise, here are the patterns interviewers want to see:

python
# EXERCISE 5: Hybrid retrieval fusion (Reciprocal Rank Fusion)
# Given two ranked lists, combine them into one using RRF
def reciprocal_rank_fusion(rankings: list[list[int]], k=60) -> list[int]:
    """Combine multiple ranked lists using Reciprocal Rank Fusion.
    Each ranking is a list of document IDs in rank order.
    RRF score = sum(1 / (k + rank_i)) across all rankings.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: combine BM25 and dense retrieval results
bm25_results = [42, 17, 89, 3, 55]     # top-5 from BM25
dense_results = [89, 42, 7, 31, 17]    # top-5 from dense
hybrid = reciprocal_rank_fusion([bm25_results, dense_results])
# Result: [42, 89, 17, ...] — docs appearing in both lists rank highest
RRF vs. score normalization. The alternative to RRF is normalizing BM25 and dense scores to the same scale, then taking a weighted sum. RRF is preferred in practice because it is score-agnostic (works even when BM25 scores are on a completely different scale than cosine similarities), parameter-free (k=60 works well universally), and easy to extend to 3+ retrieval sources.

System Design Skeletons: Detailed Walkthroughs

Q4: "Your CEO says: add personalization to search. You have 2 engineers and 3 months."

Skeleton: (1) Week 1-2: instrument click logging. Every search result click gets logged with (user_id, query, doc_id, position, timestamp, dwell_time). (2) Week 3-4: build a simple user profile. For each user, maintain the last 50 clicked documents' embeddings. The user profile is the mean of these embeddings. (3) Week 5-8: modify the result assembly layer. For each candidate document, compute cosine similarity between the document embedding and the user profile embedding. Add this as a feature to the LTR model. Retrain LambdaMART with this new feature. (4) Week 9-10: A/B test. (5) Week 11-12: iterate based on results. This is the minimum viable personalization — it adds ~3-5% NDCG improvement with minimal infrastructure investment.

Q5: "How would you build a vertical search engine (e.g., for recipes, jobs, or code) on top of a general web index?"

Skeleton: (1) Query classification: train a text classifier to detect vertical intent ("chicken parmesan recipe" → recipes vertical). (2) Vertical-specific indexing: extract structured fields (ingredients, cook time, rating) during crawl. Store as additional columns in the feature store. (3) Vertical-specific ranking: train a separate LTR model for the vertical, with vertical-specific features (e.g., recipe rating, ingredient match). (4) Vertical-specific UX: when vertical is triggered, show a rich result card (recipe card with image, time, rating) instead of standard blue links. (5) Fallback: if confidence in vertical classification is low, show standard results with the vertical card as a secondary result.

Debugging Scenarios

Scenario 1: "NDCG dropped 2% after a model update." Walk through: (1) Compare feature distributions between old and new training data. (2) Check for label noise (mislabeled training examples from click model changes). (3) Inspect specific query segments where quality dropped. (4) Check for data leakage if you added new features. (5) Roll back and A/B test with the old model to confirm the regression.
Scenario 2: "ANN latency spiked 3x after adding 500M documents." Walk through: (1) Check if IVF centroids need retraining. (2) Check memory — is the index being swapped to disk? (3) Check if nprobe needs adjustment for the larger index. (4) Consider adding more shards to keep per-shard size constant. (5) Implement pre-warming for new index shards.
Scenario 3: "Users complain results are 'all the same.'" Walk through: (1) Measure intra-list diversity (Jaccard similarity between top-10 results). (2) Check if the model is over-optimizing for CTR (leading to popularity bias). (3) Add MMR (Maximal Marginal Relevance) to the result assembly stage. (4) Add domain diversity constraints. (5) A/B test diversity intervention against pure relevance ranking.
Scenario 4: "The model works great on English but terrible on Spanish." Walk through: (1) Check training data distribution — how many Spanish (query, doc) pairs? If <5% of training data, the model undertrained on Spanish. (2) Check the tokenizer — is it multilingual? A BERT-base tokenizer trained on English will fragment Spanish words into many subwords, losing semantic information. (3) Switch to a multilingual base model (mBERT, XLM-R). (4) Collect Spanish-specific training data using synthetic data generation (translate English training data, or use InPars with Spanish documents). (5) Add language-specific eval sets to catch regressions early.

Rapid-Fire Interview Questions

These are the quick-fire "warm-up" questions interviewers use in the first 5 minutes. Have a crisp 2-3 sentence answer ready for each:

QuestionAnswer skeleton
"What is the difference between precision and recall?"Precision = fraction of retrieved docs that are relevant. Recall = fraction of relevant docs that are retrieved. You trade off between them: more aggressive retrieval increases recall but hurts precision.
"Why use cosine similarity instead of dot product?"Cosine normalizes by vector magnitude, so it measures angle between vectors (semantic direction) regardless of scale. Dot product rewards both alignment AND magnitude. Use cosine when embeddings are L2-normalized (they are equivalent); use dot product when magnitude carries information (e.g., document importance).
"What is the curse of dimensionality?"In high dimensions, all points become equidistant. With 768-d embeddings, the ratio of nearest-to-farthest neighbor distance approaches 1.0. This makes exact nearest neighbor search useless — you need approximate methods that exploit local structure.
"Why not just use the LLM directly for retrieval?"Cost and latency. A BERT forward pass takes 5ms; a GPT-4 call takes 500ms and costs 100x more. At 50K QPS, you cannot afford an LLM call per query. Use cheap models (BM25, bi-encoder) for retrieval, expensive models (cross-encoder, LLM) only for reranking the top candidates.
"Explain the bias-variance tradeoff in ranking."A high-bias model (BM25) consistently underperforms because it cannot capture semantics, but it is stable across datasets. A high-variance model (fine-tuned cross-encoder) can overfit to training data distribution and degrade on new query patterns. The production solution: ensemble multiple models (BM25 + dense + LTR) to reduce variance while capturing complex patterns.

Mock Interview Transcript

Here is how a staff-level interview exchange should sound. Practice delivering this level of specificity:

Interviewer: "Your reranking model improved NDCG@10 by 3% offline, but the A/B test showed no significant change in CTR after 2 weeks. What do you do?"

Strong answer: "First, I would check the A/B test setup itself — was traffic split correctly? Is the sample size sufficient for the expected effect size? A 3% NDCG improvement might translate to only 0.2% CTR improvement, which needs more traffic to detect. Second, I would segment the results by query type. Maybe the improvement is concentrated on informational queries where users do not click (they read the snippet) while transactional queries (where users click) saw no improvement. Third, I would check for position bias in the NDCG evaluation — if the offline eval uses biased click labels, the 3% might be an artifact. Fourth, I would look at secondary metrics: did dwell time improve? Did reformulation rate decrease? These might show improvement even when CTR does not. Finally, I would consider running an interleaving experiment, which is 10-100x more sensitive than A/B testing for detecting ranking quality differences."

Interviewer: "How would you handle the cold start problem for a new user who has never searched before?"

Strong answer: "Three strategies, deployed in layers. (1) Population-level priors: for a brand new user, use the most popular results for each query, weighted by the user's geographic region and device type (the only signals we have). (2) Quick profile bootstrap: after just 3-5 clicks, compute a rudimentary user embedding by averaging the embeddings of clicked documents. This immediately enables basic personalization. (3) Explicit preference collection: on first visit, show a 'pick 5 topics that interest you' widget. This provides direct signal without requiring any click history. The key insight is that cold start is a spectrum, not a binary — you get progressively better with each interaction, and your system should degrade gracefully to population-level baselines, never to random."

Key Papers to Know

PaperYearKey contribution
DPR (Karpukhin et al.)2020Dense Passage Retrieval with bi-encoders + hard negatives
ColBERT (Khattab & Zaharia)2020Late interaction with per-token embeddings
ANCE (Xiong et al.)2021Approximate nearest neighbor negative mining for training
SPLADE (Formal et al.)2021Learned sparse representations compatible with inverted indexes
DSI (Tay et al.)2022Differentiable Search Index — generative retrieval
InPars (Bonifacio et al.)2022LLM-generated synthetic queries for training
Matryoshka (Kusupati et al.)2022Multi-resolution embeddings from a single model
GritLM (Muennighoff et al.)2024Unified generative + embedding model
E5-Mistral (Wang et al.)2024LLM-based embeddings that rival specialized models
LLM2Vec (BehnamGhader et al.)2024Convert any decoder LLM into a strong embedding model

What to Study in Each Paper

Interviewers do not expect you to have memorized every paper. But for each, know:

DPR: Why in-batch negatives matter (cheap 1024 negatives from a batch of 1024). Why hard negatives from BM25 top-100 boost quality 10-15%. The architecture (two BERT-base encoders, one for query, one for passage).

ColBERT: The MaxSim operation and why it captures token-level interactions. The storage cost (one vector per token). ColBERTv2's residual compression that makes it practical.

SPLADE: How it bridges lexical and neural retrieval. The key idea (predict term weights for ALL vocabulary tokens, not just those present in the document). Compatibility with inverted indexes.

Matryoshka: The training trick (apply loss at multiple truncation points simultaneously). Why it enables adaptive-precision search (coarse 32-d retrieval → fine 768-d reranking from the same model).

GritLM: The unification thesis (one model for both embedding and generation). Why this matters for the convergence of search and GenAI. Current limitations (not yet competitive with specialized models at billion-scale retrieval).

Final Advice for the Interview

Open with the architecture. When asked any system design question about search, start by drawing the 5-stage pipeline on the whiteboard (query understanding → retrieval → features → reranking → assembly). Then drill into whichever stage the interviewer asks about. This shows you have the full picture and can zoom in anywhere.

Always quantify. "It's fast" means nothing. "BM25 over 64 shards returns top-1000 in 12ms p99" shows you have built real systems. Memorize the key numbers from the table in Chapter 0.

Discuss tradeoffs, not solutions. Staff-level engineers are expected to navigate tradeoffs, not just implement solutions. "Cross-encoder is the best reranker" is junior-level. "Cross-encoder gives the best NDCG but costs 40ms per query for 100 candidates; ColBERT gives 90% of the quality at 40% of the latency; LambdaMART is near-free but requires manual feature engineering — I would start with LambdaMART and add cross-encoder only for high-value verticals" is staff-level.

Connect to the product. Technical excellence is necessary but not sufficient at the staff level. Always connect your technical decisions to user outcomes: "This 2% NDCG improvement translates to roughly 500K fewer reformulation events per day, meaning half a million users per day find what they need on the first try instead of having to rephrase their query."

Show the debugging mindset. When presented with a problem, do not jump to solutions. First, gather data: "What changed recently? What do the metrics show? Can we segment by query type, user type, device, geography?" Then form hypotheses. Then test them. This systematic approach is what separates staff engineers from senior engineers.

Interview Prep Radar

Self-assess your readiness across the five interview dimensions. Click a dimension to review the relevant chapters.

CONCEPT 3
DESIGN 3
CODE 3
DEBUG 3
FRONTIER 3
An interviewer says: "Convince me that search and recommendations should share one embedding model." What is the strongest argument?
Closing thought.

Search and recommendations are the most impactful ML systems in the world. Google Search serves 8.5 billion queries per day. YouTube's recommendation system accounts for 70% of watch time. Amazon's "customers also bought" drives 35% of revenue. The models are not the most sophisticated in ML (no diffusion, no RLHF, no chain-of-thought reasoning). But they operate at a scale and latency constraint that makes them arguably the hardest engineering problem in AI. You are not just building a model — you are building the infrastructure that connects billions of people with the information and products they need. That is a profoundly meaningful mission, and the interview is your chance to show you can execute it at the highest level.

"We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run." — Roy Amara

This lesson covers the full scope of a staff-level web-scale search and recommendations role. The concepts, architectures, code, debugging strategies, and frontier research presented here represent the state of the art as of 2025. For the latest developments, follow SIGIR, WWW, RecSys, NeurIPS IR workshops, and the research blogs of Google, Meta, and Microsoft.

Related Gleams lessons: Transformers for attention mechanics, CLIP / Contrastive Learning for embedding training, Agentic Engineer for agent-based search systems.