Staff-level interview prep: information retrieval, dense embeddings, transformer ranking, recommendation systems, distributed training, and serving a web-scale index.
A user types "best noise-canceling headphones under $200 for commuting" into a search bar. In under 200 milliseconds, your system must understand the intent (product discovery, not troubleshooting), match it against a web index of 50 billion documents, rank the results by relevance and personalization, inject sponsored results without destroying trust, and return a page that feels magically tailored. If any stage takes too long, the user hits "back" and you lose them forever.
This is not keyword matching. This is a real-time intelligence pipeline that fuses information retrieval, recommendation systems, and transformer-based understanding into a single sub-200ms response. And you are the research scientist who designs, trains, and serves the models that make it work at web scale.
It is 9:00 AM. You badge into the office at Parallel, a $2B web infrastructure company building AI agents with programmatic web access. On your first monitor, the nightly training job for the cross-encoder reranker finished — but the NDCG@10 on the held-out test set dropped 0.8% compared to the previous checkpoint. On your second monitor, the serving team reports that p99 latency for ANN retrieval spiked from 12ms to 45ms after yesterday's index update added 200M new documents. On your third monitor, a Slack thread where the recommendations team is proposing to unify the search embedding model with the recommendation embedding model — one model to rule both explicit queries and implicit browsing signals.
Before lunch, you will diagnose the NDCG regression (a data pipeline bug duplicated 3% of training pairs, biasing the model toward popular documents), implement a quantization experiment to shrink the ANN index by 4x without exceeding the latency budget, and draft a design doc for the unified embedding model (arguing that search and recs should share the first 8 transformer layers but have separate projection heads for their different similarity metrics).
This is the daily reality of a Research Scientist — Web-Scale Search & Recommendations. You sit at the intersection of three domains that are rapidly converging:
| Domain | What it owns | Your daily intersection |
|---|---|---|
| Information Retrieval | Query understanding, document indexing, BM25, inverted indexes | You design the scoring functions that decide which documents are relevant |
| Recommendation Systems | Collaborative filtering, content-based, user modeling, cold start | You build the neural recommenders that personalize results |
| Transformer Models | BERT ranking, dense retrieval, cross-encoders, generative retrieval | You train and serve the models that understand both queries and documents |
| Systems Engineering | Distributed training, ANN indexing, quantization, caching, latency budgets | You make billion-parameter models serve 50K QPS at p99 < 200ms |
| Data Infrastructure | Crawling, deduplication, feature stores, click logging, ETL | You build the data pipelines that feed everything above |
Staff-level interviewers expect you to have these numbers at your fingertips. Not because precision matters, but because they reveal whether you have actually built systems at this scale or are just talking theoretically.
| Metric | Typical value | Why it matters |
|---|---|---|
| Web index size | 50-100 billion documents | Determines sharding strategy |
| Query volume | 50K-500K QPS (peak) | Determines fleet size |
| Latency budget | 200ms total, p99 | Every architecture decision optimizes for this |
| BM25 retrieval | 10-15ms for top-1000 | Bottlenecked by slowest shard in scatter-gather |
| ANN retrieval | 5-15ms for top-1000 | Depends on index type (HNSW ~5ms, IVF ~15ms) |
| Cross-encoder | 0.5ms per (q,d) pair | 100 candidates = 50ms with batching on GPU |
| Embedding dimension | 128-768 | Higher = more accurate, more storage |
| NDCG@10 improvement bar | +0.5-1% for a big launch | Tiny-sounding but impacts millions of queries |
| Click-through rate | 30-50% for position 1 | Falls off exponentially by position |
| A/B test duration | 7-14 days minimum | Captures day-of-week and novelty effects |
The diagram below traces a single query from arrival to results page. Every box is a system you own or co-own.
Watch a query flow through the full pipeline. Latency counters show where time is spent. Click Inject Latency to simulate an ANN index spike.
Staff-level interviews at companies building web-scale search and recommendations test you across five dimensions. Each chapter maps to one or more:
| Dimension | What they ask | Chapters |
|---|---|---|
| CONCEPT | "Explain BM25 from first principles. Why does IDF matter?" | 1, 2, 3, 4, 7 |
| DESIGN | "Design a search system that handles 50K QPS with <200ms p99" | 0, 5, 6, 9, 11 |
| CODE | "Implement a two-tower model for candidate retrieval" | 1, 2, 4, 7, 12 |
| DEBUG | "NDCG dropped 2% after a model update. Walk me through your investigation." | 5, 8, 9, 11 |
| FRONTIER | "How will generative retrieval change search architecture?" | 3, 10, 12 |
To make this concrete, here is what a week looks like in this role:
Monday: Review weekend training runs. The new bi-encoder checkpoint trained on 2B additional pairs shows +1.2% NDCG on head queries but -0.3% on tail queries. Investigate: the new training data is biased toward popular queries (head). Solution: rebalance sampling to 60% tail / 40% head in next training run.
Tuesday: Design review for the unified embedding model. Present your architecture (shared BERT backbone, separate projection heads for search and recommendations). Debate with the recommendations team: they want to add user history features into the encoder, which would break the bi-encoder assumption (documents cannot be pre-encoded if the encoding depends on the user). Compromise: keep the document encoder user-independent, add user signals in a post-hoc fusion layer.
Wednesday: Production incident. ANN p99 spiked from 12ms to 60ms after the nightly index rebuild added 200M new vectors. Root cause: the IVF centroids are stale (trained on the old distribution). Quick fix: increase nprobe from 32 to 48 (trades latency headroom for recall). Long-term fix: schedule centroid retraining every 2 weeks instead of monthly.
Thursday: Write code for a new negative mining strategy. Instead of BM25 hard negatives, you are mining from the model's own ANN index (ANCE-style). Build the pipeline: (1) encode all training queries with current model, (2) search ANN index for each, (3) filter out positives, (4) take top-10 as hard negatives. Run initial experiments.
Friday: Analyze the 2-week A/B test results for the new cross-encoder reranker. +1.8% NDCG@10, +0.4% CTR, -0.6% reformulation rate. All statistically significant (p < 0.01). Write the launch decision document with production rollout plan.
Before neural networks, before embeddings, before transformers — there was a surprisingly elegant mathematical framework for finding relevant documents. It has survived 60 years because it works. Every modern search system still uses it as a first-pass retrieval stage, and understanding it deeply is the foundation for everything that follows.
The core problem: you have a query (a few words) and a corpus (billions of documents). You need to score every document against the query and return the top-k most relevant. You have milliseconds, not seconds. The insight that makes this tractable is the inverted index — instead of scanning every document, you build a lookup table that maps each word to the list of documents containing it.
Imagine you search for "transformer architecture attention." Which documents are most relevant? A naive approach would count how many times each query word appears in each document. But a document about "attention deficit disorder" that mentions "attention" 50 times is not what you want. We need two insights:
Term Frequency (TF) measures how important a word is within a single document. If "transformer" appears 10 times in a paper about transformer architectures, that is meaningful. But raw counts are misleading — a 10,000-word document naturally has more occurrences than a 100-word abstract. So we normalize:
where count(t, d) is how many times term t appears in document d, and |d| is the total number of terms in d.
Inverse Document Frequency (IDF) measures how rare a word is across the entire corpus. The word "the" appears in nearly every document — it carries almost zero information about relevance. The word "transformer" appears in far fewer documents — it is highly discriminative. IDF captures this:
where N is total documents in the corpus and DF(t) is how many documents contain term t. The log compresses the range — a term appearing in 1 of 1,000,000 documents gets IDF ≈ 14, while a term in 100,000 of 1,000,000 gets IDF ≈ 2.3.
The final TF-IDF score for a query-document pair is:
TF-IDF has a problem: if "transformer" appears 100 times vs. 10 times, the raw TF score is 10x higher. But is the document really 10x more relevant? Probably not. BM25 (Best Match 25) fixes this with saturation — after enough occurrences, additional ones contribute diminishing returns:
Two parameters control behavior: k1 (typically 1.2) controls saturation speed — higher means more occurrences still matter. b (typically 0.75) controls document length normalization — 0 means ignore length, 1 means fully penalize long documents. The term avgdl is the average document length in the corpus.
Even with BM25, scoring every document in a 50-billion-document corpus is impossible in real time. The inverted index avoids this entirely. Instead of asking "which words does document X contain?" (a forward index), we ask "which documents contain word Y?" (an inverted index).
For a query with 3 terms, you look up 3 posting lists and intersect them. Instead of scanning 50 billion documents, you scan perhaps 50 million postings. Combined with skip pointers and block-max WAND (Weighted AND) algorithms, you can find the top-1000 BM25 results from a web-scale index in under 10ms.
The inverted index is also the key to index updates. When a new document is crawled, you compute its terms and append to the relevant posting lists. When a document is deleted, you add it to a deletion bitmap. Periodically, you compact the index (merge segments, remove deleted entries). This is exactly how Elasticsearch/Lucene works under the hood: immutable index segments + a merge policy.
At web scale, the inverted index for 50B documents consumes approximately 5-20 TB of storage (depending on compression and the number of indexed fields). This is sharded across 32-128 machines, with each shard holding a contiguous range of document IDs. The coordinator routes queries to all shards in parallel and merges their top-k results using a priority queue.
python import math from collections import Counter class BM25: def __init__(self, docs, k1=1.2, b=0.75): self.k1, self.b = k1, b self.N = len(docs) self.avgdl = sum(len(d) for d in docs) / self.N self.tf = [Counter(d) for d in docs] # term freq per doc self.dl = [len(d) for d in docs] # doc lengths self.df = Counter() # doc freq per term for d in docs: for t in set(d): self.df[t] += 1 def idf(self, term): df = self.df.get(term, 0) return math.log((1 + self.N - df + 0.5) / (df + 0.5)) def score(self, query, doc_idx): s = 0.0 dl = self.dl[doc_idx] for t in query: tf = self.tf[doc_idx].get(t, 0) num = tf * (self.k1 + 1) denom = tf + self.k1 * (1 - self.b + self.b * dl / self.avgdl) s += self.idf(t) * num / denom return s def rank(self, query, top_k=10): scores = [(i, self.score(query, i)) for i in range(self.N)] return sorted(scores, key=lambda x: -x[1])[:top_k]
Symptom: User searches "ML model deployment best practices" but top results are all about "model trains" and "deployment of troops." Root cause: BM25 is purely lexical — it cannot understand that "ML" contextualizes "model" to mean machine learning. Fix: Add query expansion (expand "ML" to "machine learning"), or use a learned query classifier to restrict to a vertical. Long-term fix: dense retrieval (Chapter 2).
Symptom: Extremely long documents always rank first. Root cause: b parameter too low (under 0.5), so length normalization is weak. Long documents accumulate more term matches. Fix: Increase b toward 0.75-0.85. Or: normalize by passage, not document.
At web scale (50B docs), a single inverted index does not fit in memory. You shard it: partition documents across N machines (by document ID hash or by term range). A query hits all shards in parallel, each returns its local top-k, and a coordinator merges the results. This is called scatter-gather. The latency is dominated by the slowest shard (tail latency), so you over-replicate hot shards and use hedged requests (send the query to 2 replicas, take whichever responds first).
Let us compute BM25 by hand for a tiny corpus. This is exactly the kind of calculation interviewers ask you to walk through on a whiteboard.
Corpus (3 documents):
| Doc | Text | Length |
|---|---|---|
| D1 | "transformer attention mechanism" | 3 |
| D2 | "attention deficit disorder treatment attention" | 5 |
| D3 | "transformer architecture design transformer" | 4 |
Query: "transformer attention"
Step 1: Compute IDF (N = 3 documents):
Wait — negative IDF? This happens with the Robertson-Sparck Jones IDF when a term appears in more than half the documents. BM25 implementations either clip to zero or use the corrected formula: IDF = log(1 + (N - DF + 0.5) / (DF + 0.5)). With the corrected formula:
Step 2: Compute BM25 per document (k1=1.2, b=0.75, avgdl=4.0):
Ranking: D1 (1.046) > D3 (0.646) > D2 (0.613). D1 wins because it contains both query terms. This is the power of multi-term queries: matching all terms is far more valuable than matching one term many times.
Even with an inverted index, naively scoring every document in the posting lists is too slow for web-scale queries. Block-Max WAND prunes the search: for each block of postings, pre-compute the maximum possible BM25 contribution. If the maximum contribution from remaining terms cannot exceed the current k-th highest score, skip the entire block. This reduces the number of full score computations by 10-100x.
The algorithm maintains a threshold (the score of the current k-th result). For each candidate document, it computes an upper bound on the BM25 score using block-level maxima. Only if the upper bound exceeds the threshold does it compute the exact score. In practice, this means a query over a 10B-document index touches ~0.001% of the postings.
Despite the rise of neural retrieval, BM25 remains in production at Google, Bing, and every major search engine as the first-stage retriever. Recent work shows that hybrid retrieval (BM25 + dense) consistently outperforms either alone. The SPLADE family of models (2024) learns sparse representations that are compatible with inverted indexes but capture semantic similarity — bridging the gap between lexical and neural retrieval. Anserini + Pyserini provide research-grade BM25 implementations used in TREC competitions.
SPLADE deserves special attention. It uses a BERT encoder to predict term weights for every token in the vocabulary — including tokens NOT present in the original document. This means the document "repairing a dripping tap" gets a non-zero weight for "faucet" and "plumbing" even though those words do not appear. The resulting sparse representation is stored in a standard inverted index, so you get semantic matching with BM25 efficiency. SPLADE++ (2024) achieves within 1-2% of dense retrieval on MS MARCO while being 10x faster to serve.
Adjust k1 (saturation) and b (length normalization) to see how BM25 scores change. The curve shows score vs. term frequency for a fixed IDF.
BM25 finds documents that share words with your query. But what about synonyms? What about paraphrases? If you search "how to fix a leaky faucet" and a document says "repairing a dripping tap," BM25 scores it at zero — no shared terms. Dense retrieval solves this by representing both queries and documents as vectors in a shared embedding space, where semantic similarity corresponds to geometric proximity.
A bi-encoder uses two separate encoder networks (often sharing weights) to independently embed the query and the document into fixed-size vectors. At retrieval time, you compute the dot product or cosine similarity between the query vector and every document vector. Because document vectors are pre-computed and stored in an index, retrieval is just a nearest-neighbor search — no neural network inference per document at query time.
The key insight: the query encoder and document encoder are trained jointly but run independently. You encode all 50 billion documents offline (takes days on a GPU cluster), store the vectors in a vector database, and at query time you only run the query encoder (one forward pass, ~5ms) followed by approximate nearest neighbor search (~10ms).
A cross-encoder takes the query and document together as a single input, separated by a [SEP] token. This allows full attention between query tokens and document tokens — "leaky" in the query can directly attend to "dripping" in the document. Cross-encoders are dramatically more accurate than bi-encoders but catastrophically slower: you must run the model once per (query, document) pair. At 100ms per pair, scoring 1000 candidates takes 100 seconds. Unusable for first-stage retrieval, perfect for reranking the top-100.
The MLP is typically a single linear layer that maps the 768-d [CLS] hidden state to a scalar relevance score. Some implementations use a two-layer MLP with ReLU activation. The model is fine-tuned end-to-end on (query, document, relevance_label) triples, where labels come from human judgments or debiased click data.
Cross-encoders achieve 5-15% higher NDCG than bi-encoders on benchmarks like MS MARCO and BEIR. The accuracy gap is largest on queries requiring deep semantic understanding ("What president was born in the same state as the inventor of the telephone?" requires connecting multiple facts). It is smallest on keyword-heavy queries where lexical matching suffices.
How do you train a bi-encoder to put semantically similar pairs close together and dissimilar pairs far apart? Contrastive learning. Given a batch of (query, positive document) pairs, you treat every other document in the batch as a negative. The loss pushes positive pairs together and negative pairs apart:
This is the InfoNCE loss (also called NT-Xent). The temperature τ controls how sharp the distribution is — lower τ makes the model more discriminative but harder to train. The batch size matters enormously: with a batch of 1024, each query has 1023 negatives, giving a much richer gradient signal than a batch of 32.
Why does temperature matter so much? At τ = 1.0, the softmax is "smooth" — many negatives contribute to the gradient, but the signal per negative is weak. At τ = 0.01, the softmax is "sharp" — only the hardest negatives contribute, giving a strong but potentially noisy gradient. In practice, τ = 0.05-0.1 works best for search embeddings. Lower values (τ = 0.01-0.03) work better for recommendation embeddings where the positive pairs are less semantically distinct.
Research shows that doubling the batch size from 128 to 256 improves retrieval recall by 2-4%, and going from 256 to 1024 adds another 3-5%. Beyond 2048, returns diminish. This is why large-batch contrastive training requires multi-GPU setups even for modest-sized encoder models.
The connection to metric learning is deep: InfoNCE with τ → 0 converges to triplet loss with the hardest negative. But unlike triplet loss (which only considers one negative per step), InfoNCE considers all negatives simultaneously, giving a much richer gradient. This is why InfoNCE replaced triplet loss as the standard for embedding training starting around 2019.
With 50 billion document vectors (768 dimensions each), exact nearest neighbor search is impossible — it requires 50B dot products per query. Approximate Nearest Neighbor (ANN) algorithms trade a small accuracy loss for massive speedup. The three dominant approaches:
| Method | How it works | Speed | Recall@100 |
|---|---|---|---|
| IVF (Inverted File) | Cluster vectors into C centroids. At query time, probe only the P nearest clusters | 10-50ms | 95-99% |
| HNSW (Hierarchical NSW) | Build a navigable small-world graph. Greedy search from top layer down | 1-10ms | 98-99.5% |
| PQ (Product Quantization) | Compress 768-dim vectors to 32-64 bytes. Approximate distances in compressed space | 5-20ms | 90-97% |
FAISS (Facebook AI Similarity Search) combines these: IVF for coarse partitioning, PQ for compression, and HNSW for the coarse quantizer. The compound index IVF4096,PQ64 means: 4096 clusters, each vector compressed to 64 bytes. This can search 1 billion vectors in under 5ms on a single machine.
python import torch import torch.nn.functional as F class BiEncoder(torch.nn.Module): def __init__(self, encoder, dim=768): super().__init__() self.encoder = encoder # e.g. BERT-base self.proj = torch.nn.Linear(dim, 128) # compress to 128-d def encode(self, input_ids, attention_mask): out = self.encoder(input_ids, attention_mask) cls = out.last_hidden_state[:, 0] # [CLS] token return F.normalize(self.proj(cls), dim=-1) def forward(self, q_ids, q_mask, d_ids, d_mask, tau=0.05): q_emb = self.encode(q_ids, q_mask) # (B, 128) d_emb = self.encode(d_ids, d_mask) # (B, 128) sim = q_emb @ d_emb.T / tau # (B, B) labels = torch.arange(len(q_emb), device=sim.device) return F.cross_entropy(sim, labels) # InfoNCE
Symptom: Dense retrieval misses exact keyword matches that BM25 catches. User searches for error code "ERR_SSL_PROTOCOL_ERROR" and dense retrieval returns generic SSL articles, not the specific error page. Root cause: The embedding model was trained on natural language, not error codes. It maps all SSL-related content to a similar region of embedding space. Fix: Hybrid retrieval — take the union of BM25 top-100 (catches exact matches) and dense top-100 (catches semantic matches). This is why every production system uses both.
Symptom: ANN recall drops from 99% to 85% after adding 500M new vectors. Root cause: The IVF centroids were trained on the old distribution. New documents cluster differently. Fix: Retrain the IVF centroids periodically, or use HNSW which does not require retraining (but uses more memory).
Symptom: The bi-encoder retrieves the right documents for English queries but fails for multilingual queries. Root cause: The model was trained primarily on English (query, document) pairs. The embedding space has a large "English cluster" and small, poorly calibrated clusters for other languages. Fix: Fine-tune on multilingual training data using mBERT or XLM-R as the base encoder. Use cross-lingual distillation: train a multilingual student model to match the embeddings of a strong English teacher model, with parallel sentence pairs providing the alignment signal.
When you combine BM25 and dense retrieval results, the fusion strategy matters. Three approaches, ranked by complexity:
| Strategy | How it works | Pros | Cons |
|---|---|---|---|
| Union + rerank | Take union of top-K from each, rerank with cross-encoder | Simple, reranker fixes errors | Wastes reranker budget on duplicates |
| Reciprocal Rank Fusion (RRF) | Score each doc by 1/(k+rank) summed across systems | Score-agnostic, no tuning needed | Ignores magnitude of score differences |
| Learned fusion (CombMNZ) | Normalize scores to [0,1], weighted sum with learned weights | Optimal given enough training data | Requires tuning, fragile to score drift |
In practice, RRF (with k=60) is the default choice because it is robust, requires no calibration, and performs within 1-2% NDCG of learned fusion. The key insight: RRF does not care about the absolute scores from each system (BM25 scores might be 5-15, cosine similarities 0.3-0.9); it only cares about the ranking order. This makes it immune to score miscalibration.
For a 10-billion-vector index at 128 dimensions with PQ64 compression: each vector takes 64 bytes. Total: 640 GB. You shard across 16 machines (40 GB each). Each machine loads its shard into memory. A query hits all 16 in parallel; each returns its local top-100; the coordinator merges to global top-100 in ~1ms. Add 2x replication for fault tolerance and tail latency hedging. Total fleet: 32 machines. This is the scale architecture question interviewers love.
Let us trace the full data flow from text to ANN result, with concrete numbers. This is what interviewers mean when they ask "walk me through dense retrieval end to end."
Memory math for the index: 1 billion documents at 128 dimensions, FP32 = 1B × 128 × 4 bytes = 512 GB. With PQ64 compression: 1B × 64 bytes = 64 GB. That fits on a single machine with 128 GB RAM, or 2 shards with comfortable headroom.
Query latency math: Encode query with BERT-base (~5ms on GPU). HNSW search over 1B vectors (~3ms with ef_search=64). Total: ~8ms. Compare to BM25: ~12ms. Dense is actually faster per-query, but the index build is enormously more expensive (1B BERT forward passes vs. tokenization + posting list construction).
The quality of your bi-encoder hinges on the quality of your negatives during training. Here is the hierarchy, from weakest to strongest:
| Strategy | Source | NDCG impact | Compute cost |
|---|---|---|---|
| Random negatives | Random documents from corpus | Baseline | Free |
| In-batch negatives | Other (query, doc) pairs in same batch | +5-8% | Free (already computed) |
| BM25 hard negatives | Top-100 BM25 results that are not labeled relevant | +8-15% | One BM25 run per query |
| Model hard negatives (ANCE) | Top-100 from the model's own ANN index, refreshed periodically | +12-20% | Rebuild ANN index every N steps |
| Cross-encoder distillation | Use cross-encoder scores to mine hardest negatives | +15-25% | Cross-encoder inference on candidates |
The ANCE (Approximate Nearest Neighbor Negative Contrastive Estimation) approach is particularly clever: periodically rebuild the ANN index using the current model, then use the model's own nearest neighbors as hard negatives. The model is forced to distinguish documents that it currently thinks are similar but are not relevant. This creates a curriculum where negatives get harder as the model improves.
Matryoshka Representation Learning (MRL) trains a single embedding model that produces useful embeddings at multiple dimensionalities — you can truncate a 768-d embedding to 128-d, 64-d, or even 32-d with graceful quality degradation. This lets you use 32-d for cheap first-pass retrieval and 768-d for reranking, from the same model. Nomic Embed and Jina Embeddings v3 (2024) both support MRL. Multi-vector approaches like ColBERTv2 store one vector per token, enabling richer similarity computation at moderate cost.
The MRL training trick is elegant: during training, apply the InfoNCE loss not just on the full 768-d embedding, but simultaneously on truncations at 512, 256, 128, 64, and 32 dimensions. Each truncation gets its own loss term. The model learns to pack the most important information into the first dimensions, with diminishing information in later dimensions — like a progressive JPEG for embeddings.
Queries and documents are projected into 2D embedding space. Drag the query to see which documents are nearest. Toggle between cosine and dot-product similarity.
Transformers revolutionized search not by replacing the pipeline, but by upgrading every stage of it. Query understanding, document encoding, retrieval, and reranking all benefit from self-attention. But the specific architecture you choose at each stage involves hard tradeoffs between accuracy, latency, and memory. This chapter maps those tradeoffs.
The simplest and most impactful application of transformers in search: take the top-100 results from BM25 or dense retrieval, and rerank them with a BERT cross-encoder. The cross-encoder sees the full (query, document) pair, so it understands context that bi-encoders miss.
Why is this so effective? Because attention is bidirectional. When BERT processes "[CLS] best noise canceling headphones [SEP] These premium over-ear headphones feature industry-leading ANC technology [SEP]", the word "headphones" in the query attends to "headphones" in the document, but also "noise canceling" attends to "ANC" — BERT learned that these are synonyms during pretraining on billions of tokens. A bi-encoder cannot do this because query and document are encoded independently.
The cost: BERT-base has 110M parameters. A single forward pass on a (query, 512-token document) pair takes ~10ms on a V100 GPU. Reranking 100 candidates: 1 second. Reranking 1000: 10 seconds. This is why cross-encoders are strictly a reranking stage, never a first-pass retrieval stage.
ColBERT (Contextualized Late Interaction over BERT) is a beautiful compromise between bi-encoders and cross-encoders. Instead of compressing each document to a single vector, ColBERT stores one vector per token. At query time, each query token computes its maximum similarity with any document token, and these maxima are summed:
This is called MaxSim. It captures token-level interactions (like a cross-encoder) but keeps document encoding offline (like a bi-encoder). The tradeoff: you store N vectors per document instead of 1, so the index is N times larger. For a 200-token average document, that is 200x more storage. ColBERTv2 mitigates this with residual compression, reducing per-token storage from 128 bytes to ~2 bytes.
python import torch import torch.nn.functional as F from transformers import BertModel, BertTokenizer class ColBERT(torch.nn.Module): def __init__(self, model_name="bert-base-uncased", dim=128): super().__init__() self.bert = BertModel.from_pretrained(model_name) self.proj = torch.nn.Linear(768, dim) # compress per-token embs def encode(self, input_ids, attention_mask): """Returns per-token embeddings: (batch, seq_len, dim)""" out = self.bert(input_ids, attention_mask).last_hidden_state embs = self.proj(out) return F.normalize(embs, dim=-1) * attention_mask.unsqueeze(-1) def maxsim(self, q_embs, d_embs): """ColBERT's MaxSim: for each query token, find the max similarity with any document token, then sum. q_embs: (batch, q_len, dim) d_embs: (batch, d_len, dim) Returns: (batch,) scalar scores """ # (batch, q_len, d_len) — similarity matrix per pair sim = torch.bmm(q_embs, d_embs.transpose(1, 2)) # Max over document tokens for each query token max_sim = sim.max(dim=-1).values # (batch, q_len) # Sum over query tokens return max_sim.sum(dim=-1) # (batch,)
Storage math for ColBERT: A 200-token document at 128 dimensions = 200 × 128 × 4 bytes = 102,400 bytes per document. For 1 billion documents: 102 TB. This is 160x more than a single-vector bi-encoder (640 GB). ColBERTv2's residual compression reduces this to ~2 bytes per token-dimension, bringing the total to ~51 GB — comparable to a bi-encoder with PQ compression. This is why ColBERTv2, not original ColBERT, is used in production.
Standard BERT has a 512-token limit. Web pages are often 5,000-50,000 tokens. You have three options:
| Strategy | How it works | Quality | Cost |
|---|---|---|---|
| Passage splitting | Split document into 512-token passages, score each, take max | Good | Multiplies inference by N passages |
| Longformer/BigBird | Sparse attention (local + global tokens), O(n) instead of O(n²) | Good | Custom architecture, harder to fine-tune |
| Hierarchical encoding | Encode passages with BERT, then aggregate with a second transformer | Best | Two-stage inference |
In practice, passage splitting with max-passage scoring dominates production systems because it is simple, parallelizable, and uses standard BERT checkpoints without modification.
Before any retrieval happens, the query must be understood and potentially expanded. This is the most underappreciated stage of the pipeline:
Spell correction: Users misspell queries 10-15% of the time. "transphormer atention" should match documents about "transformer attention." Spell correction uses an edit-distance model trained on query logs (the most common correction for "transphormer" in your logs is "transformer").
Query expansion: Add semantically related terms. "heart attack" → "heart attack myocardial infarction cardiac arrest." This improves BM25 recall by matching documents that use medical terminology instead of colloquial terms. Modern query expansion uses a neural model (T5 or GPT-2) fine-tuned to generate relevant expansion terms.
Intent classification: Classify the query into intents: navigational ("facebook login"), informational ("how does BM25 work"), transactional ("buy sony headphones"). Each intent routes to a different ranking configuration. Navigational queries use URL-match features heavily; informational queries emphasize content quality; transactional queries boost product pages.
python from transformers import AutoModelForSequenceClassification, AutoTokenizer class CrossEncoderReranker: def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-12-v2"): self.tokenizer = AutoTokenizer.from_pretrained(model_name) self.model = AutoModelForSequenceClassification.from_pretrained(model_name) self.model.eval() def rerank(self, query: str, docs: list[str], top_k=10): pairs = [(query, doc) for doc in docs] inputs = self.tokenizer( pairs, padding=True, truncation=True, max_length=512, return_tensors="pt" ) with torch.no_grad(): scores = self.model(**inputs).logits.squeeze(-1) ranked = scores.argsort(descending=True)[:top_k] return [(docs[i], scores[i].item()) for i in ranked]
Symptom: Cross-encoder reranker assigns high scores to documents that are topically related but do not actually answer the query. User searches "when was Python 3.12 released" and the reranker favors a long Python history article over the short release announcement. Root cause: The model was fine-tuned on relevance labels that conflate "topically related" with "answers the query." Fix: Fine-tune with graded relevance labels (0=irrelevant, 1=related, 2=answers the query) and train the model to distinguish grades, not just relevant/irrelevant.
Symptom: Cross-encoder latency varies wildly (p50=30ms, p99=200ms) even with constant batch size. Root cause: Variable document lengths. A 50-token document and a 512-token document take very different amounts of time to process, and BERT's attention is O(n²). When a batch contains a mix of long and short documents, padding wastes compute, and the batch time is dominated by the longest document. Fix: Sort candidates by length and batch similar-length documents together (called "dynamic batching" or "bucketed batching"). This reduces padding waste by 30-50% and stabilizes latency.
Symptom: The cross-encoder gives suspiciously high scores to very short documents (1-2 sentences). Root cause: Short documents have less chance of containing irrelevant information, so the model learns a shortcut: short = relevant. This is a form of spurious correlation. Fix: Add document length as a feature during training (so the model can learn the true length-relevance relationship), and include negative examples that are short but irrelevant.
A 5,000-token web page cannot fit into a 512-token BERT model. Here is exactly how passage splitting works in production:
MaxP is the standard because relevant information in a document is often concentrated in one section (the introduction, or a specific paragraph). A document about "attention mechanisms in transformers" might have 18 passages about other transformer components, but the one passage about attention will score very high — and MaxP captures it.
The production architecture at scale uses 3-4 stages, each trading coverage for accuracy:
This example shows exactly why cross-encoders are more accurate and why that accuracy costs so much more:
The latency difference is stark. For 100 candidate documents:
Differentiable Search Index (DSI) replaces the entire retrieve-then-rerank pipeline with a single transformer that directly generates document identifiers given a query. Instead of embedding documents into a vector space and doing ANN search, the model memorizes the entire corpus and "retrieves" by generating a structured document ID token-by-token. Google's GENRET and Meta's SEAL (2024) demonstrate this at moderate scale (millions of documents). The open question: can this scale to billions? Current evidence says not yet, but the trajectory is clear.
GritLM (Generative Representational Instruction-Tuned Language Model, 2024) unifies embedding and generation in a single model. It can produce high-quality embeddings (competitive with specialized bi-encoders) AND generate text (competitive with instruction-tuned LLMs) from the same weights. This hints at a future where the retrieval model and the generation model are the same model, eliminating the RAG pipeline entirely.
Watch candidates get filtered through each ranking stage. Numbers show candidates remaining at each stage. Click stages to see scoring details.
Search requires a query. Recommendations do not. When you open YouTube and see a personalized feed, no explicit query was issued — the system inferred what you want from your history, demographics, and the behavior of similar users. Yet under the hood, recommendation and search share the same fundamental structure: score items against a context, return the top-k. The "context" is just different — an explicit query string vs. an implicit user profile.
The oldest and most intuitive recommendation idea: if users A and B both liked items 1, 2, and 3, and user A also liked item 4, then user B will probably like item 4 too. This is collaborative filtering (CF) — filtering items based on the collective behavior of users.
Mathematically, you have a user-item interaction matrix R (users × items). Most entries are missing (a user has interacted with a tiny fraction of items). CF fills in the missing entries. The most successful approach: matrix factorization. Decompose R ≈ U · VT, where U is (users × k) and V is (items × k). Each user and item gets a k-dimensional embedding. The predicted rating for user u on item i is:
where bu and bi are user and item biases, and μ is the global mean. This is exactly what won the Netflix Prize in 2009, and the core idea (embed users and items into a shared space, predict via inner product) remains the foundation of every modern neural recommender.
The key to understanding collaborative filtering is the sparsity problem. A typical user-item matrix has 99.9% missing entries. A user on Netflix has rated maybe 200 of 10,000 movies. The matrix factorization must generalize from the 0.1% of observed entries to predict the 99.9% of unobserved entries. This works because the latent space is low-dimensional (k=50-200), acting as a strong regularizer — the model is forced to find simple patterns (like "action lover" or "indie film buff") rather than memorizing individual entries.
The training procedure: minimize the squared error between observed ratings and predictions, with L2 regularization on the embeddings to prevent overfitting:
Optimization: SGD (iterate over observed entries, update user and item embeddings) or ALS (Alternating Least Squares — fix user embeddings and solve for items, then fix items and solve for users; this is parallelizable and popular in Spark MLlib).
Collaborative filtering has a fatal flaw: the cold-start problem. A new item with zero interactions cannot be recommended (no column in the matrix). A new user with no history gets no personalization. Content-based filtering solves this by using item features (text, images, categories) instead of interaction signals. You build an item profile from its features and a user profile from the features of items they liked.
The modern implementation: use a pretrained encoder (BERT for text, CLIP for images) to embed item content. The user profile is the mean (or attention-weighted combination) of their liked items' embeddings. Recommendation is nearest-neighbor search in this embedding space — exactly the same infrastructure as dense retrieval.
Content-based filtering has another advantage: explainability. You can tell the user "We recommended this because it is similar to [item you liked]." With collaborative filtering, you can only say "Users similar to you liked this" — less satisfying and less actionable for the user.
The hybrid approach combines both: use collaborative filtering for users with rich history (many interactions), fall back to content-based for new users or new items. The transition is gradual — as a new user accumulates more interactions, the system progressively weights collaborative filtering more heavily. A common implementation: the ranking model has features from both CF (embedding similarity) and CB (content overlap), and the LTR model learns the optimal weighting automatically.
The two-tower model (also called dual-encoder) is the dominant architecture in industrial recommendation. One tower encodes the user (history, demographics, context), the other encodes the item (features, metadata). Both produce fixed-size vectors, and the score is their inner product:
This is architecturally identical to a bi-encoder for search. The user tower = query encoder. The item tower = document encoder. The training is the same (contrastive loss with in-batch negatives). The serving is the same (pre-compute item embeddings, ANN search at query time). This is why search and recommendations are converging.
The power of this architecture: at serving time, the item tower runs once offline (pre-compute all item embeddings), and the user tower runs once per request (embed the user's current context). Recommendation becomes an ANN lookup: find the K items whose embeddings are closest to the user embedding. This scales to billions of items because ANN search is sublinear.
python import torch import torch.nn as nn import torch.nn.functional as F class TwoTower(nn.Module): def __init__(self, n_users, n_items, n_feats, dim=64): super().__init__() # User tower: embed user ID + context features self.user_emb = nn.Embedding(n_users, dim) self.user_mlp = nn.Sequential( nn.Linear(dim + n_feats, 128), nn.ReLU(), nn.Linear(128, dim) ) # Item tower: embed item ID + content features self.item_emb = nn.Embedding(n_items, dim) self.item_mlp = nn.Sequential( nn.Linear(dim + n_feats, 128), nn.ReLU(), nn.Linear(128, dim) ) def forward(self, user_ids, user_feats, item_ids, item_feats, tau=0.1): u = self.user_mlp(torch.cat([self.user_emb(user_ids), user_feats], -1)) v = self.item_mlp(torch.cat([self.item_emb(item_ids), item_feats], -1)) u = F.normalize(u, dim=-1) v = F.normalize(v, dim=-1) sim = u @ v.T / tau # (B, B) labels = torch.arange(len(u), device=sim.device) return F.cross_entropy(sim, labels) # same InfoNCE as search!
Let us trace one step of SGD for matrix factorization. This builds intuition for what the embedding vectors actually represent.
After thousands of SGD steps, the user and item embeddings converge so that uiT · vj ≈ rij for all observed ratings. The magic: unobserved ratings (the zeros) also get predicted values, giving us recommendations.
Symptom: Your recommender keeps suggesting the same 100 popular items to everyone. Long-tail items never surface. Root cause: Popular items have orders of magnitude more positive interactions, so their embeddings are better trained and closer to most user embeddings. Fix: (1) Popularity-weighted sampling during training — downsample popular items, upsample rare ones. (2) Add an explicit diversity objective during serving (Maximal Marginal Relevance, MMR). (3) Calibration: adjust scores by item popularity to decouple quality from exposure.
Symptom: New users get terrible recommendations. Root cause: Cold start. The user tower has no history to work with. Fix: Fall back to content-based recommendations for the first N interactions. Use demographic features in the user tower. Show an explicit preference quiz ("pick 5 topics you like").
At scale, recommendation serving mirrors search serving: pre-compute item embeddings, serve from ANN index, add a lightweight reranking stage for personalization and business rules. The difference: recommendation must handle real-time user signals (they just clicked on X, immediately update what we show). This requires a feature store with sub-millisecond reads and a near-real-time feature pipeline that updates user features within seconds of an interaction.
No production recommender uses a single approach. They all use hybrid architectures that combine multiple signals:
python import torch import torch.nn as nn class CrossLayer(nn.Module): """A single cross layer: captures feature interactions.""" def __init__(self, dim): super().__init__() self.W = nn.Linear(dim, dim, bias=False) self.b = nn.Parameter(torch.zeros(dim)) def forward(self, x0, x): # x0 is the original input, x is current layer output return x0 * self.W(x) + self.b + x # element-wise cross class DCNv2(nn.Module): """Deep & Cross Network V2 for ranking.""" def __init__(self, input_dim, cross_layers=3, deep_dims=[256, 128]): super().__init__() # Cross network: explicit feature interactions self.crosses = nn.ModuleList([CrossLayer(input_dim) for _ in range(cross_layers)]) # Deep network: implicit interactions via MLP layers = [] prev = input_dim for d in deep_dims: layers += [nn.Linear(prev, d), nn.ReLU(), nn.BatchNorm1d(d)] prev = d self.deep = nn.Sequential(*layers) # Combine and predict self.out = nn.Linear(input_dim + deep_dims[-1], 1) def forward(self, x): # Cross tower x_cross = x for cross in self.crosses: x_cross = cross(x, x_cross) # Deep tower x_deep = self.deep(x) # Combine combined = torch.cat([x_cross, x_deep], dim=-1) return torch.sigmoid(self.out(combined)) # P(click)
Real-world recommendation systems optimize multiple objectives simultaneously. You do not just want clicks — you want clicks that lead to satisfaction, purchases, retention, and responsible content exposure:
| Objective | Why it matters | Proxy metric |
|---|---|---|
| Engagement | Users interact with the content | P(click), P(like), P(comment) |
| Satisfaction | Users get value from what they consume | Dwell time, session length, return rate |
| Revenue | Platform generates income | P(purchase), ad click-through |
| Responsibility | Avoid harmful, misleading, or low-quality content | P(report), P(hide), quality score |
| Diversity | Prevent filter bubbles, surface new interests | Topic diversity, novelty score |
The standard approach: train separate models for each objective, then combine with a weighted scalarization:
The weights are set by the product team (not ML) and encode business priorities. During elections, w4 (responsibility) might increase 5x. During holiday shopping season, w3 (revenue) gets a boost. These weights are the lever between ML and product strategy.
LLM-based recommendations are an active frontier. Approaches like P5 and InstructRec frame recommendation as text generation: "Based on the user's history [item1, item2, item3], recommend the next item." These can handle cold start (they understand item descriptions) and transfer across domains. UniRec (2024) unifies retrieval, ranking, and generation in a single model. The tradeoff: LLM inference is 100-1000x slower than a two-tower dot product, making them impractical for first-stage retrieval but promising for reranking or explanation generation.
Sequence-based recommenders (SASRec, BERT4Rec) model the user's interaction history as a sequence and predict the next item using transformer self-attention. The user's last 50 interactions become a "sentence" and the next item is the "next word." This captures temporal dynamics (what the user is interested in RIGHT NOW, not just historically) and achieves 5-15% improvement over static two-tower models. The cost: sequence encoding at query time (cannot be pre-computed), adding ~10ms per user. Production systems use distilled 2-4 layer transformers for this.
A user-item matrix where colors indicate ratings. Watch matrix factorization decompose it into user and item embeddings. Missing entries (gray) get predicted values.
Training a BERT-base model on a single GPU takes about 4 days on Wikipedia alone. Training a search ranking model on 10 billion query-document pairs? Months on a single GPU. Web-scale AI demands distributed training — spreading the work across hundreds or thousands of GPUs. But distributing training is not just "run the same code on more machines." It introduces synchronization bottlenecks, communication overhead, memory constraints, and subtle numerical issues that can silently degrade your model.
Training a search model is fundamentally different from training a language model. Language models train on text-only data with a simple next-token prediction objective. Search models train on (query, document, label) triples with contrastive objectives, where the quality of negatives matters as much as the quality of positives, and where training data has complex biases (position bias, selection bias, popularity bias) that must be corrected.
The key constraint at web scale: your training data is so large that you cannot afford to iterate over it many times. A 10B-pair training set at 100ms/batch takes 7 days for one epoch on 128 GPUs. Training for 10 epochs would take 70 days — unacceptable for a team shipping monthly model updates. This means every training decision must be right the first time: learning rate schedule, negative mining strategy, data sampling, and early stopping criteria.
The simplest form of distributed training: every GPU has a full copy of the model. Each GPU processes a different mini-batch. After the forward-backward pass, all GPUs average their gradients (the all-reduce operation) and apply the same update. This is equivalent to training with a larger batch size.
The bottleneck: all-reduce communication. For a 110M-parameter model (BERT-base), each gradient is 440 MB (110M params × 4 bytes). With 8 GPUs, you must send 440 MB across the network after every step. The ring all-reduce algorithm distributes this: each GPU sends and receives (2 × 440 / 8) = 110 MB per step, overlapped with computation. With a fast interconnect (NVLink: 600 GB/s), the overhead is ~1ms per step. With Ethernet (25 Gbps): ~140ms. This is why GPU clusters use NVLink intra-node and InfiniBand inter-node.
What if the model does not fit on a single GPU? A 7B-parameter model in FP16 takes 14 GB — fits on an A100 (80 GB). A 70B model takes 140 GB — does not fit. Model parallelism splits the model across GPUs:
| Type | How it splits | When to use |
|---|---|---|
| Tensor parallelism (TP) | Split weight matrices within a layer across GPUs. E.g., a 4096×4096 matrix split across 4 GPUs: each holds 4096×1024 | Within a node (requires fast interconnect: all-to-all every layer) |
| Pipeline parallelism (PP) | Assign different layers to different GPUs. GPU 0 runs layers 1-6, GPU 1 runs layers 7-12, etc. | Across nodes (only sends activations between stages, less comm) |
| Sequence parallelism (SP) | Split the sequence dimension. Each GPU processes a chunk of the sequence | For very long sequences (needed with 100K+ context) |
Mixed precision uses FP16 for forward/backward passes and FP32 for weight updates. This halves memory usage and doubles throughput on GPUs with tensor cores. The key trick: keep a master copy of weights in FP32. After computing FP16 gradients, cast them to FP32, update the FP32 master weights, then cast back to FP16 for the next forward pass. This prevents the accumulation of small gradient updates that FP16 precision would round to zero. Loss scaling multiplies the loss by a large constant (e.g., 1024) before the backward pass to prevent FP16 gradient underflow, then divides after.
Let us compute the exact GPU memory required to train a BERT-base model. This calculation comes up in system design interviews when they ask "how many GPUs do you need?"
python import torch import torch.distributed as dist from torch.nn.parallel import DistributedDataParallel as DDP from torch.cuda.amp import autocast, GradScaler def train(rank, world_size): dist.init_process_group("nccl", rank=rank, world_size=world_size) torch.cuda.set_device(rank) model = BiEncoder(encoder=BertModel.from_pretrained("bert-base")) model = model.cuda(rank) model = DDP(model, device_ids=[rank]) # wrap for data parallelism optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5) scaler = GradScaler() # for mixed precision for batch in dataloader: optimizer.zero_grad() with autocast(dtype=torch.float16): # FP16 forward loss = model(**batch) scaler.scale(loss).backward() # FP16 backward + loss scaling scaler.step(optimizer) # FP32 update scaler.update()
Symptom: Loss spikes or diverges when scaling from 8 to 64 GPUs. Root cause: The effective batch size jumped 8x (from 256 to 2048). Large batches require a learning rate warmup — without it, the initial gradients are too noisy and the model overshoots. Fix: Apply the linear scaling rule: when you multiply batch size by K, multiply learning rate by K. But only after a warmup period (typically 1-10% of total steps) where the LR linearly increases from 0 to the target.
Symptom: Training loss decreases but validation metrics plateau. Root cause: With very large batch sizes (>8192), the model converges to sharper minima that generalize worse. Fix: Use LAMB (Layerwise Adaptive Moments for Batch training) optimizer, which adapts the learning rate per layer. Or: add gradient noise. Or: reduce effective batch size with gradient accumulation instead of pure data parallelism.
Symptom: FP16 training suddenly produces NaN losses after 10K steps (was fine before). Root cause: Gradient explosion in FP16 — a gradient value exceeded FP16 max (65504) and became Inf, which propagated through the network as NaN. This often happens when the learning rate is too high or when the loss landscape has sharp cliffs (common in contrastive learning). Fix: (1) Reduce learning rate. (2) Enable gradient clipping (max_norm=1.0). (3) Use BF16 instead of FP16 — BF16 has the same exponent range as FP32 (max value 3.4e38 vs FP16's 65504), preventing overflow at the cost of slightly lower precision.
Symptom: Model performance differs between single-GPU and multi-GPU training. Root cause: Batch normalization statistics computed per-GPU diverge from global statistics. With 32 samples per GPU, the batch mean/variance is noisy. Fix: Use SyncBatchNorm which computes statistics across all GPUs. Or replace BatchNorm with LayerNorm (which is batch-size independent) — this is why modern transformers use LayerNorm exclusively.
Training a search model on 10B pairs with a batch size of 4096, using 128 A100 GPUs: ~50,000 steps, ~6 hours wall-clock time. The cluster needs: (1) a fast distributed filesystem for training data (HDFS or cloud object storage with aggressive prefetching), (2) NVLink within nodes + InfiniBand between nodes, (3) checkpointing every 1000 steps to distributed storage, (4) automatic failure recovery (when a GPU dies mid-training, restart from the last checkpoint on a spare node).
Sometimes you need a large effective batch size but cannot afford more GPUs. Gradient accumulation simulates a larger batch by accumulating gradients over multiple micro-batches before taking an optimizer step:
python # Simulate batch_size=2048 on a single GPU with batch_size=32 accumulation_steps = 2048 // 32 # = 64 micro-batches optimizer.zero_grad() for i, batch in enumerate(dataloader): with autocast(dtype=torch.float16): loss = model(**batch) / accumulation_steps # normalize loss scaler.scale(loss).backward() # accumulate gradients if (i + 1) % accumulation_steps == 0: scaler.step(optimizer) # update weights scaler.update() optimizer.zero_grad() # reset for next accumulation
The trade-off: gradient accumulation gives you the same gradients as a large batch, but N times slower (you process N micro-batches sequentially instead of in parallel). It also does not help with in-batch negatives for contrastive learning — each micro-batch only has 32 negatives, not 2048. For contrastive training, you need actual data parallelism.
DeepSpeed ZeRO-3 shards optimizer states, gradients, AND parameters across GPUs, enabling training of models that are 8x larger than GPU memory allows. FSDP (Fully Sharded Data Parallel) in PyTorch provides similar functionality natively. Ring Attention (2024) enables training on sequences of arbitrary length by distributing the sequence across GPUs in a ring topology. GaLore (2024) reduces optimizer memory by projecting gradients into a low-rank subspace, enabling full-parameter fine-tuning of 7B models on a single 24GB GPU.
Cross-batch negatives (2024) solve the in-batch negative limitation: store embeddings from previous micro-batches in a queue and use them as negatives. This gives you 10,000+ negatives even with a micro-batch of 32, eliminating the need for large actual batch sizes in contrastive training. The memory overhead is trivial (a queue of embeddings) and the quality improvement is substantial (5-10% recall gain over standard in-batch negatives).
Adjust the number of GPUs and batch size to see how training throughput scales. Watch for the communication bottleneck as GPU count increases.
Training the best model in the world is worthless if you cannot serve it within a 200ms latency budget at 50,000 queries per second. Serving is where theory meets physics — memory bandwidth, network latency, cache hit rates, and tail latency dominate your engineering decisions. This chapter is about making web-scale AI actually work in production.
A search result page must load in under 200ms for the user to perceive it as "instant." Here is how that budget is typically allocated:
| Stage | Budget | What happens |
|---|---|---|
| Network roundtrip | 20-50ms | User → edge → datacenter → edge → user |
| Query parsing + expansion | 5ms | Spell correct, synonym expand, intent classify |
| BM25 retrieval | 10-15ms | Inverted index lookup across shards |
| ANN retrieval | 10-15ms | Dense embedding search across shards |
| Feature enrichment | 5ms | Fetch precomputed features from feature store |
| Neural reranking | 30-50ms | Cross-encoder on top-200 candidates (GPU batch) |
| Result assembly | 5-10ms | Blend, diversify, personalize, serialize |
| Total | 85-150ms | Leaves 50-115ms margin for p99 variance |
The critical insight: retrieval and reranking run in series, but BM25 and ANN retrieval run in parallel. If either takes too long, the whole pipeline misses the budget. This is why p99 latency (not p50) is the metric that matters — one slow shard delays everything.
Vector quantization compresses embedding vectors to reduce memory and speed up distance computation. Product Quantization (PQ) divides a 128-d vector into 16 sub-vectors of 8 dimensions each, then quantizes each sub-vector to its nearest centroid from a codebook of 256 entries. The compressed representation: 16 bytes (one byte per sub-vector). This is a 32x compression from the original 512 bytes (128 × 4 bytes for FP32).
Model quantization compresses the neural reranker. INT8 quantization reduces model size by 4x and doubles throughput on GPUs with INT8 tensor cores (A100, H100). The accuracy loss is typically <0.5% NDCG. For even more aggressive compression, GPTQ and AWQ (2024) achieve 4-bit quantization with <1% quality loss.
Let us trace Product Quantization step by step. This is a favorite interview question: "explain PQ to me like I'm implementing it."
The quality loss from PQ depends on K and the number of sub-vectors (m). Typical configurations:
| Config | Bytes/vector | Compression | Recall@100 loss |
|---|---|---|---|
| PQ16 (m=16, K=256) | 16 | 32× | 3-8% |
| PQ32 (m=32, K=256) | 32 | 16× | 1-4% |
| PQ64 (m=64, K=256) | 64 | 8× | 0.5-2% |
| OPQ64 (optimized rotation + PQ64) | 64 | 8× | 0.3-1% |
Search queries follow a power law: the top 1% of queries account for 30-40% of traffic. Caching these "head queries" eliminates repeated computation. Multi-level caching:
Result cache: Cache the final rendered result page. Hit rate: 15-25% for web search, 40-60% for e-commerce (fewer unique queries). TTL: 5-60 minutes depending on freshness requirements.
Embedding cache: Cache query embeddings. An embedding takes ~5ms to compute; a cache hit takes ~0.1ms. Hit rate: 30-50% (many queries are near-duplicates that map to similar embeddings).
Feature cache: Cache document features. A feature store lookup takes 1-2ms; a cache hit takes ~0.05ms. Hit rate: 80-95% (popular documents appear in many results).
How you partition the index across machines affects both throughput and tail latency:
Document-based sharding: Hash each document ID to a shard. Every query hits ALL shards (scatter-gather). Pros: even load distribution, simple to scale horizontally. Cons: every query has fan-out = shard count, amplifying tail latency.
Term-based sharding: Assign posting lists by term. A query for "transformer attention" hits only the shards containing those terms (2-3 shards, not 64). Pros: lower fan-out, less tail latency. Cons: hot terms ("the", "is") create load imbalance; skewed traffic patterns.
Tiered sharding: The production approach at Google/Bing. Divide the index into tiers by document quality (PageRank). Tier 0 (top 1% of documents by quality) fits on fewer machines and is always searched. Tier 1 (next 10%) is searched only if Tier 0 does not return enough high-quality results. Tier 2 (everything else) is only searched for unusual queries. This reduces average fan-out by 3-5× while maintaining coverage.
At 50K QPS with 64-shard scatter-gather, tail latency dominates the user experience. Here is the complete toolkit for taming it:
| Technique | How it works | Latency reduction | Cost |
|---|---|---|---|
| Hedged requests | Send to 2 replicas, take fastest | p99 → ~p50 | 2× read traffic |
| Request deadlines | If a shard does not respond in 30ms, skip it (return partial results) | Hard cap on tail | Occasional incomplete results |
| Canary queries | Before routing traffic, probe each shard's health with a synthetic query | Avoids known-slow shards | Small overhead per shard |
| Load-aware routing | Route to the replica with the shortest queue, not round-robin | 30-50% p99 reduction | Requires load reporting |
| Over-provisioning | Keep 20% spare capacity for handling load spikes | Prevents saturation-induced latency | 20% more machines |
In practice, all five techniques are used simultaneously. The most impactful single technique is hedged requests, which converts tail latency from a function of the slowest shard to a function of the second-slowest shard — a dramatic improvement when slowness is caused by transient issues (GC pauses, page cache misses, network congestion).
python import faiss import numpy as np def build_index(embeddings: np.ndarray, nlist=4096, m=64, nbits=8): """Build an IVF-PQ index for billion-scale search. embeddings: (N, dim) float32 array nlist: number of IVF clusters m: number of PQ sub-quantizers (compressed size = m bytes) nbits: bits per sub-quantizer (256 centroids = 8 bits) """ dim = embeddings.shape[1] # Define the compound index: IVF for partitioning + PQ for compression quantizer = faiss.IndexFlatL2(dim) # coarse quantizer index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, nbits) index.train(embeddings[:min(1_000_000, len(embeddings))]) # train on sample index.add(embeddings) # add all vectors index.nprobe = 64 # search 64 of 4096 clusters return index # ~m bytes per vector (vs 4*dim for full float32) def search(index, query_emb: np.ndarray, top_k=100): distances, indices = index.search(query_emb, top_k) return indices[0], distances[0] # top-k doc IDs and distances
Symptom: p99 latency spikes from 150ms to 800ms every hour for 5 minutes. Root cause: Garbage collection pauses on JVM-based services (Elasticsearch, Lucene). Fix: Tune GC settings (use G1GC with max pause target of 50ms), or migrate hot paths to non-GC languages (C++/Rust for the ANN index, which is what FAISS does).
Symptom: ANN latency doubles after an index rebuild. Root cause: The new index is loaded into cold memory; the OS page cache has not warmed up yet. Fix: Pre-warm the index by issuing synthetic queries before routing live traffic to the new shard. Or use a stale-while-reindex pattern: keep serving from the old index while the new one warms up.
Symptom: p50 latency is 80ms (good) but p99 is 450ms (terrible). The gap is 5.6x. Root cause: Fan-out amplification. With 64 BM25 shards, the p99 of the request is approximately the p99 of the slowest shard. If each shard has p99 = 30ms, the probability that ALL 64 shards respond within 30ms is (0.99)^64 = 0.52. So 48% of requests wait for a slow shard. Fix: Hedged requests. Send each shard request to 2 replicas; take whichever responds first. This converts each shard's distribution from p99 to approximately p50 at the cost of 2x the read traffic. Google's "The Tail at Scale" paper (Dean & Barroso, 2013) is the canonical reference.
python import asyncio from typing import Any async def hedged_request( replicas: list[str], # list of replica endpoints query: dict, timeout_ms: float = 50, # hedge after this timeout ) -> Any: """Send to primary replica. If no response in timeout_ms, send to secondary. Return whichever responds first.""" async def fetch(endpoint): return await http_client.post(endpoint, json=query) # Start primary request primary = asyncio.create_task(fetch(replicas[0])) # Wait up to timeout_ms done, pending = await asyncio.wait( [primary], timeout=timeout_ms / 1000 ) if done: return primary.result() # primary was fast enough # Primary is slow — hedge with secondary secondary = asyncio.create_task(fetch(replicas[1])) done, pending = await asyncio.wait( [primary, secondary], return_when=asyncio.FIRST_COMPLETED ) result = done.pop().result() for p in pending: p.cancel() # cancel the slower one return result
50K QPS across a 10-billion-document index. Each query hits: (1) BM25 scatter-gather across 64 shards, (2) ANN scatter-gather across 16 shards, (3) feature store batch read for 200 candidates, (4) GPU reranking on a pool of 8 A100s. Total machines: ~200 (with replication). Total cost: ~$500K/month cloud. The biggest cost driver is GPU reranking; the biggest reliability risk is tail latency on the scatter-gather fan-out.
An interviewer asks: "Estimate the infrastructure cost for a 10B-document search system at 50K QPS." Walk through it methodically:
A web index must be continuously updated as new pages are crawled and old pages change. But rebuilding the entire index (50B documents) takes hours. Three strategies for keeping the index fresh:
Batch rebuild: Rebuild the full index daily. Simple, but 24-hour staleness. Good enough for most web search (news vertical is the exception).
Incremental update: Add new documents to the existing index without full rebuild. For inverted indexes, append new postings. For HNSW vector indexes, insert new nodes (HNSW supports incremental insertion). For IVF-PQ, assign new vectors to the nearest existing centroid. Quality degrades slowly as the centroid distribution drifts; retrain centroids weekly.
Lambda architecture: Maintain two indexes: a large "batch" index (rebuilt daily) and a small "real-time" index (updated within minutes). At query time, search both and merge results. This gives sub-hour freshness with the quality of a full rebuild. The real-time index is small (<1% of documents) and fits on a single machine.
cuVS (NVIDIA's CUDA Vector Search, 2024) runs IVF-PQ and CAGRA (a GPU-native graph index) entirely on GPU, achieving 10-100x speedup over CPU FAISS for billion-scale indexes. Milvus 2.4 and Weaviate integrate GPU-accelerated ANN. The emerging pattern: keep the ANN index on GPU alongside the reranking model, eliminating CPU-GPU data transfer. This collapses the retrieval + reranking latency from 60ms to 15ms.
Scalar quantization (SQ8/SQ4) is emerging as a simpler alternative to Product Quantization. Instead of the complex PQ codebook, just round each FP32 dimension to INT8 (4x compression) or INT4 (8x compression). Quality loss is comparable to PQ at similar compression ratios, but SQ is dramatically simpler to implement and faster to encode. Combined with GPU-native indexes, SQ4 + CAGRA can search 1 billion vectors in under 2ms.
Adjust the latency budget for each pipeline stage. The bar chart shows cumulative latency vs. the 200ms deadline. Red = over budget.
So far we have scoring functions (BM25, bi-encoder similarity, cross-encoder relevance) that each produce a number for a (query, document) pair. But search results are a list, not independent scores. A result at position 1 matters far more than a result at position 10. Learning to Rank (LTR) is the discipline of directly optimizing the ordering of results, not just individual relevance scores.
Pointwise: Treat ranking as regression. Predict the relevance score for each (query, document) pair independently, then sort by predicted score. Simple, but ignores the fact that what matters is the relative ordering, not the absolute score. A model that predicts relevance 0.8 for all documents produces a random ranking — useless even though each prediction is "close."
Pairwise: For each pair of documents (di, dj) under the same query, predict which one is more relevant. The loss penalizes incorrect orderings. The most famous pairwise method is RankNet, which uses a cross-entropy loss on the probability that di should rank above dj:
where σ is the sigmoid function and yij = 1 if di should rank above dj.
Listwise: Directly optimize a list-level metric like NDCG. The challenge: NDCG involves sorting, which is not differentiable. LambdaMART solves this elegantly — it computes the gradient of the pairwise loss weighted by the NDCG gain from swapping each pair. Documents whose swap would most improve NDCG get the strongest gradient signal. LambdaMART (implemented in XGBoost/LightGBM) is the most widely deployed LTR algorithm in production search.
Normalized Discounted Cumulative Gain (NDCG) is the standard metric for ranking quality. It measures how well the ranking puts the most relevant documents at the top:
where IDCG is the DCG of the ideal ranking (perfectly sorted by relevance). The log denominator discounts positions: position 1 gets full credit, position 10 gets ~30% credit. This captures user behavior — people look at the first few results carefully and barely glance at results on page 2.
python import lightgbm as lgb import numpy as np # Features: BM25, bi-encoder sim, doc freshness, click rate, etc. # Labels: relevance grades (0=irrelevant, 1=fair, 2=good, 3=perfect) # Groups: number of docs per query (for listwise optimization) train_data = lgb.Dataset( X_train, label=y_train, group=group_train, feature_name=["bm25", "dense_sim", "freshness", "click_rate", "doc_length", "title_match"] ) params = { "objective": "lambdarank", # listwise LTR "metric": "ndcg", "ndcg_eval_at": [1, 5, 10], "num_leaves": 255, "learning_rate": 0.05, "min_data_in_leaf": 50, "lambdarank_truncation_level": 20, # only optimize top-20 } model = lgb.train(params, train_data, num_boost_round=1000, valid_sets=[val_data], callbacks=[lgb.early_stopping(50)])
Let us compute NDCG@5 for a specific ranking. This is a whiteboard favorite.
Notice: the ranking puts a relevant doc (rel=3) at position 5, wasting it. If we swapped positions 2 and 5, NDCG would jump because position 2 has a higher discount factor. This is exactly the insight that LambdaMART exploits — it computes the NDCG delta for every possible swap and focuses the gradient on swaps that matter most.
Symptom: LTR model achieves NDCG@10 of 0.95 on test set but only 0.72 in production. Root cause: Feature leakage — the training data includes click-through rate, which is a lagging indicator of relevance. A document shown at position 1 gets high CTR regardless of relevance (position bias). The model learns to rank already-high-position documents higher, creating a feedback loop. Fix: Use position-debiased CTR (divide by position-specific baseline CTR), or train on randomized traffic where positions are shuffled.
A production LTR model typically uses 100-500 features. They fall into categories:
| Category | Example features | Relative importance |
|---|---|---|
| Query-Doc Match | BM25, dense similarity, title match, URL match, exact phrase match | Highest (40-50% of model gain) |
| Document Quality | PageRank, spam score, freshness (days since update), content length, readability score | High (20-30%) |
| User Context | Search history, geographic location, device type, time of day, language | Medium (10-15%) |
| Interaction Signals | Historical CTR, average dwell time, long-click rate, reformulation rate | Medium (10-15%) |
| Cross Features | Query length × doc length, query category × doc category, user expertise × doc difficulty | Low but valuable (5-10%) |
The most important features are almost always BM25 and dense similarity — the rest provide 5-10% incremental improvement. But that incremental improvement matters enormously at web scale: 5% NDCG improvement translates to millions more satisfied users per day.
python import numpy as np from collections import defaultdict class PositionBasedModel: """Estimate position-debiased relevance from click logs.""" def __init__(self, max_pos=10): self.exam_prob = np.ones(max_pos) # P(examine | position) self.rel_counts = defaultdict(lambda: [0, 0]) # [clicks, impressions] def fit(self, click_log, iterations=20): """EM algorithm: alternate between estimating examination and relevance probabilities.""" for _ in range(iterations): # E-step: estimate P(relevant | click, pos) pos_clicks = np.zeros(len(self.exam_prob)) pos_imps = np.zeros(len(self.exam_prob)) for query, doc, pos, clicked in click_log: if pos >= len(self.exam_prob): continue pos_imps[pos] += 1 if clicked: pos_clicks[pos] += 1 self.rel_counts[(query, doc)][0] += 1 self.rel_counts[(query, doc)][1] += 1 # M-step: update examination probabilities self.exam_prob = np.clip(pos_clicks / (pos_imps + 1e-8), 0.01, 1.0) def debiased_relevance(self, query, doc, pos): """P(relevant | doc) = P(click) / P(examine | pos)""" clicks, imps = self.rel_counts.get((query, doc), [0, 1]) raw_ctr = clicks / max(1, imps) return raw_ctr / max(0.01, self.exam_prob[min(pos, len(self.exam_prob)-1)])
One of the most powerful aspects of tree-based LTR is automatic feature interaction discovery. A decision tree naturally captures interactions like "if BM25 > 5 AND freshness < 7 days, then boost score." These interactions are hard to specify manually but critical for ranking quality.
Here are the top feature interactions typically learned by LambdaMART, in order of importance:
| Interaction | What it captures | Example |
|---|---|---|
| BM25 × title_match | Exact query-title alignment signals navigational intent | "wikipedia" + title contains "Wikipedia" → rank #1 |
| freshness × query_type | News queries need recent docs; evergreen queries do not | "election results 2024" needs docs from today |
| dense_sim × doc_quality | Semantically similar but low-quality docs should be demoted | Spam farm copies high-quality content |
| CTR × position_seen | High CTR at low positions is a stronger signal than high CTR at position 1 | A doc clicked 30% of times shown at position 8 is excellent |
| doc_length × query_length | Long queries (specific) match better with long docs (detailed) | "how to fix React useState infinite loop" → StackOverflow answer |
Traditional LTR uses gradient-boosted trees (LambdaMART). The frontier: transformer-based LTR that takes the full list of candidates as input and produces a permutation. Models like SetRank and PiRank use self-attention across candidates to capture inter-document dependencies (e.g., two documents from the same source should not both appear at top). ListT5 (2024) uses a seq2seq model to directly generate the ranked list as a sequence of document IDs, optimized with REINFORCE.
The most practical frontier advancement is distillation from cross-encoders into LambdaMART. You run a cross-encoder on your training queries to generate soft relevance labels (much richer than binary click/no-click), then train LambdaMART on these soft labels. This gives you cross-encoder quality at tree-inference speed — the best of both worlds. Google's 2024 paper on "teacher-student ranking" showed this approach closes 60-70% of the gap between LambdaMART and cross-encoder reranking while adding zero serving latency.
Drag documents to reorder them. Relevance grades are shown (3=perfect, 0=irrelevant). NDCG updates in real-time as you reorder. Try to maximize NDCG@5.
You deployed a new ranking model. NDCG@10 improved 3% on your offline test set. But is it actually better for users? Offline metrics can lie — they are computed on a static dataset that may not represent real user behavior. Online evaluation (A/B testing) is the ground truth, but it is expensive and slow. This chapter covers the full evaluation stack: offline metrics, online experiments, and the dangerous gap between them.
MRR (Mean Reciprocal Rank): For queries with a single correct answer (navigational queries), MRR measures 1/rank of the first correct result, averaged across queries. If the correct result is at position 3, the reciprocal rank is 1/3. MRR is the right metric when users want ONE thing.
MAP (Mean Average Precision): For queries with multiple correct results, MAP averages the precision at each relevant position. If relevant documents are at positions 1, 3, and 7, the average precision is (1/1 + 2/3 + 3/7) / 3 = 0.72. MAP penalizes relevant results that appear after irrelevant ones.
NDCG@k: As covered in Chapter 7, NDCG handles graded relevance (not just binary relevant/irrelevant). It is the standard metric for web search because relevance is naturally graded — a perfect answer is better than a good answer, which is better than a tangentially related document.
Offline metrics tell you how the model performs on historical data. Online A/B testing tells you how it performs on real users. You split live traffic 50/50 between the old model (control) and the new model (treatment). After 1-2 weeks (depending on traffic volume), you compare key metrics:
| Metric | What it captures | Typical improvement bar |
|---|---|---|
| CTR | Users click something — basic engagement | +0.5% |
| Dwell time | Users spend time on clicked results — result quality | +1% |
| Reformulation rate | Users rephrase their query — indicates dissatisfaction | -0.5% |
| Session success rate | Users find what they need (no further searches) | +0.3% |
| Long-click rate | Users click and stay >30 seconds — strongest quality signal | +0.2% |
Interleaving is a more sensitive online evaluation method than A/B testing. Instead of showing each user results from only one model, you merge results from both models into a single list (using balanced interleaving or team-draft interleaving). You then measure which model's results get more clicks. Interleaving detects preferences with 10-100x less traffic than A/B testing because every user is exposed to both models.
Before launching an A/B test, you must compute the required sample size. If you run the test too short, you will not detect real improvements (Type II error). If you declare a winner too early, you might be fooled by noise (Type I error).
Clicks are the cheapest signal of relevance, but they are biased. Users click position 1 far more than position 5, regardless of relevance (position bias). Users click attractive snippets even for irrelevant documents (attraction bias). Click models decompose observed clicks into examination probability (depends on position) and relevance probability (depends on document quality):
The Position-Based Model (PBM) assumes examination depends only on position. The Cascade Model assumes users scan from top to bottom and stop after the first relevant click. The Dynamic Bayesian Network (DBN) is the most realistic: it models the user as scanning top-to-bottom, clicking if they examine a result AND find it relevant, and potentially continuing past clicks if the clicked result was not fully satisfying. These models let you extract unbiased relevance labels from biased click data — essential for training LTR models.
In practice, the choice of click model matters less than having one at all. A naive approach (raw CTR as relevance) introduces severe position bias. Any click model (even the simple PBM) removes 70-80% of this bias. The more sophisticated models (DBN) remove 90%+ but are harder to fit and require more data.
Let us see how position bias distorts raw click data and why debiasing is critical:
Without debiasing, documents that happen to be shown at position 5 get a CTR of 8% and are deemed low-quality. But after debiasing, their true relevance is 23% — they are actually decent results that suffer from low exposure. Training an LTR model on raw clicks creates a feedback loop: documents at low positions get low labels → model ranks them lower → they get even less exposure → their CTR drops further. Debiasing breaks this loop.
python import numpy as np def dcg(relevances, k): relevances = np.array(relevances)[:k] gains = 2 ** relevances - 1 discounts = np.log2(np.arange(2, len(relevances) + 2)) return np.sum(gains / discounts) def ndcg(relevances, k=10): actual = dcg(relevances, k) ideal = dcg(sorted(relevances, reverse=True), k) return actual / ideal if ideal > 0 else 0.0 # Example: document relevance grades in the order they appear ranking = [3, 0, 2, 1, 0, 3, 0, 0, 1, 2] print(f"NDCG@5: {ndcg(ranking, 5):.3f}") # ~0.82 print(f"NDCG@10: {ndcg(ranking, 10):.3f}") # ~0.79
Symptom: NDCG@10 improved 3% offline, but online A/B test shows -0.5% session success rate. Root cause: The new model is better at ranking navigational queries (where there is one right answer) but worse at informational queries (where users want diverse perspectives). NDCG does not capture diversity — it only measures relevance. Fix: Add α-NDCG (diversity-aware NDCG) to offline metrics. Segment online results by query type to identify where the regression happens.
python import random def team_draft_interleave(ranking_a: list, ranking_b: list, k=10) -> tuple: """Interleave two rankings using Team Draft method. Returns (interleaved_list, team_assignments). team_assignments[i] = 'A' or 'B' indicating which model contributed doc at position i. """ result, teams = [], [] seen = set() ptr_a, ptr_b = 0, 0 while len(result) < k: # Randomly choose which team picks next team_a_turn = random.random() < 0.5 if team_a_turn: while ptr_a < len(ranking_a) and ranking_a[ptr_a] in seen: ptr_a += 1 if ptr_a < len(ranking_a): result.append(ranking_a[ptr_a]) teams.append('A') seen.add(ranking_a[ptr_a]) ptr_a += 1 else: while ptr_b < len(ranking_b) and ranking_b[ptr_b] in seen: ptr_b += 1 if ptr_b < len(ranking_b): result.append(ranking_b[ptr_b]) teams.append('B') seen.add(ranking_b[ptr_b]) ptr_b += 1 return result, teams # After collecting clicks, count wins for each team: # If user clicked positions where team_assignments='A', team A wins. # Aggregate across thousands of queries → statistical test.
A production evaluation system needs: (1) an offline eval pipeline that runs every model checkpoint against 5-10 eval sets (each targeting different query types), (2) a canary deployment system that routes 1% of traffic to the new model and monitors key metrics for anomalies, (3) a full A/B testing framework with proper statistical power analysis (minimum detectable effect, required sample size), (4) interleaving for rapid comparison of close models.
The quality of your offline evaluation depends entirely on the quality of your eval sets. A good eval framework maintains 5-10 eval sets, each capturing a different failure mode:
| Eval Set | Size | What it tests | How labels are collected |
|---|---|---|---|
| Head queries | 5K queries | Quality on the most common queries | Human judgments (3 raters per pair) |
| Tail queries | 10K queries | Long-tail query coverage | Click-based labels (debiased) |
| Freshness queries | 1K queries | Current events, new content | Human judgments (weekly refresh) |
| Adversarial queries | 500 queries | Ambiguous, tricky, or adversarial inputs | Expert annotation |
| Vertical queries | 2K/vertical | Quality per vertical (images, video, news) | Vertical-specific human judgments |
| International queries | 5K/language | Non-English query quality | Native-speaker human judgments |
MMR balances relevance with diversity by penalizing candidates that are too similar to already-selected results:
python import numpy as np def mmr_rerank(query_emb, doc_embs, doc_scores, k=10, lam=0.7): """Maximal Marginal Relevance: balance relevance vs diversity. query_emb: (dim,) query embedding doc_embs: (N, dim) document embeddings doc_scores: (N,) relevance scores from reranker lam: 1.0 = pure relevance, 0.0 = pure diversity Returns: indices of selected documents """ selected = [] remaining = list(range(len(doc_scores))) for _ in range(k): best_idx, best_score = -1, float('-inf') for idx in remaining: # Relevance term relevance = doc_scores[idx] # Diversity term: max similarity to already selected if selected: sims = [np.dot(doc_embs[idx], doc_embs[s]) for s in selected] max_sim = max(sims) else: max_sim = 0 # MMR score score = lam * relevance - (1 - lam) * max_sim if score > best_score: best_score, best_idx = score, idx selected.append(best_idx) remaining.remove(best_idx) return selected
This scenario comes up in every search interview and catches unprepared candidates:
Diagnosis steps:
1. Segment by query type. Model B improved NDCG on informational queries (Wikipedia-style) by +8% but degraded transactional queries (buy/download) by -3%. Informational queries are more common in the test set, inflating offline metrics. But transactional queries drive CTR (users click to buy), so the online metric drops.
2. Check the eval set representativeness. The offline eval set was built from human judgments collected 6 months ago. The query distribution has shifted (more product searches due to holiday season). The eval set no longer represents current traffic.
3. Check for diversity regression. Model B's higher NDCG comes from putting the single best result higher (NDCG rewards this). But it also reduces result diversity, showing 3 results from the same domain. Users scan once, see no variety, and reformulate. Lower diversity → higher reformulation rate → worse user experience.
4. Resolution: Add transactional query eval set. Add diversity metrics (α-NDCG) to offline eval. Add domain diversity constraint to result assembly. Re-run A/B test.
Human relevance judgments are expensive ($0.50-2.00 per judgment). LLM-as-Judge uses GPT-4 or Claude to assess relevance at 1000x the speed and 100x lower cost. TREC 2024 showed that GPT-4 judgments correlate r=0.85 with human judgments for graded relevance. The catch: LLMs have their own biases (verbosity bias, anchoring to position 1). Calibration with a small human-annotated set is essential.
The production setup: use LLM-as-Judge for rapid offline eval (run on every model checkpoint, ~$50 per eval), but always validate with human judgments before launching (reserve $5-10K budget for human eval on promising models). This gives you fast iteration cycles with a human-quality safety net.
Compare two ranking models across multiple metrics. The radar chart shows relative strengths. Hover over metrics to see details.
Models are only as good as the data they train on. At web scale, the data challenges are staggering: billions of web pages that change daily, click logs generating terabytes per hour, feature stores that must serve sub-millisecond lookups, and data quality issues that silently degrade model performance. This chapter covers the infrastructure that turns raw web data into training signals.
A web crawler discovers and downloads web pages. At scale, this is a distributed system problem: how to crawl billions of URLs efficiently, politely (respecting robots.txt and rate limits), and freshly (re-crawling changed pages). The URL frontier is a priority queue of URLs to crawl, ordered by (1) importance (PageRank), (2) change frequency, and (3) time since last crawl.
Key metrics: coverage (what fraction of the web you have indexed), freshness (how recently each page was crawled), and politeness (requests per second per domain — typically 1 req/sec per domain to avoid overwhelming servers).
python import heapq import time from dataclasses import dataclass, field @dataclass(order=True) class CrawlTask: priority: float # lower = more urgent url: str = field(compare=False) domain: str = field(compare=False) last_crawled: float = field(compare=False, default=0) class URLFrontier: def __init__(self, politeness_delay=1.0): self.heap = [] # priority queue self.domain_last_access = {} # domain → timestamp self.politeness = politeness_delay # seconds between same-domain requests def add(self, url, domain, pagerank=0.0, change_freq=1.0): # Priority: lower pagerank (more important pages first) # + staleness bonus (older crawls get priority) priority = -pagerank + 1.0 / max(0.01, change_freq) heapq.heappush(self.heap, CrawlTask(priority, url, domain)) def next_url(self): """Return next URL respecting politeness constraints.""" now = time.time() skipped = [] while self.heap: task = heapq.heappop(self.heap) last = self.domain_last_access.get(task.domain, 0) if now - last >= self.politeness: self.domain_last_access[task.domain] = now for s in skipped: # re-add skipped tasks heapq.heappush(self.heap, s) return task.url skipped.append(task) # domain too recent, skip # All domains on cooldown for s in skipped: heapq.heappush(self.heap, s) return None
The web is full of duplicates. Exact copies (same URL, different mirrors), near-duplicates (same article, different ads), and boilerplate pages (product listings that differ only in one field). Training on duplicates wastes compute and biases the model toward duplicated content.
Exact deduplication: Hash each document (SHA-256). Same hash = exact duplicate. O(1) per document with a hash table. Simple, fast, catches only exact copies.
Near-duplicate detection: MinHash + LSH (Locality-Sensitive Hashing). Represent each document as a set of shingles (n-gram character sequences). Compute k min-hash signatures. Documents with high Jaccard similarity (many matching min-hashes) are near-duplicates. Can process billions of documents with a single pass over the data.
Let us trace MinHash dedup on two documents to see why it works.
The beauty of MinHash: the probability that two documents' MinHash values agree equals their Jaccard similarity. With 128 hash functions, the estimated Jaccard is accurate to ±0.09 with 95% confidence. LSH (Locality-Sensitive Hashing) then buckets documents by bands of hash values, so you only compare documents that share at least one band — reducing the all-pairs comparison from O(n²) to O(n).
A feature store bridges the gap between batch feature computation (offline, slow, cheap) and online feature serving (real-time, fast, expensive). Features like "CTR for this document on similar queries" are computed daily in a batch job (Spark/Flink), written to a feature store (Redis, DynamoDB, Feast), and read at sub-millisecond latency during serving.
The key abstraction: features are keyed by (entity_id, feature_name, timestamp). The feature store handles: (1) point-in-time correctness (for training, only use features available at the time of the interaction — no future leakage), (2) online/offline consistency (the same feature definition is used for training and serving), (3) versioning (rollback to previous feature values if a computation is buggy).
Production feature stores (Feast, Tecton, Hopsworks) provide exactly this abstraction. The online store (Redis, DynamoDB) serves features at p99 < 5ms. The offline store (BigQuery, Hive, Delta Lake) serves historical features for training. A materialization job keeps them in sync. The cost: non-trivial infrastructure complexity, but the alternative (ad-hoc feature computation that inevitably diverges between training and serving) is worse.
python import hashlib from collections import defaultdict def shingles(text, k=5): """Generate character k-shingles from text.""" return {text[i:i+k] for i in range(len(text) - k + 1)} def minhash_signature(shingle_set, num_hashes=128): """Compute MinHash signature: num_hashes minimum hash values.""" sig = [] for i in range(num_hashes): min_h = float('inf') for s in shingle_set: h = int(hashlib.md5(f"{i}_{s}".encode()).hexdigest(), 16) min_h = min(min_h, h) sig.append(min_h) return tuple(sig) def lsh_buckets(signature, bands=16): """Split signature into bands for LSH bucketing.""" rows = len(signature) // bands buckets = [] for b in range(bands): band = signature[b*rows : (b+1)*rows] bucket = hashlib.md5(str(band).encode()).hexdigest() buckets.append((b, bucket)) return buckets # docs sharing a bucket in any band are candidates
Symptom: Model quality degrades over 3 months despite no model changes. Root cause: A data pipeline change silently dropped non-English documents from the training set. The model gradually became English-centric as older training data aged out. Fix: Monitor data distribution metrics (language distribution, document length distribution, source diversity) and alert on significant shifts. Add data validation checks (Great Expectations, Deequ) to the pipeline.
Symptom: Click-through rate features in the feature store are stale by 24 hours. Root cause: The batch job that computes CTR runs once daily at 3 AM. New content published during the day has no CTR feature and falls back to a default value, getting systematically under-ranked. Fix: Add a real-time click counter (using Kafka + Flink) that updates CTR estimates within minutes. Use the batch-computed CTR as a prior and the real-time counter as a Bayesian update.
Symptom: Your embedding model suddenly produces embeddings that cluster into two distinct regions instead of a smooth distribution. Root cause: A new batch of training data was accidentally duplicated 5x, creating a strong bias toward those documents' topics. The model overfit to the duplicated topics during the last training run. Fix: (1) Add deduplication to the training data pipeline (not just the document pipeline). (2) Add a training data validation check that alerts if any (query, doc) pair appears more than N times. (3) Monitor embedding space statistics (average cosine similarity, cluster count via k-means) after each training run.
The highest-impact features for ranking are real-time signals that change within minutes or hours. Building a streaming feature pipeline requires different engineering from batch pipelines:
| Feature | Update latency | Infrastructure |
|---|---|---|
| Trending score | 5 minutes | Flink/Spark Streaming counting queries per topic over sliding windows |
| Real-time CTR | 1 minute | Kafka consumer updating Redis counters with Bayesian smoothing |
| User last-clicked | Seconds | Direct write to user session store on click event |
| Breaking news flag | Minutes | Anomaly detector on query volume spikes per topic |
| Document freshness | Hours | Crawler metadata propagated through the index update pipeline |
The architecture: Kafka as the event bus, Flink for stream processing, Redis for real-time feature serving, with a reconciliation job that periodically validates streaming features against batch-computed ground truth. The key risk: streaming features can be noisy (a click spike could be a bot attack, not genuine interest). Always smooth with batch baselines.
The data flows: (1) Crawler writes raw HTML to object storage (S3/GCS). (2) A parsing job extracts text, metadata, links. Deduplicates using MinHash. (3) An embedding job runs documents through the bi-encoder, writes vectors to the ANN index. (4) A feature computation job computes document quality signals, writes to the feature store. (5) Click logs stream through Kafka into a data lake (Delta Lake/Iceberg). (6) A training data pipeline joins click logs with document features to produce (query, document, label) training triples. (7) Model training reads from the data lake. The entire pipeline runs daily with ~6-hour end-to-end latency.
python import numpy as np from dataclasses import dataclass @dataclass class DataQualityReport: total_docs: int null_rate: dict # feature_name → fraction of nulls distribution_drift: dict # feature_name → KL divergence from baseline duplicate_rate: float alerts: list def check_training_data(current_batch, baseline_stats): """Run data quality checks before training. Catches silent pipeline bugs that degrade model quality.""" alerts = [] report = DataQualityReport( total_docs=len(current_batch), null_rate={}, distribution_drift={}, duplicate_rate=0.0, alerts=[] ) # Check 1: Volume anomaly (±30% from expected) expected = baseline_stats["expected_volume"] if abs(len(current_batch) - expected) / expected > 0.3: alerts.append(f"Volume anomaly: {len(current_batch)} vs expected {expected}") # Check 2: Null rate per feature for feature in baseline_stats["features"]: nulls = sum(1 for row in current_batch if row.get(feature) is None) null_rate = nulls / max(1, len(current_batch)) report.null_rate[feature] = null_rate if null_rate > baseline_stats["max_null_rate"].get(feature, 0.01): alerts.append(f"High null rate for {feature}: {null_rate:.2%}") # Check 3: Distribution drift (simplified KL divergence) # In production, use Great Expectations or custom Flink monitoring report.alerts = alerts return report
Building (query, document, relevance_label) training triples at scale is its own engineering challenge. Here is the production pipeline:
The most underappreciated part of web-scale search is treating your training data as a product with its own SLOs:
| SLO | Target | How to measure |
|---|---|---|
| Freshness | Training data <24 hours stale | Max timestamp in training set vs. current time |
| Coverage | >95% of active documents in training set | Fraction of served documents that appear in training data |
| Label quality | <5% noisy labels | Agreement rate between click-based labels and human judgments |
| Feature completeness | <1% null rate for critical features | Automated null-rate monitoring per feature |
| Distribution stability | <0.05 KL divergence from baseline | Daily distribution checks per feature |
InPars and Promptagator use LLMs to generate synthetic (query, document) pairs for training retrieval models. Given a document, GPT-4 generates plausible queries that the document would answer. This lets you bootstrap training data for new domains (medical, legal, code) where labeled data is scarce. UDAPDR (2024) combines synthetic data generation with domain adaptation, achieving near-human-labeled quality for specialized verticals.
The synthetic data pipeline: (1) Sample 100K diverse documents from the target domain. (2) For each document, prompt GPT-4: "Generate 5 natural search queries that this document would be the best answer for." (3) Use the generated (query, document) pairs as positive training examples. (4) Mine hard negatives from BM25. (5) Fine-tune your bi-encoder on this data. Cost: ~$500 for 500K synthetic pairs (at GPT-4 pricing). Quality: within 3-5% of human-labeled data on most benchmarks. Time: 24 hours end-to-end vs. 4-6 weeks for human annotation.
Watch data flow through the pipeline. Each stage shows throughput and health. Inject a data quality issue to see how it propagates downstream.
For decades, search and recommendations were separate disciplines with separate teams, separate models, and separate infrastructure. Search people cared about queries and documents. Recommendations people cared about users and items. But the boundaries are dissolving. This chapter explains why, and what the unified future looks like.
Three forces are driving the merger:
1. Shared infrastructure. Both search and recommendations use: embedding models (bi-encoders), vector indexes (FAISS/HNSW), feature stores (Redis), LTR models (LambdaMART), and A/B testing frameworks. Running separate infrastructure for each is wasteful. Unifying saves 30-50% in infrastructure costs.
2. Shared embeddings. A user searching "wireless headphones" and a user who browsed three headphone product pages express the same intent through different channels. The embedding model that encodes "wireless headphones" and the embedding model that encodes a user's browsing history can (and should) produce vectors in the same space. Then both explicit search and implicit recommendations become nearest-neighbor lookups in one index.
3. Generative models blur the line. When a user asks ChatGPT "what are the best wireless headphones?", is that search or a recommendation? It is both. The model retrieves information (search), personalizes based on context (recommendation), and generates a coherent response (language modeling). The three capabilities are inseparable.
RAG combines retrieval (search) with generation (LLM). Given a query, first retrieve relevant documents using dense retrieval, then condition the LLM on the retrieved documents to generate an answer. This is the architecture behind every production LLM chatbot that needs to answer questions about up-to-date or private information.
RAG is where search meets generation. The retriever is a search system. The generator is a recommendation system (it selects and synthesizes information for this specific user). And the whole thing runs on transformers.
Let us trace a complete RAG request with concrete tensor shapes and timing.
Generative retrieval goes further: instead of retrieving documents and then generating, the model directly generates the answer from its parametric memory, augmented by retrieval when needed. The Differentiable Search Index (DSI) memorizes document IDs during training and generates them given a query. GENRE generates entity names (Wikipedia titles) instead of opaque IDs, making the output interpretable.
The radical idea: the entire inverted index, vector index, and reranking pipeline could be replaced by a single very large language model that has "memorized" the web. We are not there yet — current generative retrieval works at millions of documents, not billions — but the trajectory is clear.
The practical implication for your career: if you understand both the traditional pipeline (BM25, inverted index, feature engineering, LambdaMART) AND the neural approaches (dense retrieval, transformers, generative retrieval), you can navigate the transition. Companies need people who can maintain the existing system while building the next one. That is exactly the staff-level skill set this lesson prepares you for.
| Use case | Best approach (2025) | Why |
|---|---|---|
| Web search (50B docs) | Hybrid BM25 + dense, cross-encoder rerank | Scale demands BM25 first-pass; cross-encoder adds semantic understanding |
| E-commerce search (10M products) | Unified two-tower for search + recs | Small enough for one index; users seamlessly switch between searching and browsing |
| Enterprise knowledge base (1M docs) | RAG with cross-encoder gating | Users want answers, not links; RAG generates natural language responses |
| Content feed (recs-heavy) | Sequence model (SASRec) + two-tower candidate gen | Temporal user interest matters more than query matching |
| Conversational search (agent) | Multi-turn RAG with query decomposition | Users iteratively refine; agent maintains context across turns |
python import faiss import numpy as np class RAGPipeline: def __init__(self, encoder, index, docs, llm_client): self.encoder = encoder # bi-encoder for queries self.index = index # FAISS ANN index self.docs = docs # list of document texts self.llm = llm_client # LLM API client def answer(self, query: str, top_k=5) -> str: # 1. Embed query q_emb = self.encoder.encode(query) # (1, 128) # 2. Retrieve _, indices = self.index.search(q_emb, top_k) context = "\n\n".join(self.docs[i] for i in indices[0]) # 3. Generate prompt = f"""Answer based on the context below. Context: {context} Question: {query} Answer:""" return self.llm.generate(prompt, max_tokens=500)
Symptom: RAG system confidently generates wrong answers even when the correct document was retrieved. Root cause: The LLM ignores the retrieved context and relies on its parametric knowledge (which is outdated). This is called "retrieval neglect." Fix: (1) Put retrieved context at the end of the prompt (closer to the generation position), (2) use "Answer ONLY based on the provided context" instruction, (3) fine-tune the LLM specifically for RAG (REPLUG, RAFT).
Symptom: RAG returns irrelevant passages and the LLM hallucinates an answer based on them. Root cause: The retrieval model and the generation model have a "domain gap." The retriever was trained on one distribution (e.g., Wikipedia) but is being used on another (e.g., technical documentation). Fix: (1) Fine-tune the retriever on domain-specific data. (2) Add a relevance threshold — if no retrieved document scores above 0.5, return "I don't know" instead of hallucinating. (3) Use a cross-encoder as a relevance gate between retrieval and generation.
Symptom: RAG works well for factual questions but fails for comparison/synthesis queries ("compare X vs Y"). Root cause: Each retrieval pass returns documents about X or Y, but not documents that compare them. The LLM must synthesize across multiple documents, which is harder. Fix: (1) Decompose the query: separately retrieve for X and Y, then prompt the LLM to compare. (2) Use multi-hop retrieval: retrieve once, read, generate a follow-up query, retrieve again. (3) Increase the number of retrieved documents (k=10 instead of k=5) to increase the chance of getting diverse perspectives.
python class GatedRAG: """RAG with a cross-encoder relevance gate. If no retrieved doc passes the relevance threshold, return a 'not enough information' response instead of hallucinating.""" def __init__(self, retriever, reranker, llm, threshold=0.5): self.retriever = retriever self.reranker = reranker # cross-encoder self.llm = llm self.threshold = threshold def answer(self, query: str) -> str: # 1. Retrieve candidates candidates = self.retriever.search(query, top_k=20) # 2. Rerank and filter by relevance scored = self.reranker.rerank(query, candidates, top_k=5) relevant = [(doc, s) for doc, s in scored if s > self.threshold] # 3. Gate: if nothing relevant, don't hallucinate if not relevant: return "I don't have enough reliable information to answer this." # 4. Generate answer from relevant context only context = "\n\n".join(doc for doc, _ in relevant) return self.llm.generate( f"Context:\n{context}\n\nQuestion: {query}\n" f"Answer based ONLY on the context above:" )
python import torch import torch.nn as nn import torch.nn.functional as F class UnifiedEncoder(nn.Module): """One encoder, two heads: search + recommendations.""" def __init__(self, base_encoder, dim=768, proj_dim=128): super().__init__() self.encoder = base_encoder # shared BERT backbone # Shared bottom layers, separate projection heads self.search_head = nn.Linear(dim, proj_dim) # for explicit queries self.rec_head = nn.Linear(dim, proj_dim) # for user profiles self.item_head = nn.Linear(dim, proj_dim) # shared item encoding def encode_query(self, input_ids, mask): h = self.encoder(input_ids, mask).last_hidden_state[:, 0] return F.normalize(self.search_head(h), dim=-1) def encode_user(self, history_embs): # Mean pool over user's last N item embeddings h = history_embs.mean(dim=1) # (B, dim) return F.normalize(self.rec_head(h), dim=-1) def encode_item(self, input_ids, mask): h = self.encoder(input_ids, mask).last_hidden_state[:, 0] return F.normalize(self.item_head(h), dim=-1) def search_loss(self, q_ids, q_mask, d_ids, d_mask, tau=0.05): q = self.encode_query(q_ids, q_mask) d = self.encode_item(d_ids, d_mask) return F.cross_entropy(q @ d.T / tau, torch.arange(len(q), device=q.device)) def rec_loss(self, user_hist, item_ids, item_mask, tau=0.1): u = self.encode_user(user_hist) i = self.encode_item(item_ids, item_mask) return F.cross_entropy(u @ i.T / tau, torch.arange(len(u), device=u.device))
The converged architecture: one embedding model serves both search queries and user profiles. One ANN index stores all content. Search queries and user profiles are both embedded into the same space. The only difference is the ranking stage: search applies a relevance-focused reranker, recommendations apply a personalization-focused reranker. Both feed into the same result assembly layer that handles diversity, business rules, and A/B testing.
Symptom: After unifying the search and recommendation embedding models, search quality drops 2% while recommendation quality improves 3%. Root cause: The two tasks have conflicting gradient signals. Search wants "wireless headphones" close to product reviews; recommendations want it close to related products the user has purchased (which may include cables, cases, stands). The shared encoder is compromised. Fix: Share only the bottom 8 of 12 transformer layers; keep the top 4 task-specific. This lets the lower layers learn general text understanding while the upper layers specialize for each task. This "hard parameter sharing with task-specific heads" pattern is standard in multi-task learning.
Search as a conversation. Users will increasingly interact with search through multi-turn dialogue, not single queries. The system must maintain context across turns (remember that "it" refers to the headphone from the previous query).
Agentic search. AI agents that can browse the web, fill out forms, compare prices, and complete purchases — this is exactly what Parallel is building. The agent IS the search interface. The agent does not return ten blue links; it returns an answer, a comparison table, or a completed action. This requires the agent to have a full search pipeline internally (retrieve, rank, synthesize) plus the ability to take actions (click, fill forms, submit orders).
Personalized foundation models. Instead of a generic retrieval model + personalization layer, train user-specific adapters (LoRA) that bake personalization into the model itself. Early results show 5-10% improvement over two-stage approaches.
Multimodal search and recommendations. Users increasingly search with images (screenshot this product, find something similar), voice (conversational queries), and mixed modality (photo of a dish → find recipe). Models like SigLIP (2024) and Jina CLIP v2 produce embeddings for text and images in the same space, enabling cross-modal retrieval with the same ANN infrastructure.
How this all comes together at a company like Parallel, which builds AI agents with programmatic web access:
Three domains converging into one system. Toggle between 2015 (separate) and 2025 (unified) architectures to see how the boundaries dissolve.
Everything from the previous 10 chapters comes together here. You are going to operate a complete search pipeline — from query to ranked results — and tune every stage. Type a query, choose your retrieval model, set the latency budget, and watch the pipeline process your request in real time. Break it. Fix it. Understand how every knob affects quality and performance.
Configure the pipeline, then click Execute Query to watch it run. Adjust parameters to optimize the quality-latency tradeoff. Try to hit NDCG>0.7 with latency<200ms.
The top section shows two critical metrics: latency (must be <200ms) and NDCG@10 (want >0.7). The waterfall chart below shows each pipeline stage's contribution to total latency. The red dashed line at 200ms is the budget. If any bar extends past it, you are over budget.
The interaction between settings reveals the fundamental tradeoffs of search system design:
More reranking candidates = higher quality but higher latency. Going from 50 to 200 candidates improves NDCG by ~2% but triples reranking time. Going from 200 to 500 gives diminishing returns (<0.5% NDCG) but doubles the time again. The production sweet spot is 100-200 candidates.
Bigger index = higher latency for both BM25 and ANN. A 100B-document index takes 2-3x longer to search than a 1B index. This is where sharding and caching earn their keep — without them, latency grows linearly with index size.
Cross-encoder vs. ColBERT vs. LambdaMART. Cross-encoder gives the best quality but highest latency. ColBERT is ~60% of the latency at ~90% of the quality (token-level interaction without full cross-attention). LambdaMART is nearly free latency-wise (tree traversal is microseconds) but relies on hand-crafted features.
Experiment with these scenarios:
| Scenario | Settings | What you learn |
|---|---|---|
| Speed demon | BM25 only, no reranker, 10M docs | Ultra-fast but low relevance — misses semantic matches |
| Quality maximizer | Hybrid, cross-encoder, 500 candidates | Best NDCG but latency blows the budget at scale |
| Production sweet spot | Hybrid, cross-encoder, 100 candidates, 1B docs | The standard production configuration — balance of quality and speed |
| ColBERT middle ground | Hybrid, ColBERT, 200 candidates | Better latency than cross-encoder at slightly lower quality |
If you cannot hit NDCG > 0.7 and latency < 200ms simultaneously, here is the decision tree:
Latency too high? (1) Reduce reranking candidates from 200 to 100 (—1% NDCG, —50% rerank latency). (2) Switch from cross-encoder to ColBERT (—2% NDCG, —60% rerank latency). (3) Reduce ANN nprobe (—0.5% recall, —30% ANN latency). (4) Add result caching for head queries (0% quality impact, —20-40% average latency).
NDCG too low? (1) Switch from BM25-only to hybrid retrieval (+5-10% NDCG, +5ms latency). (2) Add cross-encoder reranking (+8-12% NDCG, +40ms latency). (3) Increase reranking candidates from 50 to 200 (+2-3% NDCG, +30ms latency). (4) Fine-tune the bi-encoder on domain-specific data (+3-5% NDCG, no latency impact).
The Pareto frontier for web search: you cannot improve NDCG without increasing latency, and vice versa. The art is finding the point on this frontier that matches your product's requirements. A search engine optimizes for quality (users wait 200ms). An autocomplete system optimizes for speed (must respond in 50ms). An ad auction optimizes for revenue-per-query within a latency budget.
This chapter distills everything into the formats you will face in a staff-level interview at Parallel (or any web-scale AI company). You get: a cheat sheet of the most important concepts, system design questions with solution skeletons, coding exercises, and debugging scenarios.
| Concept | One-sentence definition | When to mention it |
|---|---|---|
| BM25 | Lexical scoring with TF saturation and length normalization — the baseline that every system includes | Any retrieval question |
| Bi-encoder | Independent encoding of query and document into shared embedding space, scored by dot product | Dense retrieval, candidate generation |
| Cross-encoder | Joint encoding of (query, document) pair for maximum accuracy at high latency cost | Reranking stage |
| ColBERT (MaxSim) | Token-level late interaction: more accurate than bi-encoder, faster than cross-encoder, but larger index | When asked about the bi-encoder vs cross-encoder tradeoff |
| InfoNCE loss | Contrastive loss that pushes positive pairs together and negative pairs apart using in-batch negatives | Training embedding models |
| Two-tower model | Bi-encoder for recommendations: user tower + item tower, inner product scoring | Recommendation system design |
| LambdaMART | Gradient-boosted trees with NDCG-aware gradient weighting — production standard for LTR | Learning to rank, feature engineering |
| NDCG@k | Standard ranking metric that accounts for graded relevance and position discount | Any evaluation question |
| IVF-PQ | FAISS compound index: cluster-then-compress for billion-scale ANN search | Serving, latency, scale questions |
| RAG | Retrieve relevant documents, then generate answer conditioned on them | Any question about LLMs + search |
Skeleton: (1) Two-stage retrieval: BM25 over sharded inverted index + dense ANN over sharded vector index, in parallel. (2) Merge top-1000 candidates. (3) Feature enrichment from Redis feature store. (4) LambdaMART for lightweight ranking (1000 → 200). (5) Cross-encoder reranking on GPU pool (200 → 20). (6) Result assembly with diversity and personalization. Latency budget: 150ms. Key decisions: scatter-gather fanout width, reranking batch size, cache layers.
Skeleton: One bi-encoder produces embeddings for both queries and user profiles. One ANN index stores all content embeddings. Search: query → embed → ANN → rerank (relevance-focused). Recs: user history → embed → ANN → rerank (personalization-focused). Shared infrastructure: embedding model, vector index, feature store, A/B testing. Separate: reranking models (different objectives), result assembly rules (different diversity constraints).
Skeleton: (1) Check offline metrics on recent eval sets — is the model itself worse? (2) Check data pipeline health — are training features stale, missing, or distribution-shifted? (3) Check index freshness — has the crawl coverage dropped? (4) Check serving infrastructure — are latency spikes causing timeouts in the reranking stage, falling back to lower-quality results? (5) Check external factors — has the query distribution shifted (seasonal, news events)?
For each coding exercise, here are the patterns interviewers want to see:
python # EXERCISE 5: Hybrid retrieval fusion (Reciprocal Rank Fusion) # Given two ranked lists, combine them into one using RRF def reciprocal_rank_fusion(rankings: list[list[int]], k=60) -> list[int]: """Combine multiple ranked lists using Reciprocal Rank Fusion. Each ranking is a list of document IDs in rank order. RRF score = sum(1 / (k + rank_i)) across all rankings. """ scores = {} for ranking in rankings: for rank, doc_id in enumerate(ranking): scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1) return sorted(scores, key=scores.get, reverse=True) # Usage: combine BM25 and dense retrieval results bm25_results = [42, 17, 89, 3, 55] # top-5 from BM25 dense_results = [89, 42, 7, 31, 17] # top-5 from dense hybrid = reciprocal_rank_fusion([bm25_results, dense_results]) # Result: [42, 89, 17, ...] — docs appearing in both lists rank highest
Skeleton: (1) Week 1-2: instrument click logging. Every search result click gets logged with (user_id, query, doc_id, position, timestamp, dwell_time). (2) Week 3-4: build a simple user profile. For each user, maintain the last 50 clicked documents' embeddings. The user profile is the mean of these embeddings. (3) Week 5-8: modify the result assembly layer. For each candidate document, compute cosine similarity between the document embedding and the user profile embedding. Add this as a feature to the LTR model. Retrain LambdaMART with this new feature. (4) Week 9-10: A/B test. (5) Week 11-12: iterate based on results. This is the minimum viable personalization — it adds ~3-5% NDCG improvement with minimal infrastructure investment.
Skeleton: (1) Query classification: train a text classifier to detect vertical intent ("chicken parmesan recipe" → recipes vertical). (2) Vertical-specific indexing: extract structured fields (ingredients, cook time, rating) during crawl. Store as additional columns in the feature store. (3) Vertical-specific ranking: train a separate LTR model for the vertical, with vertical-specific features (e.g., recipe rating, ingredient match). (4) Vertical-specific UX: when vertical is triggered, show a rich result card (recipe card with image, time, rating) instead of standard blue links. (5) Fallback: if confidence in vertical classification is low, show standard results with the vertical card as a secondary result.
These are the quick-fire "warm-up" questions interviewers use in the first 5 minutes. Have a crisp 2-3 sentence answer ready for each:
| Question | Answer skeleton |
|---|---|
| "What is the difference between precision and recall?" | Precision = fraction of retrieved docs that are relevant. Recall = fraction of relevant docs that are retrieved. You trade off between them: more aggressive retrieval increases recall but hurts precision. |
| "Why use cosine similarity instead of dot product?" | Cosine normalizes by vector magnitude, so it measures angle between vectors (semantic direction) regardless of scale. Dot product rewards both alignment AND magnitude. Use cosine when embeddings are L2-normalized (they are equivalent); use dot product when magnitude carries information (e.g., document importance). |
| "What is the curse of dimensionality?" | In high dimensions, all points become equidistant. With 768-d embeddings, the ratio of nearest-to-farthest neighbor distance approaches 1.0. This makes exact nearest neighbor search useless — you need approximate methods that exploit local structure. |
| "Why not just use the LLM directly for retrieval?" | Cost and latency. A BERT forward pass takes 5ms; a GPT-4 call takes 500ms and costs 100x more. At 50K QPS, you cannot afford an LLM call per query. Use cheap models (BM25, bi-encoder) for retrieval, expensive models (cross-encoder, LLM) only for reranking the top candidates. |
| "Explain the bias-variance tradeoff in ranking." | A high-bias model (BM25) consistently underperforms because it cannot capture semantics, but it is stable across datasets. A high-variance model (fine-tuned cross-encoder) can overfit to training data distribution and degrade on new query patterns. The production solution: ensemble multiple models (BM25 + dense + LTR) to reduce variance while capturing complex patterns. |
Here is how a staff-level interview exchange should sound. Practice delivering this level of specificity:
Strong answer: "First, I would check the A/B test setup itself — was traffic split correctly? Is the sample size sufficient for the expected effect size? A 3% NDCG improvement might translate to only 0.2% CTR improvement, which needs more traffic to detect. Second, I would segment the results by query type. Maybe the improvement is concentrated on informational queries where users do not click (they read the snippet) while transactional queries (where users click) saw no improvement. Third, I would check for position bias in the NDCG evaluation — if the offline eval uses biased click labels, the 3% might be an artifact. Fourth, I would look at secondary metrics: did dwell time improve? Did reformulation rate decrease? These might show improvement even when CTR does not. Finally, I would consider running an interleaving experiment, which is 10-100x more sensitive than A/B testing for detecting ranking quality differences."
Strong answer: "Three strategies, deployed in layers. (1) Population-level priors: for a brand new user, use the most popular results for each query, weighted by the user's geographic region and device type (the only signals we have). (2) Quick profile bootstrap: after just 3-5 clicks, compute a rudimentary user embedding by averaging the embeddings of clicked documents. This immediately enables basic personalization. (3) Explicit preference collection: on first visit, show a 'pick 5 topics that interest you' widget. This provides direct signal without requiring any click history. The key insight is that cold start is a spectrum, not a binary — you get progressively better with each interaction, and your system should degrade gracefully to population-level baselines, never to random."
| Paper | Year | Key contribution |
|---|---|---|
| DPR (Karpukhin et al.) | 2020 | Dense Passage Retrieval with bi-encoders + hard negatives |
| ColBERT (Khattab & Zaharia) | 2020 | Late interaction with per-token embeddings |
| ANCE (Xiong et al.) | 2021 | Approximate nearest neighbor negative mining for training |
| SPLADE (Formal et al.) | 2021 | Learned sparse representations compatible with inverted indexes |
| DSI (Tay et al.) | 2022 | Differentiable Search Index — generative retrieval |
| InPars (Bonifacio et al.) | 2022 | LLM-generated synthetic queries for training |
| Matryoshka (Kusupati et al.) | 2022 | Multi-resolution embeddings from a single model |
| GritLM (Muennighoff et al.) | 2024 | Unified generative + embedding model |
| E5-Mistral (Wang et al.) | 2024 | LLM-based embeddings that rival specialized models |
| LLM2Vec (BehnamGhader et al.) | 2024 | Convert any decoder LLM into a strong embedding model |
Interviewers do not expect you to have memorized every paper. But for each, know:
DPR: Why in-batch negatives matter (cheap 1024 negatives from a batch of 1024). Why hard negatives from BM25 top-100 boost quality 10-15%. The architecture (two BERT-base encoders, one for query, one for passage).
ColBERT: The MaxSim operation and why it captures token-level interactions. The storage cost (one vector per token). ColBERTv2's residual compression that makes it practical.
SPLADE: How it bridges lexical and neural retrieval. The key idea (predict term weights for ALL vocabulary tokens, not just those present in the document). Compatibility with inverted indexes.
Matryoshka: The training trick (apply loss at multiple truncation points simultaneously). Why it enables adaptive-precision search (coarse 32-d retrieval → fine 768-d reranking from the same model).
GritLM: The unification thesis (one model for both embedding and generation). Why this matters for the convergence of search and GenAI. Current limitations (not yet competitive with specialized models at billion-scale retrieval).
Open with the architecture. When asked any system design question about search, start by drawing the 5-stage pipeline on the whiteboard (query understanding → retrieval → features → reranking → assembly). Then drill into whichever stage the interviewer asks about. This shows you have the full picture and can zoom in anywhere.
Always quantify. "It's fast" means nothing. "BM25 over 64 shards returns top-1000 in 12ms p99" shows you have built real systems. Memorize the key numbers from the table in Chapter 0.
Discuss tradeoffs, not solutions. Staff-level engineers are expected to navigate tradeoffs, not just implement solutions. "Cross-encoder is the best reranker" is junior-level. "Cross-encoder gives the best NDCG but costs 40ms per query for 100 candidates; ColBERT gives 90% of the quality at 40% of the latency; LambdaMART is near-free but requires manual feature engineering — I would start with LambdaMART and add cross-encoder only for high-value verticals" is staff-level.
Connect to the product. Technical excellence is necessary but not sufficient at the staff level. Always connect your technical decisions to user outcomes: "This 2% NDCG improvement translates to roughly 500K fewer reformulation events per day, meaning half a million users per day find what they need on the first try instead of having to rephrase their query."
Show the debugging mindset. When presented with a problem, do not jump to solutions. First, gather data: "What changed recently? What do the metrics show? Can we segment by query type, user type, device, geography?" Then form hypotheses. Then test them. This systematic approach is what separates staff engineers from senior engineers.
Self-assess your readiness across the five interview dimensions. Click a dimension to review the relevant chapters.
Search and recommendations are the most impactful ML systems in the world. Google Search serves 8.5 billion queries per day. YouTube's recommendation system accounts for 70% of watch time. Amazon's "customers also bought" drives 35% of revenue. The models are not the most sophisticated in ML (no diffusion, no RLHF, no chain-of-thought reasoning). But they operate at a scale and latency constraint that makes them arguably the hardest engineering problem in AI. You are not just building a model — you are building the infrastructure that connects billions of people with the information and products they need. That is a profoundly meaningful mission, and the interview is your chance to show you can execute it at the highest level.
"We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run." — Roy Amara
This lesson covers the full scope of a staff-level web-scale search and recommendations role. The concepts, architectures, code, debugging strategies, and frontier research presented here represent the state of the art as of 2025. For the latest developments, follow SIGIR, WWW, RecSys, NeurIPS IR workshops, and the research blogs of Google, Meta, and Microsoft.
Related Gleams lessons: Transformers for attention mechanics, CLIP / Contrastive Learning for embedding training, Agentic Engineer for agent-based search systems.