ML Systems · Information Retrieval

Vector Databases

How to find the 10 most similar embeddings among a million — in under 50 ms.

Prerequisites: Basic Python + what a vector is (a list of numbers). That's it.
10
Chapters
8+
Simulations
0
Assumed Math

Chapter 0: Why Do We Need This?

You've built a document search engine. You embedded 1,000,000 product descriptions using a language model — each document is now a list of 768 floating-point numbers. A user types a query, you embed that query too, and you want to return the 10 most similar products.

The obvious approach: compare the query's vector to every one of the million vectors, compute a similarity score for each, then sort and return the top 10. That's called brute-force search, or exact nearest-neighbor search.

It works perfectly when you have 1,000 documents. At 1,000,000 documents? You're computing 1,000,000 dot products, each over 768 dimensions. That's 768 million multiplications per query. On a modern CPU, this takes ~500 ms. Your users wanted results in under 50 ms.

The core problem: brute-force nearest-neighbor search is O(n · d) — linear in the number of vectors and the number of dimensions. Neither n nor d is small in practice. We need a way to skip most comparisons.

This is what a vector database solves. It's an index structure that lets you find approximate nearest neighbors in O(log n) or even O(1) time, trading a tiny bit of recall for massive speedup. The rest of this lesson walks through every major approach — how they work, why they're fast, and what they sacrifice.

First, let's visualize the brute-force problem. Below are 200 random 2D points (imagine these are 768-dim embeddings squashed to 2D). The query is the orange dot. Brute force must check all gray lines before returning the winners.

Brute Force: Every Distance Must Be Checked

Click anywhere to set the query point. Watch how every single point must be examined.

Points 120 Top-K 5
If brute-force takes 500 ms for 1M vectors, and you double the dataset to 2M vectors, how long will it take?

Chapter 1: Brute Force — The Honest Baseline

Before we optimize anything, we need to know exactly what we're optimizing. Brute-force search is perfect — it always returns the true nearest neighbors, with zero error. It's also the benchmark that every approximate method is measured against.

The most common similarity measure for embedding vectors is cosine similarity: how aligned are two vectors in direction? Two vectors pointing in the same direction score 1.0; perpendicular vectors score 0; opposite directions score -1. For normalized vectors, cosine similarity equals the dot product.

sim(q, v) = ½ (q · v) / (|q| · |v|)

For normalized vectors (|q| = |v| = 1):   sim(q, v) = q · v = ∑i qi · vi

FAISS (Facebook AI Similarity Search) is the gold standard library for this. Here's the complete brute-force pipeline — it truly is this simple:

python
import faiss
import numpy as np

d = 768           # dimension (e.g. BERT embedding size)
n = 1_000_000    # number of documents

# Build the index — FlatL2 = exact Euclidean distance
index = faiss.IndexFlatL2(d)

# Add all your vectors (must be float32, shape [n, d])
vectors = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(vectors)   # convert L2 to cosine similarity
index.add(vectors)            # O(n·d) but only done once

# Query — returns k nearest neighbors for each query
query = np.random.randn(1, d).astype(np.float32)
faiss.normalize_L2(query)
distances, indices = index.search(query, k=10)  # O(n·d) per query

# indices[0] = [idx1, idx2, ...idx10] — the top-10 results
# distances[0] = [0.02, 0.05, ...] — their distances
When is brute force fine? Under ~10,000 vectors, brute force runs in under 5 ms and requires zero engineering. Don't add complexity until you actually have a scaling problem. A few hundred thousand vectors is also viable with GPU acceleration.

The performance numbers at scale are unambiguous:

Dataset SizeBrute Force TimeUsable?
10K vectors~0.5 msYes
100K vectors~5 msYes
1M vectors~500 msMarginal
10M vectors~5 sNo
1B vectors~8 minNever
Linear Scaling Visualized

Each bar shows the simulated query time as dataset size grows. The relationship is perfectly linear.

You have 500K normalized vectors and use IndexFlatL2. You then switch to IndexFlatIP (inner product). How does query time change?

Chapter 2: IVF — Cluster First, Search Second

Here's the key insight behind Inverted File Index (IVF): if your vectors cluster in semantic space (and they do — similar documents land near each other), you don't need to search all clusters. Only search the ones near your query.

The build phase is a one-time k-means clustering over all your vectors. Say you create 1000 clusters (called centroids). Each vector is assigned to its nearest centroid and stored in that centroid's "posting list." Think of it like a library sorted by genre — when you want a mystery novel, you go straight to the mystery section, not every shelf.

Build
k-means cluster all n vectors into nlist centroids. Each vector stored in one posting list.
Query
Find the nprobe nearest centroids to the query. Only search those posting lists.
Return
Best k results from the searched posting lists. Some true neighbors may be missed.

The critical parameter is nprobe: how many clusters to search. With nprobe=1 you're lightning fast but may miss true neighbors. With nprobe=nlist you've reproduced brute force. The art is finding the sweet spot.

python
import faiss

d = 768
nlist = 1000    # number of clusters (√n is a good rule of thumb)

# Build
quantizer = faiss.IndexFlatL2(d)        # find nearest centroid
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
index.train(vectors)   # k-means: O(n · d · nlist · iterations)
index.add(vectors)     # assign each vector to nearest centroid

# Query: search only 50 of the 1000 clusters
index.nprobe = 50     # tune this: higher = better recall, slower
distances, indices = index.search(query, k=10)
# Speed: ~50x faster than brute force (50/1000 clusters searched)
# Recall: ~90-95% at nprobe=50 (some true top-10 may be missed)
The nprobe tradeoff: Every ANN index makes a recall-vs-latency bargain. With IVF, nprobe is the knob. In practice, nprobe=10-100 gives 90-99% recall at 10-100× speedup over brute force. The right value depends on your data's cluster structure.
IVF: Cluster Assignment + Search

Adjust nprobe to see which clusters get searched (highlighted). More clusters = better recall, slower query.

Clusters (nlist) 8 nprobe 2
You have 1M vectors and nlist=1000 clusters. With nprobe=100, roughly what fraction of vectors does each query examine?

Chapter 3: HNSW — The Graph That Navigates Itself

IVF works well, but it has a problem: during build, vectors on the boundary of a cluster may have their true neighbors in the adjacent cluster. If that cluster isn't in your nprobe set, you miss them. What if, instead of clusters, you built a network of shortcuts — a graph where you could navigate from anywhere to anywhere in logarithmic steps?

That's Hierarchical Navigable Small World (HNSW). The name is a mouthful, but the idea is elegant. Think of it like a skip list for vectors:

The skip-list analogy: A skip list has multiple layers. The top layer has few nodes but long "express lane" connections. The bottom layer has all nodes with short connections. You start fast at the top and refine locally at the bottom. HNSW does the same thing — but in high-dimensional space.

Building HNSW:

  1. Each new vector is randomly assigned to levels 0 through L (exponentially fewer vectors per level).
  2. At each level, it connects to its M nearest neighbors at that level.
  3. The resulting graph has "highway" connections at the top and "local" connections at the bottom.

Querying HNSW: start at the top level's entry point. Greedily move to your nearest neighbor. Drop to the next level. Repeat. At level 0, do a beam search with ef candidates. Return the top-k.

python
import faiss

d = 768
M = 32          # connections per node (higher = better recall, more RAM)
ef_construction = 200  # search width during build (higher = better index, slower build)

index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = ef_construction

# Build: O(n · M · log(n)) — much slower than IVF build
index.add(vectors)   # no train() needed

# Query: adjust ef for recall vs speed
index.hnsw.efSearch = 64   # beam width at query time
distances, indices = index.search(query, k=10)
# Recall: ~99% at efSearch=64; Speed: sub-millisecond per query
PropertyIVFHNSW
Build timeFast (k-means)Slow (graph construction)
Query timeFastVery fast
RecallGood (nprobe-tunable)Excellent (graph structure)
MemoryLow (just stores vectors + lists)Higher (stores graph edges)
Supports deletes?No (requires rebuild)No (marks as deleted)
HNSW: Multi-Layer Graph Navigation

Watch how the search starts at the top layer (few nodes, long jumps) and refines at the bottom (many nodes, local search). Click "Run Query" to animate a search.

HNSW builds a multi-layer graph. Why does the top layer have far fewer nodes than layer 0?

Chapter 4: Product Quantization — Shrink the Vectors

HNSW and IVF solve the speed problem by reducing how many vectors you compare. But they don't solve the memory problem. A million 768-dim float32 vectors takes 768 × 4 × 1,000,000 = 3 GB of RAM. That's just the vectors — the graph edges in HNSW can add another 2-5 GB.

Product Quantization (PQ) is the compression layer. It reduces each 768-dim vector to ~64 bytes (down from 3072 bytes), a 48× compression, with only a small loss in recall accuracy.

Here's how it works. Split the 768-dim vector into M subvectors of d/M dimensions each. For each subspace, train a small k-means codebook with K codewords (typically K=256). Replace each subvector with the index of its nearest codeword (1 byte per subspace, since 2⁸=256).

Original: v ∈ ℝ768 — 3072 bytes (float32)

Split into M=12 subvectors ∈ ℝ64 each

Each subvector → 1 byte index into codebook of K=256 codewords

Compressed: 12 bytes per vector — 256× compression

Distance computation becomes a table lookup. You precompute the distance from the query to every codeword in every subspace (M × K lookups), then for any database vector, its approximate distance is just M table lookups and M additions. Massively faster than a full dot product over 768 dims.

python
import faiss

d = 768
M = 12      # number of subvectors; d must be divisible by M
nbits = 8   # 8 bits = 256 codewords per subspace

# PQ-only index (good for memory, not as fast as HNSW+PQ)
index = faiss.IndexPQ(d, M, nbits)
index.train(vectors)   # learns M × 256 codewords (needs ~50K samples)
index.add(vectors)     # compresses each vector to M bytes

distances, indices = index.search(query, k=10)
# Memory: n × M bytes = 1M × 12 = 12 MB (vs 3 GB for FlatL2)
# Recall: ~85-90% at top-10 with these settings
The compression math: 12 subvectors × 1 byte each = 12 bytes per vector. At 1M vectors: 12 MB total. Compare to 3 GB for float32. You can hold 250× more vectors in the same RAM budget — enabling billion-scale search on a single machine.
PQ Compression: Subvector Quantization

A 2D vector gets split into subvectors and snapped to the nearest codebook centroid. The error between original and reconstructed is the quantization loss.

Codebook size K 4
You use PQ with M=16 subspaces and nbits=8 (256 codewords per subspace). How many bytes does each compressed vector use?

Chapter 5: Hybrid Indexes — Combining the Tricks

IVF reduces how many vectors you compare. PQ reduces how expensive each comparison is. Combine them and you get the two most important indexes in production: IVF+PQ and HNSW+PQ.

IVF+PQ (also written IVFPQ) is the workhorse of billion-scale search. You cluster with IVF to narrow to a few posting lists, then use PQ's lookup table trick to score those vectors cheaply. The combined speedup is multiplicative: if IVF eliminates 99% of vectors and PQ makes each remaining comparison 50× faster, you're 5000× faster than brute force.

FAISS Index Factory strings are the shorthand way to compose these. IndexIVFPQ = IVF + PQ. IndexHNSWPQ = HNSW + PQ. The factory string system lets you describe arbitrarily complex indexes as a readable recipe.
python
import faiss

d = 768

# Method 1: explicit construction
nlist = 4096    # clusters
M = 16          # PQ subvectors (each 768/16 = 48 dims)
nbits = 8       # 256 codewords

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, M, nbits)
index.train(training_vectors)    # needs ~100K samples minimum
index.add(vectors)

index.nprobe = 64    # tune: try 16, 32, 64, 128
distances, indices = index.search(query, k=10)

# Method 2: index factory string — same result
index2 = faiss.index_factory(d, "IVF4096,PQ16")

# HNSW + PQ: great recall but harder to tune
index3 = faiss.index_factory(d, "HNSW32,PQ16")

# With flat reranking (rescore top-4k with exact distances):
index4 = faiss.index_factory(d, "IVF4096,PQ16,RFlat")
IndexRecall@10QPS (1M, 1 thread)RAM
FlatL2 (brute force)100%~23 GB
IVF4096,Flat~95%~4003 GB
IVF4096,PQ16~85%~2,00050 MB
HNSW32~99%~1,0005 GB
HNSW32,PQ16~95%~800250 MB
Choosing an index: If recall matters most → HNSW. If memory matters most → IVF+PQ. If you need both → IVF+PQ with flat reranking (store full vectors separately, use PQ for fast candidate generation, then exact rescore the top 4K). This is what most production systems do.
Recall vs Latency Tradeoff

Compare how each index type occupies the recall-latency space. Higher and left is better.

The FAISS index string "IVF4096,PQ16" means:

Chapter 6: The DB Layer — Beyond Raw Indexes

FAISS is a C++ library — you manage your own memory, write your own serialization, and handle updates manually. A vector database wraps these indexes in a production-grade system that handles persistence, filtering, replication, and APIs.

The key capability that separates a vector DB from a raw FAISS index is metadata filtering. Your query isn't just "find the 10 nearest vectors" — it's "find the 10 nearest vectors where product_category='shoes' AND price < 100 AND in_stock=true." Vector DBs combine ANN search with structured filters.

Pre-filter vs Post-filter: Post-filtering (search ANN, then apply filter) fails when the filter is selective — if only 1% of vectors pass the filter, you need to retrieve 1000 candidates to get 10 valid results. Pre-filtering (apply filter first, then search only the matching subset) requires the index to support filtered search natively, which is architecturally harder.

The major players:

SystemBackend IndexFilteringBest For
PineconeProprietary (HNSW-based)Pre-filterFully managed, easy start
WeaviateHNSWPre-filter + postHybrid BM25+vector search
ChromaHNSW (hnswlib)Post-filterLocal dev, RAG prototypes
QdrantHNSWPre-filterPayload filtering, Rust perf
pgvectorIVF or HNSWPostgreSQL WHEREExisting Postgres users
MilvusIVF, HNSW, DiskANNPre-filterHigh-throughput, cloud-native

Hybrid search is the other key feature: combining sparse keyword matching (BM25/TF-IDF) with dense vector similarity. A query like "apple iPhone battery replacement" benefits from both: the BM25 component exact-matches "iPhone" and "battery," while the vector component captures semantic similarity to related product descriptions. Results are merged using Reciprocal Rank Fusion (RRF) or a weighted score.

python
# Chroma — the simplest way to start
import chromadb

client = chromadb.Client()
collection = client.create_collection("products")

# Insert: pass raw documents, Chroma handles embedding
collection.add(
    documents=["Red running shoes for marathon", "Nike Air Max 90"],
    metadatas=[{"category": "shoes", "price": 89}, {"category": "shoes", "price": 130}],
    ids=["doc1", "doc2"]
)

# Query with metadata filter
results = collection.query(
    query_texts=["comfortable running shoes under $100"],
    n_results=5,
    where={"price": {"$lt": 100}}   # metadata filter
)
You search for top-10 results with a filter that only 0.1% of vectors pass (1,000 out of 1M). With post-filtering ANN retrieving 1000 candidates, what's the problem?

Chapter 7: Production Patterns

Getting an ANN index to work on a laptop is the easy part. Running it reliably at scale is where the real engineering begins.

Batch Ingestion

Never add vectors one at a time. FAISS's index.add() is not thread-safe, and IVF indexes need to be retrained if the data distribution shifts significantly. The standard pattern: accumulate updates in a buffer (Redis queue, S3 staging area), then bulk-ingest in off-peak batches. For HNSW you can add online, but build time grows super-linearly — pre-build offline and serve the frozen index.

Handling Updates and Deletes

FAISS doesn't support deletes — there's no "remove vector 47382" operation. Vector DBs handle this in two ways: soft deletes (mark as deleted in metadata, filter from results) and periodic reindex (rebuild the index nightly from the canonical database, swapping the live index atomically). Most production systems use both.

Monitoring Recall

ANN recall degrades silently as your data distribution shifts. You add new product categories, your users start querying differently, your embedding model gets updated. The fix: hold out a small ground-truth evaluation set (1,000 queries with known true neighbors), and run it against your live index daily. Alert if recall drops below threshold.

python
def measure_recall(index, ground_truth_queries, true_neighbors, k=10):
    """Recall@k: fraction of true top-k that appear in ANN top-k"""
    _, ann_results = index.search(ground_truth_queries, k)
    recalls = []
    for ann, true in zip(ann_results, true_neighbors):
        hits = len(set(ann) & set(true[:k]))
        recalls.append(hits / k)
    return np.mean(recalls)

# Run daily, alert if recall drops below 0.90
recall = measure_recall(index, eval_queries, gt_neighbors)
if recall < 0.90:
    alert(f"Recall dropped to {recall:.2%} — consider retuning nprobe")

Multi-tenancy

In a SaaS product, each customer's data must be isolated — a search for customer A must never return customer B's documents. The two approaches: namespace isolation (one index per tenant — simple but doesn't scale past a few thousand tenants) and metadata partitioning (one shared index with tenant_id as a mandatory pre-filter). The metadata approach scales better but requires the vector DB to support efficient pre-filtering.

Recall Drift Monitoring

Simulate recall over time as data distribution shifts. See how a reindex event (↑) restores recall.

Your IVF index was trained when 80% of your data was about "electronics." Now, after a year, 60% of new data is about "fashion." What's the likely effect?

Chapter 8: Showcase — Build, Query, Compare

Everything in one place. Insert 2D points into a dataset, build an IVF or HNSW index over them, then issue a query and watch the search path unfold. You'll see exactly which points get examined — and which get skipped.

Interactive Vector Index Builder

Click the canvas to add data points. Then build an index and query it. Watch the search process unfold in real time.

0 points
K 5 nprobe 2
Recall vs Latency: IVF nprobe Sweep

Simulates how recall and query time change as you increase nprobe. The green "sweet spot" zone is where most production systems operate.

Highlight nprobe 20

Chapter 9: Connections — Where This Fits

Vector databases don't exist in isolation. They're the retrieval engine inside a larger pipeline. Here's how everything connects:

Embedding Model
Converts raw documents + queries into dense vectors (BERT, Sentence-Transformers, OpenAI Ada-002).
Vector Database
Stores and indexes embeddings. Retrieves top-k semantically similar documents given a query embedding.
LLM / Reranker
In RAG: LLM generates a response grounded in retrieved docs. Reranker (ColBERT, cross-encoder) refines the top-k before generation.
User
Receives a grounded, factual, up-to-date answer — even if the LLM never saw this document during training.
Why vector DBs matter for LLMs: LLMs are frozen at training time. A vector database lets you inject fresh knowledge at query time — product catalogs, recent news, internal docs — without retraining the model. This is RAG (Retrieval-Augmented Generation), and vector databases are its backbone.

Key Engineering Decisions in RAG

DecisionChoiceWhy
Chunk size256-512 tokensSmaller = more precise retrieval; larger = more context per chunk
Overlap10-20% of chunkPrevents losing context at chunk boundaries
Embedding modelDomain-specificGeneral models miss jargon; fine-tune on your data
Index typeHNSW for <10M, IVF+PQ for >10MRecall vs memory tradeoff
RerankingCross-encoder on top-50ANN recall isn't semantic relevance; reranker refines

What's Next

Vector databases are an active research area. DiskANN (Microsoft) extends HNSW to disk, enabling 100B-scale indexes with SSD. ScaNN (Google) uses asymmetric quantization for better recall at equal compression. SPANN combines IVF's cluster structure with graph navigation for billion-scale on a single machine. The core ideas — clustering, graph navigation, quantization — remain the same; the engineering gets more sophisticated.

The other frontier is multi-modal vector search: embedding text, images, audio, and video into a shared space (CLIP-style models) and searching across modalities. "Find images similar to this text description" or "find videos similar to this audio clip" — same indexes, different embeddings.

The philosophical point: A vector database turns the unstructured world — text, images, audio, code — into something SQL-queryable. The embedding model is the translation layer. The ANN index is the acceleration layer. Together, they're what lets modern AI systems retrieve relevant knowledge faster than any human could.
Related Lessons
Attention & Transformers GPT Architecture CLIP & Contrastive Learning
"The purpose of computing is insight, not numbers."
— Richard Hamming