How to find the 10 most similar embeddings among a million — in under 50 ms.
You've built a document search engine. You embedded 1,000,000 product descriptions using a language model — each document is now a list of 768 floating-point numbers. A user types a query, you embed that query too, and you want to return the 10 most similar products.
The obvious approach: compare the query's vector to every one of the million vectors, compute a similarity score for each, then sort and return the top 10. That's called brute-force search, or exact nearest-neighbor search.
It works perfectly when you have 1,000 documents. At 1,000,000 documents? You're computing 1,000,000 dot products, each over 768 dimensions. That's 768 million multiplications per query. On a modern CPU, this takes ~500 ms. Your users wanted results in under 50 ms.
This is what a vector database solves. It's an index structure that lets you find approximate nearest neighbors in O(log n) or even O(1) time, trading a tiny bit of recall for massive speedup. The rest of this lesson walks through every major approach — how they work, why they're fast, and what they sacrifice.
First, let's visualize the brute-force problem. Below are 200 random 2D points (imagine these are 768-dim embeddings squashed to 2D). The query is the orange dot. Brute force must check all gray lines before returning the winners.
Click anywhere to set the query point. Watch how every single point must be examined.
Before we optimize anything, we need to know exactly what we're optimizing. Brute-force search is perfect — it always returns the true nearest neighbors, with zero error. It's also the benchmark that every approximate method is measured against.
The most common similarity measure for embedding vectors is cosine similarity: how aligned are two vectors in direction? Two vectors pointing in the same direction score 1.0; perpendicular vectors score 0; opposite directions score -1. For normalized vectors, cosine similarity equals the dot product.
FAISS (Facebook AI Similarity Search) is the gold standard library for this. Here's the complete brute-force pipeline — it truly is this simple:
python import faiss import numpy as np d = 768 # dimension (e.g. BERT embedding size) n = 1_000_000 # number of documents # Build the index — FlatL2 = exact Euclidean distance index = faiss.IndexFlatL2(d) # Add all your vectors (must be float32, shape [n, d]) vectors = np.random.randn(n, d).astype(np.float32) faiss.normalize_L2(vectors) # convert L2 to cosine similarity index.add(vectors) # O(n·d) but only done once # Query — returns k nearest neighbors for each query query = np.random.randn(1, d).astype(np.float32) faiss.normalize_L2(query) distances, indices = index.search(query, k=10) # O(n·d) per query # indices[0] = [idx1, idx2, ...idx10] — the top-10 results # distances[0] = [0.02, 0.05, ...] — their distances
The performance numbers at scale are unambiguous:
| Dataset Size | Brute Force Time | Usable? |
|---|---|---|
| 10K vectors | ~0.5 ms | Yes |
| 100K vectors | ~5 ms | Yes |
| 1M vectors | ~500 ms | Marginal |
| 10M vectors | ~5 s | No |
| 1B vectors | ~8 min | Never |
Each bar shows the simulated query time as dataset size grows. The relationship is perfectly linear.
Here's the key insight behind Inverted File Index (IVF): if your vectors cluster in semantic space (and they do — similar documents land near each other), you don't need to search all clusters. Only search the ones near your query.
The build phase is a one-time k-means clustering over all your vectors. Say you create 1000 clusters (called centroids). Each vector is assigned to its nearest centroid and stored in that centroid's "posting list." Think of it like a library sorted by genre — when you want a mystery novel, you go straight to the mystery section, not every shelf.
The critical parameter is nprobe: how many clusters to search. With nprobe=1 you're lightning fast but may miss true neighbors. With nprobe=nlist you've reproduced brute force. The art is finding the sweet spot.
python import faiss d = 768 nlist = 1000 # number of clusters (√n is a good rule of thumb) # Build quantizer = faiss.IndexFlatL2(d) # find nearest centroid index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2) index.train(vectors) # k-means: O(n · d · nlist · iterations) index.add(vectors) # assign each vector to nearest centroid # Query: search only 50 of the 1000 clusters index.nprobe = 50 # tune this: higher = better recall, slower distances, indices = index.search(query, k=10) # Speed: ~50x faster than brute force (50/1000 clusters searched) # Recall: ~90-95% at nprobe=50 (some true top-10 may be missed)
Adjust nprobe to see which clusters get searched (highlighted). More clusters = better recall, slower query.
IVF works well, but it has a problem: during build, vectors on the boundary of a cluster may have their true neighbors in the adjacent cluster. If that cluster isn't in your nprobe set, you miss them. What if, instead of clusters, you built a network of shortcuts — a graph where you could navigate from anywhere to anywhere in logarithmic steps?
That's Hierarchical Navigable Small World (HNSW). The name is a mouthful, but the idea is elegant. Think of it like a skip list for vectors:
Building HNSW:
Querying HNSW: start at the top level's entry point. Greedily move to your nearest neighbor. Drop to the next level. Repeat. At level 0, do a beam search with ef candidates. Return the top-k.
python import faiss d = 768 M = 32 # connections per node (higher = better recall, more RAM) ef_construction = 200 # search width during build (higher = better index, slower build) index = faiss.IndexHNSWFlat(d, M) index.hnsw.efConstruction = ef_construction # Build: O(n · M · log(n)) — much slower than IVF build index.add(vectors) # no train() needed # Query: adjust ef for recall vs speed index.hnsw.efSearch = 64 # beam width at query time distances, indices = index.search(query, k=10) # Recall: ~99% at efSearch=64; Speed: sub-millisecond per query
| Property | IVF | HNSW |
|---|---|---|
| Build time | Fast (k-means) | Slow (graph construction) |
| Query time | Fast | Very fast |
| Recall | Good (nprobe-tunable) | Excellent (graph structure) |
| Memory | Low (just stores vectors + lists) | Higher (stores graph edges) |
| Supports deletes? | No (requires rebuild) | No (marks as deleted) |
Watch how the search starts at the top layer (few nodes, long jumps) and refines at the bottom (many nodes, local search). Click "Run Query" to animate a search.
HNSW and IVF solve the speed problem by reducing how many vectors you compare. But they don't solve the memory problem. A million 768-dim float32 vectors takes 768 × 4 × 1,000,000 = 3 GB of RAM. That's just the vectors — the graph edges in HNSW can add another 2-5 GB.
Product Quantization (PQ) is the compression layer. It reduces each 768-dim vector to ~64 bytes (down from 3072 bytes), a 48× compression, with only a small loss in recall accuracy.
Here's how it works. Split the 768-dim vector into M subvectors of d/M dimensions each. For each subspace, train a small k-means codebook with K codewords (typically K=256). Replace each subvector with the index of its nearest codeword (1 byte per subspace, since 2⁸=256).
Distance computation becomes a table lookup. You precompute the distance from the query to every codeword in every subspace (M × K lookups), then for any database vector, its approximate distance is just M table lookups and M additions. Massively faster than a full dot product over 768 dims.
python import faiss d = 768 M = 12 # number of subvectors; d must be divisible by M nbits = 8 # 8 bits = 256 codewords per subspace # PQ-only index (good for memory, not as fast as HNSW+PQ) index = faiss.IndexPQ(d, M, nbits) index.train(vectors) # learns M × 256 codewords (needs ~50K samples) index.add(vectors) # compresses each vector to M bytes distances, indices = index.search(query, k=10) # Memory: n × M bytes = 1M × 12 = 12 MB (vs 3 GB for FlatL2) # Recall: ~85-90% at top-10 with these settings
A 2D vector gets split into subvectors and snapped to the nearest codebook centroid. The error between original and reconstructed is the quantization loss.
IVF reduces how many vectors you compare. PQ reduces how expensive each comparison is. Combine them and you get the two most important indexes in production: IVF+PQ and HNSW+PQ.
IVF+PQ (also written IVFPQ) is the workhorse of billion-scale search. You cluster with IVF to narrow to a few posting lists, then use PQ's lookup table trick to score those vectors cheaply. The combined speedup is multiplicative: if IVF eliminates 99% of vectors and PQ makes each remaining comparison 50× faster, you're 5000× faster than brute force.
IndexIVFPQ = IVF + PQ. IndexHNSWPQ = HNSW + PQ. The factory string system lets you describe arbitrarily complex indexes as a readable recipe.
python import faiss d = 768 # Method 1: explicit construction nlist = 4096 # clusters M = 16 # PQ subvectors (each 768/16 = 48 dims) nbits = 8 # 256 codewords quantizer = faiss.IndexFlatL2(d) index = faiss.IndexIVFPQ(quantizer, d, nlist, M, nbits) index.train(training_vectors) # needs ~100K samples minimum index.add(vectors) index.nprobe = 64 # tune: try 16, 32, 64, 128 distances, indices = index.search(query, k=10) # Method 2: index factory string — same result index2 = faiss.index_factory(d, "IVF4096,PQ16") # HNSW + PQ: great recall but harder to tune index3 = faiss.index_factory(d, "HNSW32,PQ16") # With flat reranking (rescore top-4k with exact distances): index4 = faiss.index_factory(d, "IVF4096,PQ16,RFlat")
| Index | Recall@10 | QPS (1M, 1 thread) | RAM |
|---|---|---|---|
| FlatL2 (brute force) | 100% | ~2 | 3 GB |
| IVF4096,Flat | ~95% | ~400 | 3 GB |
| IVF4096,PQ16 | ~85% | ~2,000 | 50 MB |
| HNSW32 | ~99% | ~1,000 | 5 GB |
| HNSW32,PQ16 | ~95% | ~800 | 250 MB |
Compare how each index type occupies the recall-latency space. Higher and left is better.
FAISS is a C++ library — you manage your own memory, write your own serialization, and handle updates manually. A vector database wraps these indexes in a production-grade system that handles persistence, filtering, replication, and APIs.
The key capability that separates a vector DB from a raw FAISS index is metadata filtering. Your query isn't just "find the 10 nearest vectors" — it's "find the 10 nearest vectors where product_category='shoes' AND price < 100 AND in_stock=true." Vector DBs combine ANN search with structured filters.
The major players:
| System | Backend Index | Filtering | Best For |
|---|---|---|---|
| Pinecone | Proprietary (HNSW-based) | Pre-filter | Fully managed, easy start |
| Weaviate | HNSW | Pre-filter + post | Hybrid BM25+vector search |
| Chroma | HNSW (hnswlib) | Post-filter | Local dev, RAG prototypes |
| Qdrant | HNSW | Pre-filter | Payload filtering, Rust perf |
| pgvector | IVF or HNSW | PostgreSQL WHERE | Existing Postgres users |
| Milvus | IVF, HNSW, DiskANN | Pre-filter | High-throughput, cloud-native |
Hybrid search is the other key feature: combining sparse keyword matching (BM25/TF-IDF) with dense vector similarity. A query like "apple iPhone battery replacement" benefits from both: the BM25 component exact-matches "iPhone" and "battery," while the vector component captures semantic similarity to related product descriptions. Results are merged using Reciprocal Rank Fusion (RRF) or a weighted score.
python # Chroma — the simplest way to start import chromadb client = chromadb.Client() collection = client.create_collection("products") # Insert: pass raw documents, Chroma handles embedding collection.add( documents=["Red running shoes for marathon", "Nike Air Max 90"], metadatas=[{"category": "shoes", "price": 89}, {"category": "shoes", "price": 130}], ids=["doc1", "doc2"] ) # Query with metadata filter results = collection.query( query_texts=["comfortable running shoes under $100"], n_results=5, where={"price": {"$lt": 100}} # metadata filter )
Getting an ANN index to work on a laptop is the easy part. Running it reliably at scale is where the real engineering begins.
Never add vectors one at a time. FAISS's index.add() is not thread-safe, and IVF indexes need to be retrained if the data distribution shifts significantly. The standard pattern: accumulate updates in a buffer (Redis queue, S3 staging area), then bulk-ingest in off-peak batches. For HNSW you can add online, but build time grows super-linearly — pre-build offline and serve the frozen index.
FAISS doesn't support deletes — there's no "remove vector 47382" operation. Vector DBs handle this in two ways: soft deletes (mark as deleted in metadata, filter from results) and periodic reindex (rebuild the index nightly from the canonical database, swapping the live index atomically). Most production systems use both.
ANN recall degrades silently as your data distribution shifts. You add new product categories, your users start querying differently, your embedding model gets updated. The fix: hold out a small ground-truth evaluation set (1,000 queries with known true neighbors), and run it against your live index daily. Alert if recall drops below threshold.
python def measure_recall(index, ground_truth_queries, true_neighbors, k=10): """Recall@k: fraction of true top-k that appear in ANN top-k""" _, ann_results = index.search(ground_truth_queries, k) recalls = [] for ann, true in zip(ann_results, true_neighbors): hits = len(set(ann) & set(true[:k])) recalls.append(hits / k) return np.mean(recalls) # Run daily, alert if recall drops below 0.90 recall = measure_recall(index, eval_queries, gt_neighbors) if recall < 0.90: alert(f"Recall dropped to {recall:.2%} — consider retuning nprobe")
In a SaaS product, each customer's data must be isolated — a search for customer A must never return customer B's documents. The two approaches: namespace isolation (one index per tenant — simple but doesn't scale past a few thousand tenants) and metadata partitioning (one shared index with tenant_id as a mandatory pre-filter). The metadata approach scales better but requires the vector DB to support efficient pre-filtering.
Simulate recall over time as data distribution shifts. See how a reindex event (↑) restores recall.
Everything in one place. Insert 2D points into a dataset, build an IVF or HNSW index over them, then issue a query and watch the search path unfold. You'll see exactly which points get examined — and which get skipped.
Click the canvas to add data points. Then build an index and query it. Watch the search process unfold in real time.
Simulates how recall and query time change as you increase nprobe. The green "sweet spot" zone is where most production systems operate.
Vector databases don't exist in isolation. They're the retrieval engine inside a larger pipeline. Here's how everything connects:
| Decision | Choice | Why |
|---|---|---|
| Chunk size | 256-512 tokens | Smaller = more precise retrieval; larger = more context per chunk |
| Overlap | 10-20% of chunk | Prevents losing context at chunk boundaries |
| Embedding model | Domain-specific | General models miss jargon; fine-tune on your data |
| Index type | HNSW for <10M, IVF+PQ for >10M | Recall vs memory tradeoff |
| Reranking | Cross-encoder on top-50 | ANN recall isn't semantic relevance; reranker refines |
Vector databases are an active research area. DiskANN (Microsoft) extends HNSW to disk, enabling 100B-scale indexes with SSD. ScaNN (Google) uses asymmetric quantization for better recall at equal compression. SPANN combines IVF's cluster structure with graph navigation for billion-scale on a single machine. The core ideas — clustering, graph navigation, quantization — remain the same; the engineering gets more sophisticated.
The other frontier is multi-modal vector search: embedding text, images, audio, and video into a shared space (CLIP-style models) and searching across modalities. "Find images similar to this text description" or "find videos similar to this audio clip" — same indexes, different embeddings.