Vector Embeddings — From Absolute Zero to Mastery

Chapter 0: Why Do We Need This?

You run an online bookstore. A customer searches for "gripping thriller with a twist ending." Your database has ten thousand books. How do you find the right ones?

The naive approach: scan every book title and description for those exact words. But what if a perfect match is described as "a nail-biting mystery that subverts expectations"? Exact keyword search misses it entirely. The meaning is there — the words aren't.

The same problem appears everywhere. A doctor searches for similar patient cases. A music app finds songs that sound like this one. An e-commerce site shows products like what you just viewed. A fraud system flags transactions similar to known fraud. In every case, you need to measure semantic similarity — closeness of meaning, not closeness of spelling.

The core problem: computers only understand numbers. Text, images, and audio are not numbers. To find "similar" things, we need a way to represent everything as numbers — but in a way that preserves meaning. Similar things should become similar numbers. That's what embeddings do.

The Similarity Problem, Visualized

[Interactive demo: Semantic Search. Eight items are placed in a space so that similar items cluster together; selecting an item highlights its nearest neighbors. This is what embeddings enable.]

Key insight: if we could place every item at a point in space such that similar items end up near each other, then "find similar things" becomes "find nearby points." That's a solved problem. Embeddings are how we create that space.

What problem do embeddings solve?

Making computers run faster
Compressing text files to save storage
Representing items as numbers that preserve semantic similarity

Chapter 1: From Words to Numbers

Let's start with the most obvious approach: just assign each word a number. "apple" = 1, "banana" = 2, "cherry" = 3. Done, right?

No. This creates a lie. The number 1 is close to 2 but far from 100. But "apple" isn't closer to "banana" than it is to "mango." The numbers imply a fake ordering and fake distances. The model would hallucinate relationships that don't exist.

One-Hot Encoding: The Honest Baseline

The fix is one-hot encoding: give each word its own dedicated slot. With a vocabulary of N words, each word becomes a vector of N zeros with a single 1 in its slot.

[Interactive demo: One-Hot Encoder. Selecting a word shows its one-hot vector.] Notice: every pair of words is exactly the same distance apart.

One-hot encoding is honest — it doesn't imply false similarity. But it has a crippling flaw: every word is equally far from every other word. The distance between "king" and "queen" is identical to the distance between "king" and "banana." The representation has no knowledge of meaning.
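
To make the "equally far" claim concrete, here is a minimal sketch with a four-word toy vocabulary (the word list is an assumption chosen for illustration). It builds one-hot vectors and measures two distances that intuition says should differ.

```python
import math

# Toy vocabulary, invented for illustration.
vocab = ["king", "queen", "banana", "apple"]

def one_hot(word):
    """Return a vector of len(vocab) zeros with a single 1 in the word's slot."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_king_queen = euclidean(one_hot("king"), one_hot("queen"))
d_king_banana = euclidean(one_hot("king"), one_hot("banana"))
print(d_king_queen == d_king_banana)  # True: both distances are sqrt(2)
```

Any two distinct one-hot vectors differ in exactly two slots, so the distance is always the square root of 2 no matter which words you pick.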

Bag of Words: Counting What's There

Bag of Words (BoW) represents a whole document as a count vector: how many times does each vocabulary word appear? "The cat sat on the mat" becomes a vector where "the" = 2, "cat" = 1, "sat" = 1, etc., and everything else = 0.

Documents with similar words get similar vectors. A cookbook and a recipe article will both have high counts for "cup," "tablespoon," "heat." This is progress — but word frequency isn't meaning. "Not good" and "good" have nearly identical BoW vectors, even though they're opposites.
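
A small sketch of the counting step, assuming a tiny hand-picked vocabulary and whitespace tokenization (real systems use a proper tokenizer). It reproduces the "the" = 2 example and shows how little "not" changes the vector.

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

# Hand-picked vocabulary for illustration only.
vocab = ["the", "cat", "sat", "on", "mat", "not", "good"]

doc_vec = bag_of_words("The cat sat on the mat", vocab)
print(doc_vec)  # [2, 1, 1, 1, 1, 0, 0]

good = bag_of_words("good", vocab)
not_good = bag_of_words("not good", vocab)
differing = sum(1 for g, n in zip(good, not_good) if g != n)
print(differing)  # 1: the two vectors differ in a single slot
```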

TF-IDF: Weighting by Importance

TF-IDF (Term Frequency–Inverse Document Frequency) improves BoW by down-weighting common words like "the" and up-weighting rare, distinctive words like "photosynthesis." A word that appears everywhere is uninformative; a word that appears only in certain documents is a strong signal.

TF-IDF(word, doc) = (count of word in doc) × log(total docs / docs containing word)

TF-IDF is widely used in search engines and works well for keyword retrieval. But it inherits the core problem: "happy" and "joyful" are treated as completely unrelated, because they're different words in different vocabulary slots.
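
The formula above translates almost line for line into code. This sketch uses a three-document toy corpus (an assumption for illustration) and the unsmoothed formula exactly as written; libraries often add smoothing terms, so their numbers can differ.

```python
import math

# Toy corpus, invented for illustration.
docs = [
    "the plant uses photosynthesis".split(),
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
]

def tf_idf(word, doc, docs):
    """(count of word in doc) x log(total docs / docs containing word)."""
    tf = doc.count(word)
    df = sum(1 for d in docs if word in d)
    if df == 0:
        return 0.0  # word never seen in the corpus
    return tf * math.log(len(docs) / df)

print(tf_idf("the", docs[1], docs))             # 0.0: "the" appears in every document
print(tf_idf("photosynthesis", docs[0], docs))  # log(3), about 1.0986
```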

Method	Similarity Captured	Fatal Flaw
Integer ID	None (fake ordering)	Implies false numeric distance
One-Hot	None (all equidistant)	No relationship between words
Bag of Words	Shared words	Misses synonyms, ignores order
TF-IDF	Distinctive shared words	Still no semantic understanding

The fundamental problem: all of these methods treat words as arbitrary symbols. "King," "queen," "monarch," and "ruler" are four completely separate things, even though they're semantically neighbors. We need a representation that learns relationships from how words are actually used.

Why does one-hot encoding fail to capture word similarity?

It uses too much memory
Every word vector is the same distance from every other word vector
It can't represent words with multiple meanings

Chapter 2: Dense Vectors

In 2013, a Google team published a paper with a remarkable result. They trained a neural network to predict words from their context. A side effect: the network's internal representation of each word turned out to be geometrically meaningful.

They called the method Word2Vec. The key discovery: in the learned vector space, you could do arithmetic with meaning.

vector("king") − vector("man") + vector("woman") ≈ vector("queen")

The vector from "man" to "king" encodes the concept of royalty. Add that same offset to "woman" and you land near "queen." This isn't magic; it's what happens when a model learns word vectors from the contexts in which words appear.
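
Here is the arithmetic on a toy example. The 2D vectors below are hand-made (one axis for royalty, one for gender) purely to illustrate the geometry; real Word2Vec vectors are learned from text and have hundreds of dimensions.

```python
import math

# Hand-made 2D vectors: [royalty, gender]. Values are invented for illustration.
vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the input words, as is conventional when evaluating analogies.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # queen
```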

How Training Creates Meaning

The training task is simple, and Word2Vec comes in two variants of it. Skip-gram: given a word, predict its neighbors. CBOW (continuous bag of words): given the neighbors, predict the missing word. Given "The ___ sat on the mat," predict "cat." Given "She is a ___ at Buckingham Palace," predict "queen." Words that appear in similar contexts receive similar gradient updates, so their vectors drift toward each other in vector space.

The distributional hypothesis: words that appear in similar contexts have similar meanings. "Dog" and "puppy" appear before "barked," "fetched," "leashed." Their contexts are similar, so their vectors become similar. No human labeled this — the model inferred it from raw text.
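
To see where the training signal comes from, this sketch generates the (word, neighbor) pairs that a "given a word, predict its neighbors" setup would train on. The window size and whitespace tokenization are assumptions for illustration; the neural network and its training loop are left out.

```python
def skipgram_pairs(tokens, window=2):
    """Collect (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split(), window=1)
print(pairs[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Two words that keep showing up with the same context words end up in many near-identical pairs, which is why their vectors are nudged in the same direction.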

The Embedding Table

Concretely, Word2Vec maintains an embedding table: a matrix of shape (vocabulary_size × embedding_dim). For a 50,000-word vocabulary with 300-dimensional vectors, that's a 50,000 × 300 matrix — 15 million numbers.

To "embed" a word, you look up its row. No computation, just indexing. Mathematically, the lookup is equivalent to multiplying a one-hot vector by the embedding matrix, but it is implemented as a table lookup for efficiency.
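
The link between indexing and one-hot multiplication can be checked on a tiny table. This is a sketch in plain Python with invented values; a framework does the same thing on much larger matrices.

```python
# Toy table: 4 words x 3 dimensions (values invented for illustration).
table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def lookup(word_id):
    """Embedding as indexing: just return the row."""
    return table[word_id]

def one_hot_matmul(word_id):
    """Embedding as algebra: one-hot row vector times the table."""
    one_hot = [1.0 if i == word_id else 0.0 for i in range(len(table))]
    dim = len(table[0])
    return [sum(one_hot[i] * table[i][d] for i in range(len(table))) for d in range(dim)]

assert lookup(2) == one_hot_matmul(2)
print(lookup(2))  # [0.7, 0.8, 0.9]
```

The multiplication touches every row only to discard all but one of them, which is why the lookup is the version used in practice.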

python
# A minimal embedding table in PyTorch
import torch
import torch.nn as nn

vocab_size = 50000
embed_dim  = 300

# The table: 50k words × 300 dimensions = 15M learnable numbers.
# It starts out random; the vectors only become meaningful after training.
embedding = nn.Embedding(vocab_size, embed_dim)

# Hypothetical word IDs; a real vocabulary maps each word to its own row index
KING, MAN, WOMAN = 42, 17, 29

# Look up word ID 42 ("king")
king_vec = embedding(torch.tensor([KING]))   # → tensor of shape [1, 300]

# The king-man+woman analogy (a pure lookup, so no gradients are needed)
with torch.no_grad():
    result = (
        embedding(torch.tensor([KING]))
        - embedding(torch.tensor([MAN]))
        + embedding(torch.tensor([WOMAN]))
    )
# With trained weights, result ≈ queen vector (find nearest neighbor to confirm)

Exploring the Vector Space

[Interactive demo: Word Vector Analogy Explorer. Simulated 2D word vectors show semantic structure. Arrows mark the "royalty" vector; adding it to "woman" lands near "queen."]

Embedding vectors have another property: the dimension of the embedding (300 for classic Word2Vec; 768 or 1536 for many newer models) controls how much information each vector can carry. A 2D embedding is easy to visualize but has only two independent axes along which to encode distinctions. A 768D embedding has room for hundreds of nuanced properties.

Word2Vec learns word vectors by training on which task?

Manually labeling word similarity scores
Classifying words into grammatical categories
Predicting a word from its surrounding context words

Chapter 3: Sentence & Document Embeddings

Word2Vec gives you a vector per word. But most real tasks need a vector per sentence or per document. "The bank is by the river" and "I deposited money at the bank" have the same word "bank" — but in completely different senses. You need the full sentence to get the right meaning.

Mean Pooling: Simple but Surprisingly Strong

The simplest approach: embed every word, then average the vectors. For a 5-word sentence, get 5 vectors and average them element-by-element. The result is one vector representing the whole sentence.

sentence_vec = (vec₁ + vec₂ + ... + vec_N) / N

This is called mean pooling. It works surprisingly well for topics and themes. But it loses word order completely — "dog bites man" and "man bites dog" get identical vectors.
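
A quick sketch shows both the averaging and the word-order blind spot. The three word vectors are invented for illustration; any word vectors would behave the same way.

```python
# Toy word vectors (values invented for illustration).
word_vecs = {
    "dog":   [1.0, 0.0, 2.0],
    "bites": [0.0, 3.0, 1.0],
    "man":   [2.0, 1.0, 0.0],
}

def mean_pool(sentence):
    """Average the word vectors element by element."""
    vecs = [word_vecs[w] for w in sentence.split()]
    n = len(vecs)
    return [sum(column) / n for column in zip(*vecs)]

a = mean_pool("dog bites man")
b = mean_pool("man bites dog")
print(a)       # [1.0, 1.3333333333333333, 1.0]
print(a == b)  # True: the average does not depend on word order
```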

The CLS Token: Let the Model Summarize

Transformer models like BERT add a special [CLS] token (short for "classification") at the start of every sentence. The model is trained so that this token's final hidden state summarizes the entire sentence. Rather than averaging word vectors, you just read off the CLS token's output.

Input

[CLS] "The bank is by the river" [SEP]

↓ Transformer (12 layers of attention)

Output

One vector per token, but [CLS] = sentence summary

↓ Read off CLS

Embedding

768-dimensional sentence vector

Bi-Encoders: Built for Similarity

BERT's CLS token wasn't originally trained to produce good similarity embeddings. For search and retrieval, we use bi-encoders (also called dual encoders): two identical encoder networks that independently embed a query and a document, then compare them with cosine similarity.

Training a bi-encoder: show it pairs of semantically similar sentences (positives) and dissimilar ones (negatives). Train it to pull positives close together and push negatives apart. This is the contrastive learning objective.
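
One common way to write such an objective down is a triplet margin loss, sketched below on toy 2D vectors. This is an illustrative formulation, not necessarily the loss any particular model uses; the margin value and the vectors are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is at least `margin` more similar than the negative."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

query    = [1.0, 0.0]
good_doc = [0.9, 0.1]   # points almost the same way as the query
bad_doc  = [0.0, 1.0]   # orthogonal to the query

print(triplet_loss(query, good_doc, bad_doc))  # 0.0: already well separated
print(triplet_loss(query, bad_doc, good_doc))  # positive: roles swapped, so there is something to fix
```

During training, the gradient of a loss like this is what moves the encoder's outputs: a nonzero loss means the positive is not yet clearly closer than the negative.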

Real Embedding Models

Model	Dimensions	Strengths
OpenAI text-embedding-3-small	1536	General purpose, fast, cheap
OpenAI text-embedding-3-large	3072	Highest quality, supports truncation
Cohere embed-v3	1024	Multilingual, task-specific types
BGE-large-en	1024	Open-source, strong benchmark scores
E5-mistral-7b	4096	LLM-based, best for complex queries

[Interactive demo: Sentence Similarity Visualizer. Compares how similar each sentence pair is under mean pooling versus a trained bi-encoder.]

Why bi-encoders for search: at query time, you need to compare one query against millions of documents instantly. With a bi-encoder, you pre-compute all document embeddings offline. At query time, embed only the query (fast), then find nearest neighbors (fast). This would be impractical with a cross-encoder, which sees both texts together and therefore has to run the full model once per query-document pair.

Why do bi-encoders pre-compute document embeddings offline?

Documents are too long to embed at query time
The encoder only works on documents, not queries
So that at query time you only need to embed the query, not millions of documents

Chapter 4: The Geometry of Meaning

You have two sentence vectors. How do you measure whether they're similar? The obvious answer is Euclidean distance — how far apart are the points? But for embeddings, a different measure works better: cosine similarity.

What Cosine Similarity Actually Measures

Cosine similarity measures the angle between two vectors, ignoring their lengths. Two vectors pointing in the same direction have cosine similarity 1.0. Perpendicular vectors: 0. Opposite directions: −1.0.

cos(A, B) = (A · B) / (|A| × |B|)

Why ignore length? Because the magnitude of an embedding often reflects text length, not semantic content. A long paragraph about cats and a short sentence about cats should be considered similar, even if their raw vectors have very different magnitudes.
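
The difference between the two measures shows up as soon as one vector is a scaled copy of another. The sketch below uses invented 2D vectors standing in for a short and a long text on the same topic.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

short_text = [1.0, 2.0]
long_text  = [10.0, 20.0]   # same direction, 10x the length

cos = cosine(short_text, long_text)
dist = math.dist(short_text, long_text)
print(round(cos, 6))   # 1.0: same direction, so maximally similar
print(round(dist, 2))  # 20.12: Euclidean distance sees them as far apart
```

If the vectors are normalized to unit length first, the two measures rank neighbors identically, which is why many pipelines normalize embeddings before storing them.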

[Interactive demo: Cosine Similarity Explorer. Sliders rotate vectors A and B and change the length of vector A; the cosine similarity follows the angle between the vectors and is unaffected by the length.]


Why High Dimensions Work

Intuitively, high-dimensional spaces seem harder to work with. But for embeddings, the extra dimensions help. The Johnson-Lindenstrauss lemma tells us something remarkable: N points in a high-dimensional space can be projected down to on the order of log(N)/ε² dimensions while preserving every pairwise distance to within a factor of 1 ± ε. The number of dimensions needed grows only logarithmically with the number of points, which is part of the intuition for why a few hundred dimensions, such as 768, can keep tens of thousands of distinct concepts apart.

There's also a geometric benefit: in high dimensions, most vectors are nearly orthogonal by chance. This means the embedding space has lots of "room" to place distinct concepts far apart, leaving only truly similar concepts close together.
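
The near-orthogonality claim is easy to check empirically. This sketch draws random Gaussian vectors (an assumption: real embeddings are not random, so this only illustrates the geometry) and compares the typical size of the cosine in 2 dimensions and in 512.

```python
import math
import random

def random_pair_cosine(dim, rng):
    """Cosine similarity between two independent random Gaussian vectors."""
    a = [rng.gauss(0, 1) for _ in range(dim)]
    b = [rng.gauss(0, 1) for _ in range(dim)]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def mean_abs_cosine(dim, trials=200, seed=0):
    rng = random.Random(seed)
    return sum(abs(random_pair_cosine(dim, rng)) for _ in range(trials)) / trials

low = mean_abs_cosine(2)
high = mean_abs_cosine(512)
print(high < low)  # True: random pairs are much closer to orthogonal in 512D
```

In 2D a random pair can easily point the same way by accident; in 512D the average absolute cosine shrinks toward zero, so a high similarity score is far less likely to be a coincidence.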

Clustering and Semantic Search

Once you have embeddings, you can use them for three key operations:

Nearest Neighbor Search
Given a query vector, find the K vectors in your database closest to it. This is semantic search. Used in RAG, recommendation, and duplicate detection.

Clustering
Group embeddings by distance (K-means, DBSCAN). Documents in the same cluster discuss the same topic — without any labels. Used for topic modeling and dataset organization.

Analogies as vector arithmetic: if embeddings are geometrically consistent, then relationships between concepts correspond to consistent direction vectors. "Capital city of" is a direction: Paris − France ≈ Berlin − Germany ≈ Tokyo − Japan. This only works because the geometry is meaningful — not a coincidence, but a consequence of how the model was trained.
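
Nearest neighbor search, the first operation above, can be sketched as a brute-force scan. The document ids and 3D vectors are invented stand-ins for real embeddings; production systems replace the full scan with an approximate index.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query_vec, index, k=2):
    """Brute-force nearest neighbors: score every stored vector, keep the best k."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy index with invented vectors.
index = {
    "thriller_a": [0.9, 0.1, 0.0],
    "thriller_b": [0.8, 0.2, 0.1],
    "cookbook":   [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.0], index))  # ['thriller_a', 'thriller_b']
```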

Why is cosine similarity preferred over Euclidean distance for comparing embeddings?

Cosine similarity is faster to compute
Euclidean distance doesn't work in high dimensions
It measures angle (semantic direction) and ignores vector length, which can vary with text length

Chapter 5: Embedding Models in Practice

Knowing what embeddings are is one thing. Using them in production is another. Here's what actually matters when you reach for an embedding API.

Making an API Call

python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Embed a single sentence
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox jumps over the lazy dog"
)
vec = response.data[0].embedding  # list of 1536 floats

# Embed a batch of sentences (much more efficient)
sentences = [
    "Paris is the capital of France",
    "Berlin is Germany's largest city",
    "I enjoy eating pizza",
]
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=sentences
)
vecs = [d.embedding for d in response.data]  # list of 3 vectors

# Cosine similarity between two embeddings
def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Exact scores depend on the model; only the ordering is the point here
sim_cities = cosine_sim(vecs[0], vecs[1])  # expected to be higher (both about European cities)
sim_pizza  = cosine_sim(vecs[0], vecs[2])  # expected to be lower (unrelated topics)

Dimensionality: What Size Do You Need?

More dimensions = more expressive, but also more storage and slower nearest-neighbor search. A 1536-dim vector costs twice the storage and compute of a 768-dim one. For most applications, 768 or 1024 dimensions is plenty. Only go higher for nuanced retrieval tasks.

Matryoshka Embeddings

OpenAI's text-embedding-3 models support Matryoshka Representation Learning (MRL) — named after Russian nesting dolls. The model is trained so that the first 256 dimensions alone are a good embedding. The first 512 are better. All 1536 are best. You can truncate the vector at any point and still get a useful embedding.

This means you can trade quality for speed by shortening the vector at query time. Use 256 dimensions for a rough pre-filter, then re-rank with full 1536 dimensions only for the top candidates.
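
The pre-filter-then-re-rank idea can be sketched with plain lists. The 4D vectors below are invented; truncation only keeps quality on embeddings from a model trained the Matryoshka way, and a real system would pre-compute and store the shortened document vectors instead of truncating them on every query.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` values and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query, docs, short_dims, shortlist, k):
    # Stage 1: cheap scoring on truncated vectors.
    q_short = truncate_and_normalize(query, short_dims)
    coarse = sorted(
        docs,
        key=lambda d: dot(q_short, truncate_and_normalize(docs[d], short_dims)),
        reverse=True,
    )[:shortlist]
    # Stage 2: re-rank only the shortlist with the full vectors.
    q_full = truncate_and_normalize(query, len(query))
    return sorted(
        coarse,
        key=lambda d: dot(q_full, truncate_and_normalize(docs[d], len(query))),
        reverse=True,
    )[:k]

docs = {
    "a": [0.9, 0.1, 0.4, 0.0],
    "b": [0.9, 0.1, 0.0, 0.4],
    "c": [0.0, 1.0, 0.0, 0.0],
}
result = two_stage_search([1.0, 0.0, 0.0, 0.5], docs, short_dims=2, shortlist=2, k=1)
print(result)  # ['b']
```

On the first two dimensions "a" and "b" look identical, so the coarse pass can only rule out "c"; the full vectors then break the tie.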

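python
# Using Matryoshka: truncate to 256 dims
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="my query",
    dimensions=256  # truncate and re-normalize internally
)
short_vec = response.data[0].embedding  # list of 256 floats
# Same model, 6x smaller vectors; expect some quality loss, so measure it on your own data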

Cost and Caching Strategy

Model	Cost per million tokens	Typical doc tokens	Cost per 100k docs
text-embedding-3-small	$0.02	~200	$0.40
text-embedding-3-large	$0.13	~200	$2.60
BGE (self-hosted)	~$0.001	~200	$0.02
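
The last column of the table is simple arithmetic, sketched here so you can plug in your own corpus size. The prices and the 200-token average are the ones from the table above; check current pricing before budgeting.

```python
def embedding_cost(num_docs, tokens_per_doc, price_per_million_tokens):
    """Total embedding cost in dollars for a one-time pass over the corpus."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens

small = embedding_cost(100_000, 200, 0.02)
large = embedding_cost(100_000, 200, 0.13)
print(f"${small:.2f}  ${large:.2f}")  # $0.40  $2.60
```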

Documents change rarely; queries come constantly. The right strategy: embed all documents once and store them. At query time, embed only the query. If the same query appears multiple times, cache its embedding in Redis or memory.

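Batching is critical. Embedding 1000 sentences one at a time incurs 1000 API calls with 1000 latency round-trips. Batching all 1000 into one call takes a single round-trip (subject to the API's per-request input limits). With per-token pricing the bill is usually the same either way, but you save network overhead and are less likely to hit request rate limits; with a self-hosted model, batching also makes better use of the hardware. Always batch when processing datasets.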
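
A minimal batching helper might look like the following. The `embed_batch` argument is a hypothetical callable standing in for whatever API or model you use, and the batch size of 100 is an arbitrary assumption; here a fake version counts calls so the saving is visible.

```python
def batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def embed_all(sentences, embed_batch, batch_size=100):
    """Call embed_batch once per chunk instead of once per sentence."""
    vectors = []
    for chunk in batches(sentences, batch_size):
        vectors.extend(embed_batch(chunk))
    return vectors

# Stand-in for a real API call: records each call and returns one fake vector per input.
calls = []
def fake_embed_batch(chunk):
    calls.append(len(chunk))
    return [[float(len(s))] for s in chunk]

vectors = embed_all([f"sentence {i}" for i in range(1000)], fake_embed_batch)
print(len(vectors), len(calls))  # 1000 10
```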

What is the key advantage of Matryoshka embeddings?

They work better on Russian language text
You can truncate them to shorter vectors and still get useful embeddings
They automatically update when document content changes

Chapter 6: Multimodal Embeddings

So far, every example has been text. But what if you want to search for images using text? Or find products visually similar to a photo you took? This requires putting images and text into the same vector space.

CLIP: One Space for Everything

In 2021, OpenAI released CLIP (Contrastive Language-Image Pre-training). The key idea: train two separate encoders — one for images, one for text — so that matching image-text pairs end up near each other in a shared embedding space.

Training data: 400 million (image, caption) pairs scraped from the internet. For each batch, CLIP sees N images and N captions, which gives N² possible image-caption pairings. The N correct pairs should have high cosine similarity. The remaining N² − N wrong pairings should have low similarity. This is the contrastive objective.
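
Here is a sketch of a CLIP-style loss on a 3×3 similarity matrix, where entry (i, j) is the cosine similarity of image i and caption j and the correct pairs sit on the diagonal. The similarity values are invented, and the fixed temperature of 0.07 is an assumption for illustration (in CLIP the temperature is a learned parameter).

```python
import math

def cross_entropy_row(logits, target):
    """-log softmax(logits)[target], computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def clip_style_loss(sim, temperature=0.07):
    """Symmetric cross-entropy over rows (image to text) and columns (text to image)."""
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    image_to_text = sum(cross_entropy_row(scaled[i], i) for i in range(n)) / n
    columns = [list(col) for col in zip(*scaled)]
    text_to_image = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

aligned    = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
misaligned = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.2, 0.0, 0.9]]
print(clip_style_loss(aligned) < clip_style_loss(misaligned))  # True
```

The loss is low only when each diagonal entry dominates its own row and column, which is exactly the "matching pairs high, everything else low" requirement.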

[Interactive demo: CLIP Shared Embedding Space. Images and text captions land in the same space, and matching pairs cluster together.]

Zero-Shot Classification with CLIP

Here's where it gets powerful. Suppose you want to classify an image as "cat" or "dog" but you never trained a classifier. With CLIP, you embed the image and embed the text labels ("a photo of a cat," "a photo of a dog"). Whichever text embedding is closest to the image embedding wins.

python
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("my_image.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Image-text cosine similarities, scaled by CLIP's learned temperature
logits = outputs.logits_per_image  # shape: [1, 3]
probs = logits.softmax(dim=-1)    # e.g. [0.91, 0.07, 0.02] for a cat photo

best_label = labels[probs.argmax(dim=-1).item()]  # label with the highest probability

You never trained on these specific classes. The model generalizes because the text encodings capture semantics, and the image encodings capture visual features — and training aligned them.

Beyond CLIP: The Multimodal Frontier

CLIP proves the concept, but modern systems go further. ImageBind (Meta, 2023) embeds six modalities — images, text, audio, depth, thermal, and IMU data — into a single shared space. Train on image-audio pairs and image-text pairs, and you implicitly learn audio-text similarity even without direct audio-text training pairs. The shared space does the work.

The power of a shared space: once images and text live in the same embedding space, you can do image retrieval with text queries, text retrieval with image queries, and cluster mixed datasets. The embedding space becomes a universal language for meaning — regardless of modality.

How does CLIP perform zero-shot image classification?

It looks up images in a database of labeled examples
It embeds the image and text labels, then finds the text label closest to the image in embedding space
It generates a text description of the image and compares descriptions

Chapter 7: When Embeddings Fail

Embeddings are powerful but not infallible. Every production embedding system has failure modes. Knowing them upfront saves you from debugging at 2 AM.

Domain Mismatch

A general-purpose embedding model trained on web text has never seen your company's internal jargon. "JIRA-4521 blocks the Q3 OKR metric" is meaningless to it. The model will embed "JIRA-4521" as a random-ish vector because it never appeared in training data. Domain mismatch is the most common production failure.

Fix: fine-tune the embedding model on your domain's data using contrastive loss on pairs you know are similar/dissimilar. Or use a retrieval model that's been specialized for your domain.

Negation is Invisible

This one is subtle and dangerous. The sentences "The product is good" and "The product is not good" have nearly identical embeddings. The word "not" is short and common — it doesn't change the semantic neighborhood much. Bag-of-words-style thinking bleeds into neural embeddings too.

[Interactive demo: Negation Failure. Shows how cosine similarity changes (or doesn't) when negation is added to a sentence.]

Short vs Long Text

Mean pooling averages word vectors. A 500-word article's embedding is an average of 500 vectors, each contributing 0.2%. Key concepts get diluted. Short texts (1-2 sentences) are much more focused — their embeddings are sharper and more discriminative. Long texts need chunking: split into 512-token windows, embed each, then store all chunks. At query time, retrieve chunks, not full documents.
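
Chunking itself is a few lines. This sketch splits an already-tokenized document into fixed windows; the token list is fake, and a real pipeline would use the embedding model's own tokenizer to count tokens.

```python
def chunk_tokens(tokens, window=512):
    """Split a token list into consecutive windows of at most `window` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

tokens = [f"tok{i}" for i in range(1200)]   # stand-in for a tokenized document
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 176]
```

Each chunk is then embedded and stored with a pointer back to its source document, so a retrieved chunk can be traced to where it came from.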

The Stale Embedding Problem

You embedded your product catalog in January. In March, you added 10,000 new products. In June, product descriptions were rewritten. Now your embeddings are stale — they don't reflect the current state. Semantic search over stale embeddings will miss or misrank items.

Fix: treat embeddings like a cache. Invalidate and re-embed when source documents change. Keep a hash of the source text; if the hash changes, re-embed. For large corpora, re-embed on a schedule (nightly or weekly).
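
The hash-based invalidation described above can be sketched as follows. The in-memory `store` dict and the `embed` callable are hypothetical stand-ins for your vector database and embedding model.

```python
import hashlib

def text_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sync_embeddings(documents, store, embed):
    """Re-embed only documents that are new or whose text hash changed."""
    updated = []
    for doc_id, text in documents.items():
        digest = text_hash(text)
        cached = store.get(doc_id)
        if cached is None or cached["hash"] != digest:
            store[doc_id] = {"hash": digest, "vector": embed(text)}
            updated.append(doc_id)
    return updated

def fake_embed(text):
    """Stand-in for a real model call."""
    return [float(len(text))]

store = {}
first = sync_embeddings({"p1": "red shoes", "p2": "blue hat"}, store, fake_embed)
second = sync_embeddings({"p1": "red shoes", "p2": "navy hat"}, store, fake_embed)
print(first, second)  # ['p1', 'p2'] ['p2']
```

The second pass pays for one embedding call instead of two, because only the rewritten description has a new hash.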

Semantic Drift in Time

"Cloud" in 2005 meant weather. In 2025 it means AWS. An embedding model trained on 2005 text will embed "cloud computing" near "meteorology." Language meaning shifts, but model weights don't update automatically.

Failure Mode	Symptom	Fix
Domain mismatch	Jargon embeddings are random-ish	Fine-tune on domain data
Negation blindness	"not good" ≈ "good"	Use a cross-encoder for re-ranking
Long text dilution	Key concepts lost in average	Chunk documents, store per-chunk
Stale embeddings	New/updated docs not found	Re-embed on change detection
Semantic drift	Old model, shifted language	Retrain or update model periodically

The golden rule for production: never treat embeddings as ground truth. Always have a re-ranking step — a cross-encoder that sees both query and document together — to catch the cases where embedding similarity was misleading. Bi-encoder retrieval for speed, cross-encoder re-ranking for accuracy.
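
The two-stage shape of that pipeline is sketched below. Both scoring functions are toy stand-ins, not real models: word overlap plays the role of the fast bi-encoder, and a negation-aware variant plays the role of the slower cross-encoder.

```python
def retrieve_then_rerank(query, docs, cheap_score, precise_score, shortlist=2, k=1):
    """Stage 1 scores every doc cheaply; stage 2 re-scores only the shortlist."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:shortlist]
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:k]

def overlap(query, doc):
    """Toy 'bi-encoder': number of shared words, blind to negation."""
    return len(set(query.split()) & set(doc.split()))

def negation_aware(query, doc):
    """Toy 'cross-encoder': same overlap, minus a penalty when negation disagrees."""
    penalty = 2 if ("not" in doc.split()) != ("not" in query.split()) else 0
    return overlap(query, doc) - penalty

docs = ["the product is not good", "the product is good", "shipping was slow"]
print(retrieve_then_rerank("product is good", docs, overlap, negation_aware))
# ['the product is good']
```

The cheap scorer cannot tell the first two documents apart; the second stage only has to look at the shortlist to put the right one on top.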

Why do embedding models struggle with negation like "not good"?

Neural networks can't process negative numbers
"Not" is too long a word to embed accurately
"Not" is common and short, so adding it barely shifts the embedding, leaving it near the positive version

Chapter 8: Showcase — Interactive Embedding Explorer

Everything comes together here. Type two sentences, watch their simulated embedding vectors projected into 2D space, and see cosine similarity update live. Compare how one-hot encoding, bag-of-words, and dense embeddings represent the same sentences differently.

[Interactive demo: Embedding Explorer. Inputs: Sentence A, Sentence B, and a method selector (default: Dense). Outputs: the cosine similarity and a verdict.]

Try: "I love cats" vs "I hate cats" — watch how dense embeddings still place them near each other (negation problem). Try: "Paris is beautiful" vs "France is lovely" — dense embeddings catch the semantic link that one-hot misses entirely.

Compare the Methods Side by Side

[Interactive demo: Method Comparison Canvas. All three methods are applied to the same sentence pair, taken from the explorer above, so you can see how the similarity scores differ.]

Chapter 9: Connections — The Foundation Beneath Everything

Embeddings don't exist in isolation. They're the foundation beneath nearly every modern AI application. Here's how they connect to the larger landscape.

The Stack

Text / Images / Audio

Raw unstructured data — not yet searchable by meaning

↓ Embedding Model

Dense Vectors

Meaning encoded as numbers — similar things are numerically close

↓ Vector Database (Pinecone, Weaviate, pgvector)

Indexed Embeddings

Approximate nearest neighbor (ANN) index — billions of vectors, millisecond search

↓ Retrieval at Query Time

Relevant Chunks

The K most semantically similar documents to the query

↓ LLM with Retrieved Context

RAG Answer

Grounded, factual, up-to-date response — the output the user sees

Where Embeddings Power Modern AI

Application	What Gets Embedded	What the Search Finds
RAG (ChatGPT Plugins)	Document chunks	Relevant context for the query
Semantic Search (Notion, Linear)	Pages, issues	Results matching query intent
Recommendation (Spotify, Netflix)	Songs, shows, user history	Items close to what you liked
Duplicate Detection	Bug reports, support tickets	Already-filed duplicates
Anomaly Detection	Transactions, log lines	Outliers far from the normal cluster
Cross-modal Search (Pinterest)	Images + text	Images matching a text query

Approximate Nearest Neighbor Algorithms

Exact nearest neighbor search over 100 million vectors is too slow. Real systems use Approximate Nearest Neighbor (ANN) algorithms that trade a small amount of accuracy for massive speed gains.

Two dominant approaches: HNSW (Hierarchical Navigable Small World) builds a graph where nearby vectors are connected. Searching means hopping through the graph — fast but memory-intensive. IVF (Inverted File Index) clusters vectors into groups; at query time, search only the closest groups, not all of them.
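
The IVF idea fits in a short sketch. The centroids are fixed by hand here (a real index would learn them, for example with k-means), the 2D vectors are invented, and Euclidean distance is used for simplicity.

```python
import math

def nearest(vec, points):
    """Index of the point closest to vec (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(vec, points[i]))

def build_ivf(vectors, centroids):
    """Assign every vector id to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        buckets[nearest(vec, centroids)].append(vec_id)
    return buckets

def ivf_search(query, vectors, centroids, buckets, nprobe=1, k=1):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [vid for i in order[:nprobe] for vid in buckets[i]]
    return sorted(candidates, key=lambda vid: math.dist(query, vectors[vid]))[:k]

centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {"a": [0.5, 0.2], "b": [1.0, 1.5], "c": [9.0, 9.5], "d": [10.5, 10.0]}
buckets = build_ivf(vectors, centroids)
print(buckets)                                              # {0: ['a', 'b'], 1: ['c', 'd']}
print(ivf_search([9.2, 9.4], vectors, centroids, buckets))  # ['c']
```

With `nprobe=1` the search never looks at "a" or "b". That is the source of both the speed-up and the approximation: a true neighbor sitting just across a bucket boundary would be missed unless `nprobe` is raised.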

Related Lessons

Next steps in this direction:
Contrastive Learning & CLIP — how contrastive objectives train shared embedding spaces

Transformers — the architecture that produces modern sentence embeddings

Attention & Self-Attention — how context-aware representations are built

Applications that use embeddings:
Vision-Language Models — CLIP embeddings at the core

Reward Alignment — embedding human preferences

NeRF & 3D Gaussian Splatting — positional embeddings in 3D space

"The embedding is where meaning lives. Everything else is search."
— A useful way to think about modern AI systems

What you can now build: a semantic search engine over any text corpus — embed your documents, store vectors in PostgreSQL with pgvector, embed queries at search time, return top-K by cosine similarity. This is the core of every RAG system in production today. The hard part isn't the embeddings — it's all the failure modes you now know to watch for.