How to turn words, images, and products into numbers that understand meaning.
You run an online bookstore. A customer searches for "gripping thriller with a twist ending." Your database has ten thousand books. How do you find the right ones?
The naive approach: scan every book title and description for those exact words. But what if a perfect match is described as "a nail-biting mystery that subverts expectations"? Exact keyword search misses it entirely. The meaning is there — the words aren't.
The same problem appears everywhere. A doctor searches for similar patient cases. A music app finds songs that sound like this one. An e-commerce site shows products like what you just viewed. A fraud system flags transactions similar to known fraud. In every case, you need to measure semantic similarity — closeness of meaning, not closeness of spelling.
Let's start with the most obvious approach: just assign each word a number. "apple" = 1, "banana" = 2, "cherry" = 3. Done, right?
No. This creates a lie. The number 1 is close to 2 but far from 100. But "apple" isn't closer to "banana" than it is to "mango." The numbers imply a fake ordering and fake distances. The model would hallucinate relationships that don't exist.
The fix is one-hot encoding: give each word its own dedicated slot. With a vocabulary of N words, each word becomes a vector of N zeros with a single 1 in its slot.
Click a word to see its one-hot vector. Notice: every pair of words is exactly the same distance apart.
One-hot encoding is honest — it doesn't imply false similarity. But it has a crippling flaw: every word is equally far from every other word. The distance between "king" and "queen" is identical to the distance between "king" and "banana." The representation has no knowledge of meaning.
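To see the flaw in numbers, here is a minimal sketch with a three-word toy vocabulary. The vocabulary and the helper names `one_hot` and `euclidean` are illustrative assumptions, not part of any library.

```python
import math

# Toy vocabulary (an assumption for illustration)
vocab = ["king", "queen", "banana"]

def one_hot(word, vocab):
    """Return len(vocab) zeros with a single 1 in the word's slot."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

king, queen, banana = (one_hot(w, vocab) for w in vocab)

d_king_queen = euclidean(king, queen)
d_king_banana = euclidean(king, banana)
print(d_king_queen, d_king_banana)  # both equal sqrt(2)
```

Any two distinct one-hot vectors differ in exactly two slots, so every pair is sqrt(2) apart no matter what the words mean.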
Bag of Words (BoW) represents a whole document as a count vector: how many times does each vocabulary word appear? "The cat sat on the mat" becomes a vector where "the" = 2, "cat" = 1, "sat" = 1, etc., and everything else = 0.
Documents with similar words get similar vectors. A cookbook and a recipe article will both have high counts for "cup," "tablespoon," "heat." This is progress — but word frequency isn't meaning. "Not good" and "good" have nearly identical BoW vectors, even though they're opposites.
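A small sketch of the count vector, assuming whitespace tokenization and a hand-picked vocabulary (both simplifications; the helper name `bag_of_words` is hypothetical):

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

# Hand-picked vocabulary for illustration
vocab = ["the", "cat", "sat", "on", "mat", "not", "good", "is", "movie"]

doc_vec = bag_of_words("The cat sat on the mat", vocab)
print(doc_vec)  # [2, 1, 1, 1, 1, 0, 0, 0, 0]

# Opposite meanings, almost the same vector
pos = bag_of_words("the movie is good", vocab)
neg = bag_of_words("the movie is not good", vocab)
differing_slots = sum(1 for a, b in zip(pos, neg) if a != b)
print(differing_slots)  # 1: only the "not" slot differs
```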
TF-IDF (Term Frequency–Inverse Document Frequency) improves BoW by down-weighting common words like "the" and up-weighting rare, distinctive words like "photosynthesis." A word that appears everywhere is uninformative; a word that appears only in certain documents is a strong signal.
TF-IDF is widely used in search engines and works well for keyword retrieval. But it inherits the core problem: "happy" and "joyful" are treated as completely unrelated, because they're different words in different vocabulary slots.
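The weighting can be sketched in a few lines. This uses the textbook form, tf × log(N / df), on a made-up three-document corpus; real libraries typically apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Made-up corpus for illustration
docs = [
    "the plant uses photosynthesis",
    "the cat sat on the mat",
    "the dog chased the cat",
]
tokenized = [d.split() for d in docs]

def idf(term):
    """log(N / df). Assumes the term occurs in at least one document."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / df)

def tf_idf(term, doc_tokens):
    """Term frequency within the document, scaled by corpus-level rarity."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    return tf * idf(term)

w_the = tf_idf("the", tokenized[0])               # "the" is in every doc, so idf = 0
w_photo = tf_idf("photosynthesis", tokenized[0])  # appears in one doc only
print(w_the, w_photo)
```

"the" gets a weight of exactly zero because it appears in every document, while "photosynthesis" keeps a positive weight. "happy" and "joyful" would still occupy unrelated slots.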
| Method | Similarity Captured | Fatal Flaw |
|---|---|---|
| Integer ID | None (fake ordering) | Implies false numeric distance |
| One-Hot | None (all equidistant) | No relationship between words |
| Bag of Words | Shared words | Misses synonyms, ignores order |
| TF-IDF | Distinctive shared words | Still no semantic understanding |
In 2013, a Google team published a paper with a remarkable result. They trained a neural network to predict words from their context. A side effect: the network's internal representation of each word turned out to be geometrically meaningful.
They called the method Word2Vec. The key discovery: in the learned vector space, you could do arithmetic with meaning.
In the classic example, king − man + woman ≈ queen. The vector from "man" to "king" encodes the concept of royalty. Add that to "woman" and you land near "queen." This isn't magic; it's what happens when you train a model to predict words from context.
The training task is simple: predict words from their surroundings. Word2Vec has two variants: skip-gram, which predicts the neighbors of a given word, and CBOW (continuous bag of words), which predicts a missing word from its neighbors. In the CBOW framing, given "The ___ sat on the mat," predict "cat." Given "She is a ___ at Buckingham Palace," predict "queen." Words that appear in similar contexts receive similar gradient updates, so their vectors drift toward each other in vector space.
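To make the "predict a word's neighbors" task concrete, here is a minimal sketch that lists the (center, context) training pairs for one sentence. This formulation is usually called skip-gram; the function name `skipgram_pairs` and the window size are illustrative choices.

```python
def skipgram_pairs(tokens, window=2):
    """List (center, context) pairs: each word paired with its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split(), window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair is one training example: the model is nudged so that the center word's vector becomes better at predicting that context word.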
Concretely, Word2Vec maintains an embedding table: a matrix of shape (vocabulary_size × embedding_dim). For a 50,000-word vocabulary with 300-dimensional vectors, that's a 50,000 × 300 matrix — 15 million numbers.
To "embed" a word, you look up its row. No arithmetic is needed, just indexing. Mathematically, the lookup is equivalent to multiplying a one-hot vector by the embedding matrix, but it is implemented as a table lookup for efficiency.
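The equivalence between lookup and one-hot multiplication is easy to check by hand. The tiny 3 × 2 table below is made up for illustration.

```python
# A made-up embedding table: 3 words x 2 dimensions
table = [
    [0.1, 0.2],  # word 0
    [0.3, 0.4],  # word 1
    [0.5, 0.6],  # word 2
]

def lookup(table, word_id):
    """Embedding lookup: just index the row."""
    return table[word_id]

def one_hot_matmul(table, word_id):
    """Same result the slow way: one-hot row vector times the table."""
    one_hot = [1.0 if i == word_id else 0.0 for i in range(len(table))]
    dim = len(table[0])
    return [
        sum(one_hot[i] * table[i][k] for i in range(len(table)))
        for k in range(dim)
    ]

by_index = lookup(table, 1)
by_matmul = one_hot_matmul(table, 1)
print(by_index, by_matmul)  # [0.3, 0.4] [0.3, 0.4]
```

The multiplication touches every row only to throw away all but one, which is why real implementations index directly.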
```python
# A minimal embedding table in PyTorch
import torch
import torch.nn as nn

vocab_size = 50000
embed_dim = 300

# The table: 50k words × 300 dimensions = 15M learnable numbers
embedding = nn.Embedding(vocab_size, embed_dim)

# Look up word ID 42 (assumed here to be "king"; real IDs depend on your vocabulary)
king_id = torch.tensor([42])
king_vec = embedding(king_id)  # tensor of shape [1, 300]

# The king - man + woman analogy (no gradients needed for a lookup)
with torch.no_grad():
    result = (
        embedding(torch.tensor([42]))    # king
        - embedding(torch.tensor([17]))  # man
        + embedding(torch.tensor([29]))  # woman
    )
# With trained weights, result lands near the queen vector (find the nearest
# neighbor to confirm). A freshly initialized table like this one is random.
```
Simulated 2D word vectors showing semantic structure. Drag the sliders to see how vector arithmetic traces through meaning space.
Arrows show the "royalty" vector. Adding it to "woman" lands near "queen."
Embeddings have another property: the number of dimensions (commonly 300, 768, or 1536) controls how much information each vector can carry. A 2D embedding is easy to visualize but has only two independent axes along which to separate concepts. A 768D embedding has room to encode hundreds of nuanced properties.
Word2Vec gives you one vector per word, no matter where the word appears. But most real tasks need a vector per sentence or per document. "The bank is by the river" and "I deposited money at the bank" both contain "bank," in completely different senses, yet Word2Vec assigns it the same vector in both. You need the full sentence to get the right meaning.
The simplest approach: embed every word, then average the vectors. For a 5-word sentence, get 5 vectors and average them element-by-element. The result is one vector representing the whole sentence.
This is called mean pooling. It works surprisingly well for topics and themes. But it loses word order completely — "dog bites man" and "man bites dog" get identical vectors.
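A toy sketch of mean pooling makes the word-order problem visible. The three-dimensional vectors are invented for illustration; real word vectors would come from a trained model.

```python
# Invented toy word vectors (a real system would load trained ones)
toy_vectors = {
    "dog":   [1.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0],
    "man":   [0.0, 0.0, 1.0],
}

def mean_pool(tokens, table):
    """Average the word vectors element by element."""
    vecs = [table[t] for t in tokens]
    dim = len(vecs[0])
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]

a = mean_pool("dog bites man".split(), toy_vectors)
b = mean_pool("man bites dog".split(), toy_vectors)
print(a == b)  # True: the average does not depend on word order
```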
Transformer models like BERT add a special [CLS] token (short for "classification") at the start of every sentence. The model is trained so that this token's final hidden state summarizes the entire sentence. Rather than averaging word vectors, you just read off the CLS token's output.
BERT's CLS token wasn't originally trained to produce good similarity embeddings. For search and retrieval, we use bi-encoders (also called dual encoders): two identical encoder networks that independently embed a query and a document, then compare them with cosine similarity.
Training a bi-encoder: show it pairs of semantically similar sentences (positives) and dissimilar ones (negatives). Train it to pull positives close together and push negatives apart. This is the contrastive learning objective.
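One common way to write a contrastive objective is an InfoNCE-style loss: treat the positive as the correct answer among all candidates and take the negative log-probability. The text above does not say which loss a given model uses, so read this as one plausible sketch; the temperature of 0.1 and the 2D vectors are arbitrary.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(query, positive, negatives, temperature=0.1):
    """Negative log-probability of picking the positive among all candidates."""
    sims = [cosine(query, positive)] + [cosine(query, n) for n in negatives]
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

query = [1.0, 0.0]
# Positive already close to the query: loss near zero
good = info_nce(query, positive=[0.9, 0.1], negatives=[[0.0, 1.0], [-1.0, 0.0]])
# Positive far away while a negative sits close: large loss
bad = info_nce(query, positive=[0.0, 1.0], negatives=[[0.9, 0.1], [-1.0, 0.0]])
print(good < bad)  # True
```

Minimizing this loss by gradient descent is what moves positives toward the query and negatives away from it.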
| Model | Dimensions | Strengths |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | General purpose, fast, cheap |
| OpenAI text-embedding-3-large | 3072 | Highest quality, supports truncation |
| Cohere embed-v3 | 1024 | Multilingual, task-specific types |
| BGE-large-en | 1024 | Open-source, strong benchmark scores |
| E5-mistral-7b | 4096 | LLM-based, best for complex queries |
You have two sentence vectors. How do you measure whether they're similar? The obvious answer is Euclidean distance — how far apart are the points? But for embeddings, a different measure works better: cosine similarity.
Cosine similarity measures the angle between two vectors, ignoring their lengths. Two vectors pointing in the same direction have cosine similarity 1.0. Perpendicular vectors: 0. Opposite directions: −1.0.
Why ignore length? Because the magnitude of an embedding can reflect incidental properties such as text length rather than semantic content. A long paragraph about cats and a short sentence about cats should be considered similar, even if their raw vectors have very different magnitudes.
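A quick sketch contrasting the two measures on hand-picked 2D vectors (the vectors and function names are illustrative):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short = [1.0, 2.0]
long_ = [10.0, 20.0]          # same direction, 10x the magnitude
perpendicular = [-2.0, 1.0]
opposite = [-1.0, -2.0]

cos_same = cosine_similarity(short, long_)          # 1.0 up to rounding
cos_perp = cosine_similarity(short, perpendicular)  # 0.0
cos_opp = cosine_similarity(short, opposite)        # -1.0 up to rounding
dist_same = euclidean_distance(short, long_)        # large despite same direction
print(cos_same, cos_perp, cos_opp, dist_same)
```

`short` and `long_` point the same way, so cosine similarity calls them identical, while Euclidean distance reports them as far apart.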
Intuitively, high-dimensional spaces seem harder to work with. But for embeddings, extra dimensions help. The Johnson-Lindenstrauss lemma tells us something remarkable: N points in a high-dimensional space can be projected down to on the order of log(N)/ε² dimensions while preserving every pairwise distance to within a factor of 1 ± ε. Embedding models benefit from the same geometry: a few hundred dimensions are enough to keep a very large number of distinct concepts apart.
There's also a geometric benefit: in high dimensions, most vectors are nearly orthogonal by chance. This means the embedding space has lots of "room" to place distinct concepts far apart, leaving only truly similar concepts close together.
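The near-orthogonality claim can be checked empirically. This sketch draws random unit vectors and compares the average absolute cosine similarity in 2 dimensions and in 768; the sample size and seed are arbitrary.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Random direction: Gaussian coordinates scaled to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_abs_cosine(dim, pairs=200, seed=0):
    """Average |cosine similarity| over random pairs of unit vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        a = random_unit_vector(dim, rng)
        b = random_unit_vector(dim, rng)
        total += abs(sum(x * y for x, y in zip(a, b)))
    return total / pairs

low = mean_abs_cosine(2)     # random 2D directions overlap a lot
high = mean_abs_cosine(768)  # random 768D directions are nearly orthogonal
print(low, high)
```

The 768-dimensional average should come out far smaller than the 2D one, which is the "room" described above.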
Once you have embeddings, you can use them for three key operations:
Knowing what embeddings are is one thing. Using them in production is another. Here's what actually matters when you reach for an embedding API.
```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Embed a single sentence
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox jumps over the lazy dog",
)
vec = response.data[0].embedding  # list of 1536 floats

# Embed a batch of sentences (much more efficient)
sentences = [
    "Paris is the capital of France",
    "Berlin is Germany's largest city",
    "I enjoy eating pizza",
]
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=sentences,
)
vecs = [d.embedding for d in response.data]  # list of 3 vectors

# Cosine similarity between two embeddings
def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Exact scores vary by model; what matters is the ordering
sim_cities = cosine_sim(vecs[0], vecs[1])  # expected higher: both about European cities
sim_pizza = cosine_sim(vecs[0], vecs[2])   # expected lower: unrelated topics
```
More dimensions = more expressive, but also more storage and slower nearest-neighbor search. A 1536-dim vector costs twice the storage and compute of a 768-dim one. For most applications, 768 or 1024 dimensions is plenty. Only go higher for nuanced retrieval tasks.
OpenAI's text-embedding-3 models support Matryoshka Representation Learning (MRL) — named after Russian nesting dolls. The model is trained so that the first 256 dimensions alone are a good embedding. The first 512 are better. All 1536 are best. You can truncate the vector at any point and still get a useful embedding.
This means you can trade quality for speed by shortening the vector. Use the first 256 dimensions (re-normalized to unit length, for both queries and stored documents) as a rough pre-filter, then re-rank only the top candidates with the full 1536 dimensions.
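The coarse-then-fine idea can be sketched with plain lists. The 4D vectors below stand in for real 1536D embeddings, and the function names are hypothetical. A production system would precompute the truncated document vectors instead of truncating on every query.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` values and rescale to unit length (assumes non-zero)."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query, docs, coarse_dims, shortlist, top_k):
    """Pre-filter on truncated vectors, then re-rank the shortlist at full size."""
    q_small = truncate_and_normalize(query, coarse_dims)
    coarse = sorted(
        range(len(docs)),
        key=lambda i: dot(q_small, truncate_and_normalize(docs[i], coarse_dims)),
        reverse=True,
    )[:shortlist]
    q_full = truncate_and_normalize(query, len(query))
    reranked = sorted(
        coarse,
        key=lambda i: dot(q_full, truncate_and_normalize(docs[i], len(docs[i]))),
        reverse=True,
    )
    return reranked[:top_k]

# Toy 4D "embeddings" (illustrative values)
docs = [
    [1.0, 0.0, 0.5, 0.0],
    [1.0, 0.0, -1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.9, 0.1, 0.4, 0.0],
]
query = [1.0, 0.0, 0.5, 0.0]
result = two_stage_search(query, docs, coarse_dims=2, shortlist=3, top_k=2)
print(result)  # [0, 3]
```

Document 1 looks perfect in the first two dimensions but falls behind once the full vector is used, which is exactly what the re-ranking step is for.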
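```python
from openai import OpenAI

client = OpenAI()

# Using Matryoshka: ask the API for a 256-dimensional embedding
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="my query",
    dimensions=256,  # shortened server-side
)
vec_256 = response.data[0].embedding  # list of 256 floats
# Same model, 6x smaller vectors, at a modest quality cost
# (how much depends on the task; measure it on your own data)
```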
| Model | Cost per million tokens | Typical doc tokens | Cost per 100k docs |
|---|---|---|---|
| text-embedding-3-small | $0.02 | ~200 | $0.40 |
| text-embedding-3-large | $0.13 | ~200 | $2.60 |
| BGE (self-hosted) | ~$0.001 | ~200 | $0.02 |
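The last column of the table is simple arithmetic, reproduced here so you can plug in your own corpus size. The prices are the ones listed in the table and may have changed since.

```python
def embedding_cost(num_docs, tokens_per_doc, price_per_million_tokens):
    """Total cost in dollars to embed a corpus once."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens

# 100k documents at ~200 tokens each = 20 million tokens
cost_small = embedding_cost(100_000, 200, 0.02)
cost_large = embedding_cost(100_000, 200, 0.13)
cost_bge = embedding_cost(100_000, 200, 0.001)
print(round(cost_small, 2), round(cost_large, 2), round(cost_bge, 2))  # 0.4 2.6 0.02
```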
Documents change rarely; queries come constantly. The right strategy: embed all documents once and store them. At query time, embed only the query. If the same query appears multiple times, cache its embedding in Redis or memory.
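An in-memory version of the query cache takes only a decorator. `fake_embed` below is a hypothetical stand-in for a real embedding API call, used so the sketch runs offline; with Redis you would key the stored vector on the query text in the same way.

```python
from functools import lru_cache

calls = {"count": 0}  # tracks how often the "API" is actually hit

def fake_embed(text):
    """Hypothetical stand-in for a real embedding API call."""
    calls["count"] += 1
    return tuple(float(ord(c)) for c in text[:4])

@lru_cache(maxsize=10_000)
def embed_query(text):
    """Embed a query, reusing the cached vector for repeated queries."""
    return fake_embed(text)

first = embed_query("gripping thriller")
second = embed_query("gripping thriller")  # served from the cache
print(calls["count"])  # 1
```

The function returns a tuple rather than a list so that cached values cannot be mutated by callers.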
So far, every example has been text. But what if you want to search for images using text? Or find products visually similar to a photo you took? This requires putting images and text into the same vector space.
In 2021, OpenAI released CLIP (Contrastive Language-Image Pre-training). The key idea: train two separate encoders — one for images, one for text — so that matching image-text pairs end up near each other in a shared embedding space.
Training data: 400 million (image, caption) pairs scraped from the internet. For each batch, CLIP sees N images and N captions, giving N² possible pairings. The N correct image-caption pairs should have high cosine similarity. The remaining N² − N wrong pairings should have low similarity. This is the contrastive objective.
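In matrix form, the batch objective looks at an N × N grid of image-text similarities: the diagonal holds the correct pairs and everything off the diagonal is a wrong pairing. Below is a sketch of a symmetric cross-entropy loss over that grid. The similarity values are invented, and the temperature of 0.07 is an assumption for illustration.

```python
import math

def log_softmax_at(row, index):
    """Log-probability assigned to `index` by a softmax over `row`."""
    m = max(row)
    return row[index] - (m + math.log(sum(math.exp(x - m) for x in row)))

def clip_style_loss(sim, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-by-text similarity matrix."""
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Each image should pick its own caption (rows)...
    image_to_text = -sum(log_softmax_at(scaled[i], i) for i in range(n)) / n
    # ...and each caption should pick its own image (columns)
    columns = [[scaled[i][j] for i in range(n)] for j in range(n)]
    text_to_image = -sum(log_softmax_at(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Invented similarities: rows are images, columns are captions
aligned = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
shuffled = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.2], [0.0, 0.2, 0.9]]

loss_aligned = clip_style_loss(aligned)
loss_shuffled = clip_style_loss(shuffled)
print(loss_aligned < loss_shuffled)  # True
```

With N = 3 there are 3 correct pairs on the diagonal and 6 wrong ones off it; the loss is low only when the diagonal dominates every row and column.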
Images and text captions land in the same space. Matching pairs cluster together. Click to highlight a pair.
Here's where it gets powerful. Suppose you want to classify an image as "cat" or "dog" but you never trained a classifier. With CLIP, you embed the image and embed the text labels ("a photo of a cat," "a photo of a dog"). Whichever text embedding is closest to the image embedding wins.
```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("my_image.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Image-to-text scores: cosine similarity scaled by a learned temperature
logits = outputs.logits_per_image  # shape: [1, 3]
probs = logits.softmax(dim=-1)     # for a cat photo, something like [0.91, 0.07, 0.02]
```
You never trained on these specific classes. The model generalizes because the text encodings capture semantics, and the image encodings capture visual features — and training aligned them.
CLIP proves the concept, but modern systems go further. ImageBind (Meta, 2023) embeds six modalities — images, text, audio, depth, thermal, and IMU data — into a single shared space. Train on image-audio pairs and image-text pairs, and you implicitly learn audio-text similarity even without direct audio-text training pairs. The shared space does the work.
Embeddings are powerful but not infallible. Every production embedding system has failure modes. Knowing them upfront saves you from debugging at 2 AM.
A general-purpose embedding model trained on web text has never seen your company's internal jargon. "JIRA-4521 blocks the Q3 OKR metric" is meaningless to it. The model will embed "JIRA-4521" as a random-ish vector because it never appeared in training data. Domain mismatch is the most common production failure.
Fix: fine-tune the embedding model on your domain's data using contrastive loss on pairs you know are similar/dissimilar. Or use a retrieval model that's been specialized for your domain.
This one is subtle and dangerous. The sentences "The product is good" and "The product is not good" often have very similar embeddings. The word "not" is short and common, so it barely moves the sentence out of its semantic neighborhood. Bag-of-words-style thinking bleeds into neural embeddings too.
Mean pooling averages word vectors. A 500-word article's embedding is an average of 500 vectors, each contributing 0.2%. Key concepts get diluted. Short texts (1-2 sentences) are much more focused — their embeddings are sharper and more discriminative. Long texts need chunking: split into 512-token windows, embed each, then store all chunks. At query time, retrieve chunks, not full documents.
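A minimal chunking sketch. It treats whitespace-separated words as tokens, which is a simplification (real models count subword tokens from their own tokenizer), and the optional overlap between windows is an added assumption, not something the text above requires.

```python
def chunk_tokens(tokens, window=512, overlap=0):
    """Split a token list into windows of at most `window` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    if not tokens:
        return []
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this chunk reached the end of the document
        start += step
    return chunks

# A fake 1200-token document (integers stand in for tokens)
document = list(range(1200))
chunks = chunk_tokens(document, window=512, overlap=64)
print([len(c) for c in chunks])  # [512, 512, 304]
```

Each chunk is then embedded and stored separately, along with a pointer back to its source document.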
You embedded your product catalog in January. In March, you added 10,000 new products. In June, product descriptions were rewritten. Now your embeddings are stale — they don't reflect the current state. Semantic search over stale embeddings will miss or misrank items.
Fix: treat embeddings like a cache. Invalidate and re-embed when source documents change. Keep a hash of the source text; if the hash changes, re-embed. For large corpora, re-embed on a schedule (nightly or weekly).
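The hash check can be sketched as follows. The dictionaries stand in for whatever store holds your documents and their last-embedded hashes, and the function names are hypothetical.

```python
import hashlib

def text_hash(text):
    """Stable fingerprint of the source text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_needing_reembedding(current_docs, stored_hashes):
    """IDs whose text is new or has changed since it was last embedded."""
    return [
        doc_id
        for doc_id, text in current_docs.items()
        if stored_hashes.get(doc_id) != text_hash(text)
    ]

def stale_ids(current_docs, stored_hashes):
    """IDs that were embedded earlier but no longer exist in the source."""
    return [doc_id for doc_id in stored_hashes if doc_id not in current_docs]

# Hashes recorded at the last embedding run
stored = {
    "a": text_hash("old description"),
    "b": text_hash("unchanged description"),
    "gone": text_hash("discontinued product"),
}
# Current state of the catalog
current = {
    "a": "rewritten description",
    "b": "unchanged description",
    "c": "brand new product",
}

to_embed = docs_needing_reembedding(current, stored)
to_delete = stale_ids(current, stored)
print(to_embed, to_delete)  # ['a', 'c'] ['gone']
```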
In 2005, "cloud" mostly meant weather. In 2025 it usually means AWS-style computing. An embedding model trained only on 2005 text would likely place "cloud" closer to "meteorology" than to "servers." Language meaning shifts, but model weights don't update automatically.
| Failure Mode | Symptom | Fix |
|---|---|---|
| Domain mismatch | Jargon embeddings are random-ish | Fine-tune on domain data |
| Negation blindness | "not good" ≈ "good" | Use a cross-encoder for re-ranking |
| Long text dilution | Key concepts lost in average | Chunk documents, store per-chunk |
| Stale embeddings | New/updated docs not found | Re-embed on change detection |
| Semantic drift | Old model, shifted language | Retrain or update model periodically |
Everything comes together here. Type two sentences, watch their simulated embedding vectors projected into 2D space, and see cosine similarity update live. Compare how one-hot encoding, bag-of-words, and dense embeddings represent the same sentences differently.
Try: "I love cats" vs "I hate cats" — watch how dense embeddings still place them near each other (negation problem). Try: "Paris is beautiful" vs "France is lovely" — dense embeddings catch the semantic link that one-hot misses entirely.
All three methods applied to the same sentence pair. See how the similarity scores differ.
Embeddings don't exist in isolation. They're the foundation beneath nearly every modern AI application. Here's how they connect to the larger landscape.
| Application | What Gets Embedded | What the Search Finds |
|---|---|---|
| RAG (ChatGPT Plugins) | Document chunks | Relevant context for the query |
| Semantic Search (Notion, Linear) | Pages, issues | Results matching query intent |
| Recommendation (Spotify, Netflix) | Songs, shows, user history | Items close to what you liked |
| Duplicate Detection | Bug reports, support tickets | Already-filed duplicates |
| Anomaly Detection | Transactions, log lines | Outliers far from the normal cluster |
| Cross-modal Search (Pinterest) | Images + text | Images matching a text query |
Exact nearest neighbor search over 100 million vectors is too slow. Real systems use Approximate Nearest Neighbor (ANN) algorithms that trade a small amount of accuracy for massive speed gains.
Two dominant approaches: HNSW (Hierarchical Navigable Small World) builds a graph where nearby vectors are connected. Searching means hopping through the graph — fast but memory-intensive. IVF (Inverted File Index) clusters vectors into groups; at query time, search only the closest groups, not all of them.
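The IVF idea fits in a short sketch. The centroids here are hard-coded for illustration (a real index would learn them, typically with k-means), and the function names are hypothetical rather than any library's API.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for vid, vec in enumerate(vectors):
        nearest = min(
            range(len(centroids)),
            key=lambda c: squared_distance(vec, centroids[c]),
        )
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, n_probe=1, top_k=1):
    """Scan only the n_probe clusters whose centroids are closest to the query."""
    probe = sorted(
        range(len(centroids)),
        key=lambda c: squared_distance(query, centroids[c]),
    )[:n_probe]
    candidates = [vid for c in probe for vid in lists[c]]
    candidates.sort(key=lambda vid: squared_distance(query, vectors[vid]))
    return candidates[:top_k]

def exact_search(query, vectors, top_k=1):
    """Brute-force baseline: compare the query against every vector."""
    order = sorted(
        range(len(vectors)),
        key=lambda vid: squared_distance(query, vectors[vid]),
    )
    return order[:top_k]

# Toy 2D data in three obvious clusters
vectors = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [10.0, 0.0], [9.8, 0.3]]
centroids = [[0.1, 0.05], [5.05, 4.95], [9.9, 0.15]]  # assumed precomputed

lists = build_ivf(vectors, centroids)
query = [5.2, 5.0]
approx = ivf_search(query, vectors, centroids, lists, n_probe=1)
exact = exact_search(query, vectors)
print(lists, approx, exact)  # [[0, 1], [2, 3], [4, 5]] [3] [3]
```

Here the approximate search inspects two vectors instead of six and still finds the true nearest neighbor. When a query falls near a cluster boundary, a small `n_probe` can miss it, which is the accuracy-for-speed trade described above.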
"The embedding is where meaning lives. Everything else is search."
— A useful way to think about modern AI systems