How discrete tokens become dense vectors: the lookup tables, tied weights, and patch projections at the foundation of transformer models.
The word "cat" enters a transformer as the integer 3782. The word "dog" enters as 5291. A neural network needs to do math with these — add them, multiply them, compute distances. But 3782 + 5291 = 9073, which means nothing. How do you turn arbitrary integers into vectors that capture meaning?
This isn't just a formatting problem. It's a representation problem. Integer IDs are categorical labels — they have no magnitude, no direction, no distance. Token 3782 isn't "closer" to token 3783 than to token 5291. The numbers were assigned arbitrarily when the tokenizer was built. They carry zero semantic information.
But neural networks live in continuous space. They multiply, add, and differentiate. They need inputs where "nearby" means "similar." We need a mapping from the discrete world of token IDs into a continuous vector space where similar meanings live at similar coordinates.
That mapping is called an embedding layer. And it's deceptively simple — it's just a table lookup.
Imagine a spreadsheet. Each row is a word in your vocabulary. Each row has d columns — say, 8 numbers. The word "cat" is row 3782. When you need a vector for "cat," you just go to row 3782 and read off the 8 numbers. No multiplication. No activation function. Just indexing.
Those 8 numbers are the embedding vector for "cat." They're learned parameters — they start random and get adjusted by gradient descent during training. After training, words with similar meanings end up with similar vectors. "cat" and "dog" are close. "cat" and "parliamentary" are far apart.
Let's see this in action.
Click a word to see its embedding vector (d=8 dimensions). Similar words have similar bar patterns. Below: a 2D projection showing how words cluster by meaning.
Click "cat" and then "dog." Notice how their bar charts look similar — both have high values in some of the same dimensions and low values in others. Now click "the." The pattern is completely different. That's because "cat" and "dog" are both animals, while "the" is a function word with entirely different grammatical and semantic properties.
The 2D projection at the bottom makes this even clearer. "cat," "dog," and "kitten" cluster together. "house" and "building" form another cluster. "the" sits alone. These clusters emerge naturally from training — the model learned that words used in similar contexts should have similar vectors.
Let's trace this by hand with a tiny vocabulary. We have 6 words, and each embedding has 3 dimensions (d=3).
Here's our embedding matrix (values learned from training):
| Token ID | Word | E[id][0] | E[id][1] | E[id][2] |
|---|---|---|---|---|
| 0 | the | -0.12 | 0.05 | 0.88 |
| 1 | cat | 0.72 | -0.41 | 0.15 |
| 2 | dog | 0.68 | -0.38 | 0.22 |
| 3 | sat | -0.55 | 0.62 | -0.03 |
| 4 | house | 0.31 | 0.15 | -0.72 |
| 5 | on | -0.08 | 0.11 | 0.79 |
Step 1: Token "cat" has ID 1. The embedding is row 1 of the matrix:
E[1] = [0.72, -0.41, 0.15]
That's it. No multiplication. No bias. No activation function. Just: go to row 1, read off 3 numbers. This is the entire operation.
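Step 2: Token "dog" has ID 2. Its embedding is row 2:
E[2] = [0.68, -0.38, 0.22]
Step 3: How similar are "cat" and "dog"? Compute the Euclidean distance:
‖E[1] − E[2]‖ = √((0.72 − 0.68)² + (−0.41 + 0.38)² + (0.15 − 0.22)²) = √(0.0016 + 0.0009 + 0.0049) = √0.0074 ≈ 0.086
Now compare "cat" to "the":
‖E[1] − E[0]‖ = √((0.72 + 0.12)² + (−0.41 − 0.05)² + (0.15 − 0.88)²) = √(0.7056 + 0.2116 + 0.5329) = √1.4501 ≈ 1.204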
"cat" is 14× closer to "dog" than to "the." The embedding space has organized itself so that semantic similarity maps to geometric proximity. This is the entire point of embeddings.
```python
# Embedding from scratch: it's literally a matrix with indexed rows
import numpy as np

class Embedding:
    def __init__(self, vocab_size, embed_dim):
        # Random initialization; training will adjust these
        self.weight = np.random.randn(vocab_size, embed_dim) * 0.02

    def forward(self, token_ids):
        # That's it. Index into rows. No multiplication.
        return self.weight[token_ids]

# Usage
emb = Embedding(vocab_size=6, embed_dim=3)
ids = np.array([1, 2, 0])      # "cat", "dog", "the"
vectors = emb.forward(ids)     # shape: (3, 3)
# vectors[0] = emb.weight[1] → cat's embedding
# vectors[1] = emb.weight[2] → dog's embedding
# vectors[2] = emb.weight[0] → the's embedding
```
```python
# PyTorch equivalent: nn.Embedding wraps the same idea
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=6, embedding_dim=3)
print(emb.weight.shape)  # torch.Size([6, 3])

ids = torch.tensor([1, 2, 0])
vectors = emb(ids)       # shape: (3, 3), same as weight[ids]

# Proof: embedding IS just indexing
assert torch.allclose(emb(ids), emb.weight[ids])  # True
```
nn.Embedding(V, d) stores a matrix of shape (V, d) and returns matrix[token_id]. Mathematically, it's equivalent to multiplying a one-hot vector by the weight matrix, but no implementation actually does that: it would waste memory and compute on a vector that's all zeros except for a single 1.
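The two claims so far, that a lookup equals a one-hot multiplication and that "cat" sits far closer to "dog" than to "the", can both be checked numerically. The sketch below copies the 6×3 table from the hand trace; the variable names are illustrative only.

```python
import numpy as np

# The 6 x 3 table from the hand trace above
E = np.array([
    [-0.12,  0.05,  0.88],  # 0: the
    [ 0.72, -0.41,  0.15],  # 1: cat
    [ 0.68, -0.38,  0.22],  # 2: dog
    [-0.55,  0.62, -0.03],  # 3: sat
    [ 0.31,  0.15, -0.72],  # 4: house
    [-0.08,  0.11,  0.79],  # 5: on
])
ids = np.array([1, 2, 0])            # "cat", "dog", "the"

by_index = E[ids]                    # what implementations do: read rows
one_hot = np.eye(len(E))[ids]        # (3, 6): a single 1 per row
by_matmul = one_hot @ E              # what the math says: one-hot times E
same_result = np.allclose(by_index, by_matmul)

cat, dog, the = by_index
d_cat_dog = float(np.linalg.norm(cat - dog))   # Euclidean distance
d_cat_the = float(np.linalg.norm(cat - the))

print(same_result)                        # True
print(round(d_cat_dog, 3))                # 0.086
print(round(d_cat_the, 3))                # 1.204
print(round(d_cat_the / d_cat_dog, 1))    # 14.0
```

The one-hot route allocates a (3, 6) matrix just to select three rows. At a real vocabulary size that intermediate would have tens of thousands of columns, which is why indexing is used instead.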
The embedding table is deceptively simple — it's just a big matrix. But it's also one of the largest parameter blocks in the entire model. For Llama 2 7B, the embedding matrix has 131 million parameters. That's more parameters than all of GPT-2 Small (124M).
Let's understand why it's so big, how it trains, and what can go wrong.
The token embedding matrix E has shape (V, d), where V is the vocabulary size (number of unique tokens) and d is the embedding dimension (length of each vector). Every entry is a learnable floating-point number.
Parameter count = V × d. Memory = V × d × bytes_per_param.
Let's compute this for real models:
| Model | V | d | Params | FP16 Memory |
|---|---|---|---|---|
| GPT-2 Small | 50,257 | 768 | 38.6M | 73 MB |
| BERT-Base | 30,522 | 768 | 23.4M | 45 MB |
| Llama 2 7B | 32,000 | 4,096 | 131.1M | 250 MB |
| Llama 3 8B | 128,256 | 4,096 | 525.3M | 1,002 MB |
| GPT-4 (est.) | 100,000 | 8,192 | 819.2M | 1,562 MB |
Look at Llama 3 vs Llama 2. Same embedding dimension, but Llama 3 quadrupled the vocabulary (from 32K to 128K tokens). That quadrupled the embedding table from 250 MB to 1 GB. With d held fixed, the table grows in direct proportion to vocabulary size.
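The V × d arithmetic behind the table is easy to script. This is a small sketch; `embedding_cost` is a hypothetical helper, and it reports size in MiB (2^20 bytes), which appears to be the unit behind the table's "MB" figures.

```python
def embedding_cost(vocab_size, embed_dim, bytes_per_param=2):
    """Return (parameter count, size in MiB) for a (V, d) embedding table."""
    params = vocab_size * embed_dim
    mib = params * bytes_per_param / 2**20
    return params, mib

# (V, d) pairs taken from the table above
models = {
    "GPT-2 Small": (50_257, 768),
    "BERT-Base": (30_522, 768),
    "Llama 2 7B": (32_000, 4_096),
    "Llama 3 8B": (128_256, 4_096),
}

for name, (V, d) in models.items():
    params, mib = embedding_cost(V, d)
    print(f"{name:12s} {params / 1e6:7.1f}M  {mib:8.1f} MiB")
```

Passing `bytes_per_param=4` gives the FP32 footprint, which doubles every figure.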
Why would anyone want a bigger vocabulary? Because larger vocabularies encode text more efficiently — fewer tokens per sentence, shorter sequences, lower inference cost. But the embedding table gets proportionally larger. This is a fundamental tradeoff in model design.
Adjust vocabulary size and embedding dimension. The heatmap shows the embedding matrix (each cell is a parameter). Below: total parameter count and memory.
At initialization, every row of the embedding matrix is filled with small random numbers (typically drawn from a normal distribution with std = 0.02). At this point, "cat" and "dog" have random, unrelated vectors. The model can't tell them apart any more than it can tell them from "parliament."
During training, each forward pass selects a subset of rows, one for each token in the batch. The loss gradient flows backward through the model and eventually reaches the embedding layer. But here's the key: only the rows that were selected get a gradient update. If "cat" (ID 3782) appeared in the batch, row 3782 gets updated. If "dog" (ID 5291) did not appear, row 5291 is untouched.
This is fundamentally different from a linear layer, where every weight participates in every forward pass. In an embedding layer, each row trains independently — only when its token appears in a batch.
This selective updating creates a problem. Tokens that appear frequently — "the," "is," "a" — get updated thousands of times per epoch. Their embeddings are well-trained, nuanced, and stable. They live in exactly the right part of the vector space.
But rare tokens — "pneumonoultramicroscopicsilicovolcanoconiosis," someone's unusual name, a niche technical term — might appear 5 times in the entire training set. Five gradient updates. Their embeddings stay close to their random initialization, carrying almost no learned information.
How does gradient descent work on a lookup table? Think of it this way: mathematically, looking up row i of matrix E is equivalent to computing the product e_i^T · E, where e_i is a one-hot vector (all zeros except position i). No implementation actually materializes the one-hot vector, since that would waste memory, but the gradient math works out the same way.
The gradient of the loss with respect to E[i] is simply the gradient that flowed back to this layer at the positions where token i appeared, summed if it appeared more than once. For rows that weren't selected, the gradient is zero (they weren't used, so changing them can't change the loss). Adaptive optimizers like Adam keep per-parameter statistics, so rare and frequent tokens end up with very different effective update sizes.
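To make the backward pass concrete, here is a hand-rolled sketch of it in NumPy. The upstream gradient values are made up for illustration; the point is the scatter-add pattern, where each position's gradient is added into the row it was read from.

```python
import numpy as np

V, d = 6, 3
ids = np.array([1, 2, 1])   # "cat", "dog", "cat": token 1 appears twice

# Made-up gradient arriving from the layers above, one row per position
upstream = np.array([
    [0.5, 0.25, 0.0],
    [1.0, 1.0,  1.0],
    [0.5, 0.25, 1.0],
])

# Backward pass of a lookup: scatter-add each position's gradient into its row.
# np.add.at accumulates correctly when the same index occurs more than once.
grad_E = np.zeros((V, d))
np.add.at(grad_E, ids, upstream)

print(grad_E[0])   # all zeros: "the" was never selected
print(grad_E[1])   # sum of the two "cat" positions
print(grad_E[2])   # the single "dog" position
```

A plain `grad_E[ids] += upstream` would silently drop one of the two "cat" contributions, because fancy-indexed assignment does not accumulate over repeated indices.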
```python
# Demonstrating gradient flow through embedding lookup
import torch
import torch.nn as nn

V, d = 6, 3
emb = nn.Embedding(V, d)

# Forward: look up tokens 1 ("cat") and 2 ("dog")
ids = torch.tensor([1, 2])
vecs = emb(ids)  # shape: (2, 3)

# Fake loss: sum all embedding values
loss = vecs.sum()
loss.backward()

# Only rows 1 and 2 got gradients
print(emb.weight.grad[0])  # tensor([0., 0., 0.]) ← "the" untouched
print(emb.weight.grad[1])  # tensor([1., 1., 1.]) ← "cat" got gradient
print(emb.weight.grad[2])  # tensor([1., 1., 1.]) ← "dog" got gradient
print(emb.weight.grad[3])  # tensor([0., 0., 0.]) ← "sat" untouched
```
A single token needs more than just its word meaning. It also needs to know where it is in the sequence (position 0? position 47?) and sometimes which segment it belongs to (sentence A? sentence B?). A transformer doesn't process tokens one at a time like an RNN — it sees the whole sequence at once. Without explicit position information, it can't tell "dog bites man" from "man bites dog."
The solution: add multiple embedding vectors together. Each embedding encodes a different aspect of the input. They all live in the same d-dimensional space, and the model learns to disentangle them.
BERT computes the input to its transformer stack as:
input = E_token[token_id] + E_position[position] + E_segment[segment_id]
Three separate lookup tables. Three separate row indices. Three vectors of the same dimension d. Added element-wise.
| Table | Shape | What it encodes | How many rows? |
|---|---|---|---|
| E_token | (V, d) | Word identity ("cat", "sat", "on") | V = 30,522 (BERT) |
| E_position | (max_len, d) | Position in sequence (0th, 1st, 47th) | max_len = 512 (BERT) |
| E_segment | (2, d) | Which sentence (A or B) | 2 (just two segments) |
Total embedding parameters for BERT-Base (d=768): 30,522 × 768 + 512 × 768 + 2 × 768 = 23,440,896 + 393,216 + 1,536 = 23,835,648. The token table dominates — it's 98.3% of the embedding parameters.
GPT-2 and GPT-3 use a simpler recipe with no segment embedding, because they don't do sentence-pair tasks:
input = E_token[token_id] + E_position[position]
Modern models like Llama go even further. They use Rotary Position Embeddings (RoPE) instead of a learned position table. RoPE injects position information inside the attention mechanism, not at the input. This means the only embedding table is E_token; position is handled separately (covered in the Positional Encoding lesson).
Let's trace BERT's recipe by hand with d=4.
Step 1: Look up each embedding.
Step 2: Add element-wise.
This single vector now encodes three things: the word is "cat," it's at position 3, and it belongs to sentence A. All compressed into 4 numbers. The transformer layers that follow will learn to disentangle these signals.
Three embedding vectors for a single token. Toggle each component on/off to see its contribution. Change the position slider to watch the position embedding shift while the token embedding stays fixed.
Drag the position slider and watch. The token embedding (orange bars) stays fixed — "cat" is "cat" regardless of where it appears. The position embedding (teal bars) changes — position 0 has a different pattern than position 7. The sum below shifts accordingly.
Toggle off the position component. Now the sum looks almost identical to just the token embedding. Toggle off the token component and leave only position. Now the sum looks nothing like "cat" — it's pure positional information. The model learns to use different dimensions for different types of information, making addition work despite all three signals occupying the same vector space.
The obvious alternative to addition is concatenation. Instead of adding three d-dimensional vectors, concatenate them into one 3d-dimensional vector. This keeps the information perfectly separated — no interference between token meaning and position.
So why doesn't anyone do this?
There's a deeper reason too. Addition is information-lossy in theory but sufficient in practice. With d=768 or d=4096, there are enough dimensions for the model to allocate different "channels" to different signal types. Analyses of trained models suggest that position embeddings and token embeddings tend to end up nearly orthogonal: they learn to use different directions in the high-dimensional space, which minimizes interference.
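One way to relate the two options: addition is the special case of "concatenate, then project back to d" in which the projection is three stacked identity blocks. The sketch below only checks that identity, using random stand-in vectors rather than trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
tok, pos, seg = rng.normal(size=(3, d))    # three made-up d-dimensional vectors

# Option A: element-wise addition (BERT's recipe)
added = tok + pos + seg                     # (d,)

# Option B: concatenate to 3d, then project back down to d
concat = np.concatenate([tok, pos, seg])    # (3d,)
W = np.vstack([np.eye(d)] * 3)              # (3d, d): three stacked identity blocks
projected = concat @ W                      # (d,)

identical = np.allclose(added, projected)
print(identical)                            # True
```

A learned W in place of the fixed one would be more general, but it costs an extra 3d × d matrix, or a 3d-wide model if the projection is skipped. Addition gets the same output width with no extra parameters.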
```python
import torch
import torch.nn as nn

class BERTEmbedding(nn.Module):
    def __init__(self, vocab_size, max_len, d_model, n_segments=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.seg_emb = nn.Embedding(n_segments, d_model)
        self.norm = nn.LayerNorm(d_model)  # BERT normalizes after adding
        self.drop = nn.Dropout(0.1)

    def forward(self, token_ids, segment_ids):
        # token_ids:   (batch, seq_len), integer token IDs
        # segment_ids: (batch, seq_len), 0 or 1
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)

        # Three lookups, one addition
        x = self.token_emb(token_ids)     # (batch, seq, d)
        x = x + self.pos_emb(positions)   # broadcast: (seq, d) → (batch, seq, d)
        x = x + self.seg_emb(segment_ids)
        return self.drop(self.norm(x))

# Usage
emb = BERTEmbedding(vocab_size=30522, max_len=512, d_model=768)
tokens = torch.randint(0, 30522, (2, 128))  # batch=2, seq=128
segs = torch.zeros(2, 128, dtype=torch.long)
out = emb(tokens, segs)  # (2, 128, 768)
```
```python
# HuggingFace: it's already inside the model
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# The three embedding tables live here:
print(model.embeddings.word_embeddings.weight.shape)        # (30522, 768)
print(model.embeddings.position_embeddings.weight.shape)    # (512, 768)
print(model.embeddings.token_type_embeddings.weight.shape)  # (2, 768)
```
Transformers process sequences. Text is a sequence of tokens — that's natural. But images aren't sequences. They're 2D grids of pixels. How do you feed an image into a transformer?
You could treat every pixel value as a token. A 224×224 RGB image has 224 × 224 × 3 = 150,528 pixel values. But attention is O(n²) in sequence length, so attending over 150K tokens is impractical: that's about 22.7 billion attention scores for every head in every layer.
The Vision Transformer (ViT) solved this with a simple trick: chop the image into a grid of non-overlapping patches, treat each patch as a "word," and run the same transformer architecture. A 224×224 image with 16×16 patches gives 196 "tokens" — a manageable sequence that attention can process.
Here's exactly what happens, step by step:
Notice the match: for ViT-Base with 16×16 patches, each flattened patch has 16 × 16 × 3 = 768 values, exactly the embedding dimension d=768. So the linear projection is a square matrix (768×768). The match is specific to this configuration: ViT-Large keeps 16×16 patches but uses d=1024, so its projection is 768×1024.
Let's work through a tiny example. Grayscale image (1 channel), 6×6 pixels, patch size 3×3.
Here's our 6×6 image (pixel values 0-9):
| 1 | 2 | 3 | 7 | 8 | 9 |
| 4 | 5 | 0 | 6 | 5 | 4 |
| 7 | 8 | 1 | 3 | 2 | 1 |
| 2 | 3 | 4 | 8 | 7 | 6 |
| 5 | 6 | 7 | 5 | 4 | 3 |
| 8 | 9 | 0 | 2 | 1 | 0 |
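Patch 0 (top-left 3×3 block): rows [1, 2, 3], [4, 5, 0], [7, 8, 1] → flattened [1, 2, 3, 4, 5, 0, 7, 8, 1]
Patch 1 (top-right): rows [7, 8, 9], [6, 5, 4], [3, 2, 1] → flattened [7, 8, 9, 6, 5, 4, 3, 2, 1]
Patch 2 (bottom-left): rows [2, 3, 4], [5, 6, 7], [8, 9, 0] → flattened [2, 3, 4, 5, 6, 7, 8, 9, 0]
Patch 3 (bottom-right): rows [8, 7, 6], [5, 4, 3], [2, 1, 0] → flattened [8, 7, 6, 5, 4, 3, 2, 1, 0]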
Linear projection: Weight matrix W is (9, 4). Let's project Patch 0:
For simplicity, suppose the projection of [1, 2, 3, 4, 5, 0, 7, 8, 1] through W gives us [0.42, -0.18, 0.73, 0.05]. That 4D vector is Patch 0's patch embedding. It's now a "token" that attention can process, just like the word "cat" in a language model.
After projecting all 4 patches, we have a sequence of length 4, each element a d=4 vector. The transformer processes this exactly like a 4-token text sequence.
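The patch split for this 6×6 example can be reproduced with a reshape and a transpose. In the sketch below, W is a random stand-in (the lesson does not give its values), so the projected numbers will not match the illustrative [0.42, -0.18, 0.73, 0.05]; only the shapes and the flattened patches are meaningful.

```python
import numpy as np

img = np.array([
    [1, 2, 3, 7, 8, 9],
    [4, 5, 0, 6, 5, 4],
    [7, 8, 1, 3, 2, 1],
    [2, 3, 4, 8, 7, 6],
    [5, 6, 7, 5, 4, 3],
    [8, 9, 0, 2, 1, 0],
])
p = 3                      # patch size
n = img.shape[0] // p      # patches per side (2)

# (6, 6) -> (n, p, n, p) -> (n, n, p, p) -> (n*n, p*p)
patches = img.reshape(n, p, n, p).transpose(0, 2, 1, 3).reshape(n * n, p * p)

print(patches[0])   # [1 2 3 4 5 0 7 8 1]
print(patches[1])   # [7 8 9 6 5 4 3 2 1]
print(patches[2])   # [2 3 4 5 6 7 8 9 0]
print(patches[3])   # [8 7 6 5 4 3 2 1 0]

# Linear projection with a random stand-in weight matrix of shape (9, 4)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(p * p, 4))
tokens = patches @ W       # (4, 4): four patch embeddings with d = 4
print(tokens.shape)        # (4, 4)
```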
The flatten-then-project operation can be implemented as a single convolution with kernel_size = patch_size and stride = patch_size. This is not just clever engineering — it's mathematically identical.
A convolution with kernel_size=16 and stride=16 slides a 16×16 window across the image, moving exactly 16 pixels each step (no overlap). At each position, it computes a weighted sum of the 768 pixel values (16×16×3) to produce one output value per filter. With d_model filters, each position produces a d-dimensional vector. Each position corresponds to one patch.
```python
# Patch embedding from scratch: reshape + linear
import torch
import torch.nn as nn

class PatchEmbedScratch(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, d_model=768):
        super().__init__()
        self.patch_size = patch_size
        self.n_patches = (img_size // patch_size) ** 2        # 196
        patch_dim = in_channels * patch_size * patch_size     # 768
        self.proj = nn.Linear(patch_dim, d_model)

    def forward(self, x):
        # x: (batch, channels, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # Reshape: (B, C, H, W) → (B, n_patches, patch_dim)
        x = x.unfold(2, p, p).unfold(3, p, p)                 # (B, C, nH, nW, p, p)
        x = x.contiguous().view(B, C, -1, p * p)              # (B, C, n_patches, p²)
        x = x.permute(0, 2, 1, 3).reshape(B, -1, C * p * p)   # (B, n_patches, patch_dim)
        return self.proj(x)                                   # (B, n_patches, d_model)
```
```python
# The Conv2d trick: mathematically identical, faster on GPU
# (reuses torch, nn, and PatchEmbedScratch from the previous block)
class PatchEmbedConv(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, d_model=768):
        super().__init__()
        # One conv: kernel=16, stride=16 → no overlap, each position = one patch
        self.proj = nn.Conv2d(in_channels, d_model,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, 3, 224, 224)
        x = self.proj(x)       # (B, d_model, 14, 14)
        x = x.flatten(2)       # (B, d_model, 196)
        x = x.transpose(1, 2)  # (B, 196, d_model), a sequence of patch embeddings
        return x

# Both produce identical shapes:
img = torch.randn(1, 3, 224, 224)
scratch = PatchEmbedScratch()
conv = PatchEmbedConv()
print(scratch(img).shape)  # (1, 196, 768)
print(conv(img).shape)     # (1, 196, 768)
```
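Matching shapes alone don't show that the two routes compute the same numbers. The NumPy sketch below checks it on a small made-up input: a strided convolution written as an explicit sliding window, against flatten-then-matmul using the same weights reshaped. Sizes are arbitrary and the bias term is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W_img, p, d = 3, 8, 8, 4, 5           # small made-up sizes
x = rng.normal(size=(C, H, W_img))           # one image, channels first
kernels = rng.normal(size=(d, C, p, p))      # d filters, each of shape (C, p, p)
nH, nW = H // p, W_img // p

# Route 1: convolution with kernel = stride = p, as an explicit sliding window
conv_out = np.zeros((nH * nW, d))
for i in range(nH):
    for j in range(nW):
        window = x[:, i * p:(i + 1) * p, j * p:(j + 1) * p]       # (C, p, p)
        conv_out[i * nW + j] = (kernels * window).sum(axis=(1, 2, 3))

# Route 2: flatten every patch in (channel, row, col) order, then one matmul
patches = (x.reshape(C, nH, p, nW, p)
            .transpose(1, 3, 0, 2, 4)
            .reshape(nH * nW, C * p * p))
W_lin = kernels.reshape(d, C * p * p).T      # same numbers, viewed as (C*p*p, d)
linear_out = patches @ W_lin

equivalent = np.allclose(conv_out, linear_out)
print(equivalent)   # True
```

The equivalence depends on flattening patches in the same (channel, row, column) order in which the kernel is reshaped. A different flatten order would still be a valid linear projection, just with permuted weights.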
Patch size is the single most impactful hyperparameter in ViT. It controls both the resolution of the representation and the computational cost.
| Patch Size | Patches (224×224) | Attention Cost (n²) | Detail Level |
|---|---|---|---|
| 32 × 32 | 49 | 2,401 | Low — misses fine details |
| 16 × 16 | 196 | 38,416 | Standard — good balance |
| 8 × 8 | 784 | 614,656 | High — captures fine detail |
| 4 × 4 | 3,136 | 9,834,496 | Very high — costly |
Going from patch size 16 to 8 quadruples the number of patches, which increases the attention cost by 16× (since attention is quadratic). This is why ViT-B/16 (patch size 16) is far more common than ViT-B/8 in practice — the accuracy gain from smaller patches rarely justifies the 16× compute increase.
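The table's numbers follow directly from the patch size. A tiny sketch, with `patch_stats` as a hypothetical helper name:

```python
def patch_stats(img_size, patch_size):
    """Return (number of patches, n^2 attention scores per head per layer)."""
    n = (img_size // patch_size) ** 2
    return n, n * n

for p in (32, 16, 8, 4):
    n, cost = patch_stats(224, p)
    print(f"{p:>2}x{p:<2}  patches={n:>5}  n^2={cost:>10,}")

# Halving the patch size: 4x the patches, 16x the attention scores
n16, cost16 = patch_stats(224, 16)
n8, cost8 = patch_stats(224, 8)
print(n8 // n16, cost8 // cost16)   # 4 16
```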
An image split into patches. Click any patch to see its flattened pixel vector and projected embedding. Adjust patch size to see the resolution-speed tradeoff.
Click different patches and watch the flattened vector change. Patches showing sky have low-contrast, similar pixel values — their vectors are "boring." Patches with edges or objects have high-variance pixels — their vectors are more "interesting." The linear projection learns to extract the patterns that matter for the downstream task.
ViT adds one extra token at the beginning of the sequence: the [CLS] token. This is a learnable embedding (not from any patch) that aggregates information from the entire image through attention. After the transformer, the [CLS] token's final hidden state is used for classification.
So the final input sequence is actually 197 tokens for a 224×224 image with 16×16 patches: 1 [CLS] + 196 patch embeddings. Each token also gets a learnable position embedding added (similar to BERT's recipe from Chapter 2), so the model knows which patch is top-left vs. bottom-right.
```python
# ViT's full embedding: patch projection + [CLS] + position
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, d_model=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2  # 196
        # Patch embedding via Conv2d
        self.patch_emb = nn.Conv2d(3, d_model,
                                   kernel_size=patch_size, stride=patch_size)
        # Learnable [CLS] token, shape (1, 1, d)
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        # Position embeddings for CLS + all patches
        self.pos_emb = nn.Parameter(torch.randn(1, n_patches + 1, d_model) * 0.02)

    def forward(self, x):
        B = x.shape[0]
        # Patch embeddings: (B, 3, 224, 224) → (B, 196, 768)
        patches = self.patch_emb(x).flatten(2).transpose(1, 2)
        # Prepend [CLS] token
        cls = self.cls_token.expand(B, -1, -1)  # (B, 1, 768)
        x = torch.cat([cls, patches], dim=1)    # (B, 197, 768)
        # Add position embeddings
        x = x + self.pos_emb                    # (B, 197, 768)
        return x
```
The input embedding maps token IDs to vectors. The output layer maps vectors back to token scores. Both are matrices of shape vocabulary × dimension. What if they were the same matrix? Using it forward to embed, using its transpose to predict?
That's weight tying, and it cuts your embedding parameters in half while often making the model better.
A language model's final job is to predict the next token. It has a hidden state h — a vector of dimension d that summarizes everything the model has read so far. It needs to turn h into a score for every token in the vocabulary. Token 0 gets a score, token 1 gets a score, all the way up to token V-1.
The standard approach: multiply h by a weight matrix W_out of shape (V, d). The result is a vector of V logits, one score per vocabulary token. The highest logit wins.
But wait. W_out has shape (V, d). The input embedding matrix E also has shape (V, d). Both map between a vocabulary-sized space and a d-dimensional space. They're doing mirror-image jobs.
Weight tying sets W_out = E. The same matrix serves double duty:
Input: x = E[token_id] (read one row of E)
Output: logits = h · E^T (dot product of h with every row of E)
The dot product h · E[i] measures how similar the hidden state is to token i's embedding. High similarity = high score = model predicts that token. This creates a beautiful symmetry: tokens whose embeddings are close in the embedding space produce similar predictions at the output.
Let's trace through a concrete example. V = 4 tokens, d = 3 dimensions.
Embedding matrix E (4×3):
| Token | d0 | d1 | d2 |
|---|---|---|---|
| 0 ("the") | 0.5 | 0.3 | -0.1 |
| 1 ("cat") | 0.8 | -0.2 | 0.4 |
| 2 ("sat") | 0.1 | 0.9 | 0.3 |
| 3 ("on") | -0.3 | 0.5 | 0.6 |
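The model has processed the sentence and produced a hidden state h = [0.6, 0.1, 0.3]. We now compute logits = h · E^T, which means taking the dot product of h with each row of E:
Token 0 ("the"): 0.6 × 0.5 + 0.1 × 0.3 + 0.3 × (−0.1) = 0.30 + 0.03 − 0.03 = 0.30
Token 1 ("cat"): 0.6 × 0.8 + 0.1 × (−0.2) + 0.3 × 0.4 = 0.48 − 0.02 + 0.12 = 0.58
Token 2 ("sat"): 0.6 × 0.1 + 0.1 × 0.9 + 0.3 × 0.3 = 0.06 + 0.09 + 0.09 = 0.24
Token 3 ("on"): 0.6 × (−0.3) + 0.1 × 0.5 + 0.3 × 0.6 = −0.18 + 0.05 + 0.18 = 0.05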
Logits: [0.30, 0.58, 0.24, 0.05]. Token 1 ("cat") has the highest score. The hidden state h is most similar to the embedding for "cat," so the model predicts "cat" as the next token.
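The same trace takes one matrix product in NumPy. This sketch reuses E and h from the example and shows both directions of the tied matrix: a row lookup on the way in, a product with the transpose on the way out.

```python
import numpy as np

E = np.array([
    [ 0.5,  0.3, -0.1],  # 0: the
    [ 0.8, -0.2,  0.4],  # 1: cat
    [ 0.1,  0.9,  0.3],  # 2: sat
    [-0.3,  0.5,  0.6],  # 3: on
])
h = np.array([0.6, 0.1, 0.3])        # hidden state from the example

x = E[1]                             # input side: embed "cat" by reading row 1
logits = h @ E.T                     # output side: dot product with every row
predicted = int(np.argmax(logits))

print(np.round(logits, 2))           # approximately 0.30, 0.58, 0.24, 0.05
print(predicted)                     # 1
```

No second matrix appears anywhere: the (4, 3) array is the only vocabulary-sized parameter in the sketch.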
Let's count parameters for a real model. Llama 2 7B: V = 32,000 tokens, d = 4,096 dimensions.
| | Without Tying | With Tying |
|---|---|---|
| Input embedding E | 32,000 × 4,096 = 131M | 32,000 × 4,096 = 131M |
| Output projection W | 32,000 × 4,096 = 131M | Shared with E = 0 |
| Total embedding params | 262M | 131M |
| Memory (FP16) | ~500 MB | ~250 MB |
That's 131 million fewer parameters, about 250 MB of memory saved in FP16 precision. For larger vocabularies the savings grow: with Llama 3's 128,256 tokens, tying would save 128,256 × 4,096 ≈ 525M parameters.
But tying isn't only about saving memory. The shared representation acts as a constraint.
Weight tying is common but not universal:
| Model | Ties Embeddings? | Notes |
|---|---|---|
| GPT-2 | Yes | One of the earliest popular tied LMs |
| BERT | Yes | Ties input embeddings with MLM head |
| T5 | Yes | Enc/dec share the same embedding matrix |
| Llama 2 (7B) | No | Separate input and output matrices |
| Llama 3 (8B) | No | Separate matrices, even with the 128K vocabulary |
| Gemma | Yes | Ties plus scales input embeddings |
| Mistral | No | Separate matrices |
The trend: smaller models benefit more from tying (the embedding parameters are a bigger fraction of total). Larger models can afford separate matrices and sometimes achieve marginally better performance without tying.
Below, we visualize the relationship between input embeddings and output projections. Toggle "tied" to see the matrices link together. The hidden state projects through the embedding matrix to produce scores — tokens closer to h in embedding space get higher logits.
Toggle tying on/off. Watch how the output logits change and the parameter count updates. Click a token to see its dot product with the hidden state.
```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Output projection shares the embedding weight
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # THE TIE

    def forward(self, token_ids, hidden):
        # Input: embed tokens
        x = self.embed(token_ids)      # (B, T) -> (B, T, d)
        # ... transformer layers process x into hidden ...
        # Output: logits = hidden @ E.T
        logits = self.lm_head(hidden)  # (B, T, d) -> (B, T, V)
        return logits

# Verify they share memory
model = TiedLM(32000, 4096)
print(model.embed.weight.data_ptr() == model.lm_head.weight.data_ptr())
# True: same tensor in memory
```
The critical line is self.lm_head.weight = self.embed.weight. After this assignment, both layers reference the exact same tensor in memory. Gradient updates from the output loss flow into the same parameters that define the input embeddings. One gradient update improves both embedding and prediction simultaneously.
You've combined token and position embeddings by addition. But what if one is much larger than the other? If position embeddings have values around 1.0 and token embeddings have values around 0.02, the position signal dominates. The model knows where every token is but barely knows what it is.
This is exactly what happens with the original Transformer's initialization. And the fix is a single multiplication.
When we initialize an embedding matrix, each entry is drawn from a distribution with standard deviation roughly 1/√d. For a model with d = 512, that means each element is about ±0.044.
How big is an entire embedding vector? Each of its d elements has variance 1/d, so the vector's squared magnitude (sum of squared elements) is approximately d × (1/d) = 1. The magnitude of a token embedding is roughly √1 = 1.0.
Now consider sinusoidal position embeddings. Each element is sin(…) or cos(…), so values range from -1 to +1. The squared magnitude is approximately d × 0.5 = d/2. The magnitude is roughly √(d/2).
For d = 512: position magnitude ≈ √256 = 16.0. Token magnitude ≈ 1.0. When we add them, position contributes 16× more than token identity. The model is almost entirely position, with a tiny whisper of "what token is this?"
After multiplying by √d, each element goes from std ≈ 1/√d to std ≈ 1. The vector's magnitude goes from ≈ 1 to ≈ √d, which puts it on the same scale as the positional embedding's √(d/2).
Let d = 512. A token embedding vector e has d elements, each with std = 1/√512 ≈ 0.0442.
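Without scaling:
‖e‖ ≈ √(512 × 0.0442²) ≈ √1.0 = 1.0
Sinusoidal position embedding:
‖p‖ = √(512 × 0.5) = √256 = 16.0
Ratio without scaling: position/token = 16.0/1.0 = 16:1. Position dominates.
With scaling (multiply e by √512 ≈ 22.6):
‖√d · e‖ ≈ 22.6 × 1.0 = 22.6, so position/token = 16.0/22.6 ≈ 0.7:1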
Now both signals contribute comparably. The model can distinguish both what a token is and where it is from the very first layer.
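These magnitudes can be checked numerically. The sketch below draws one random token vector with std = 1/√d, as the text assumes, and builds the standard sin/cos encoding for an arbitrarily chosen position. The position norm comes out at exactly √(d/2); the token norm is only approximately 1 because it depends on the random draw.

```python
import math
import numpy as np

d = 512
rng = np.random.default_rng(0)

# Token embedding with std = 1/sqrt(d), the initialization assumed in the text
e = rng.normal(scale=1 / math.sqrt(d), size=d)

# Sinusoidal encoding for a single position (position 10 is an arbitrary choice)
pos = 10
freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))   # one frequency per sin/cos pair
p = np.empty(d)
p[0::2] = np.sin(pos * freqs)
p[1::2] = np.cos(pos * freqs)

tok_norm = float(np.linalg.norm(e))                     # close to 1
pos_norm = float(np.linalg.norm(p))                     # sqrt(d/2) = 16
scaled_norm = float(np.linalg.norm(e * math.sqrt(d)))   # close to sqrt(d), about 22.6

print(f"unscaled  position/token = {pos_norm / tok_norm:.1f}")
print(f"scaled    position/token = {pos_norm / scaled_norm:.2f}")
```

Each sin/cos pair contributes sin² + cos² = 1 to the squared norm, which is why the position magnitude is √(d/2) regardless of which position is chosen.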
Let's use d = 4 for a tiny example we can trace completely.
Token embedding (initialized with std = 1/√4 = 0.5):
Magnitude: √(0.09 + 0.25 + 0.04 + 0.16) = √0.54 ≈ 0.735
Positional embedding (sinusoidal): p = [0.84, 0.54, 0.91, -0.42]
Magnitude: √(0.71 + 0.29 + 0.83 + 0.18) = √2.01 ≈ 1.418
Without scaling: combined = e + p
The result is dominated by the position values (0.84, 0.54, 0.91, -0.42). The token's contribution is barely visible — the 0.3 gets swamped by 0.84, the -0.5 nearly cancels with 0.54.
With scaling: e_scaled = e × √4 = e × 2
combined = e_scaled + p
Now both signals are clearly visible. The -1.0 from the token is preserved alongside the 0.54 from position, giving -0.46 instead of a washed-out 0.04.
| Architecture | Scales? | Position Type | Why |
|---|---|---|---|
| Original Transformer | Yes (√d) | Sinusoidal (fixed) | Fixed position values need token magnitudes to match |
| BERT | No | Learned | Learned positions adapt their scale during training |
| GPT-2 | No | Learned | Same reason as BERT |
| Llama | No | RoPE (rotation) | RoPE rotates Q/K, doesn't add to embedding |
| Gemma | Yes (√d) | RoPE | Google's choice for tied+scaled embeddings |
| T5 | No | Relative bias | Position is in attention bias, not added to embedding |
The pattern: scaling is mainly needed when fixed (sinusoidal) position embeddings are added to small-initialized token embeddings. Modern LLMs using RoPE don't add position to the embedding at all — they rotate the query and key vectors. No addition means no magnitude mismatch.
Adjust the scaling factor below to see how token and position signals combine. At low scale, position dominates (all tokens look similar in the combined representation). At √d, they balance. At very high scale, token dominates and position information is lost.
Drag the scale factor. Watch the token-to-position ratio and the combined vectors. The sweet spot is where both bars are roughly equal.
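```python
import math

import torch
import torch.nn as nn

class ScaledEmbedding(nn.Module):
    """Original Transformer embedding with sqrt(d) scaling."""

    def __init__(self, vocab_size, d_model, max_len=5000):
        super().__init__()
        self.d_model = d_model
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        # Match the 1/sqrt(d) init assumed in the text (nn.Embedding defaults to std = 1)
        nn.init.normal_(self.tok_embed.weight, std=d_model ** -0.5)
        # Fixed table, registered as a buffer so it follows the module across devices
        self.register_buffer("pos_embed", self._sinusoidal(max_len, d_model))

    @staticmethod
    def _sinusoidal(max_len, d_model):
        # Standard sin/cos position table; assumes d_model is even
        pos = torch.arange(max_len).unsqueeze(1).float()          # (max_len, 1)
        freq = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))        # (d/2,)
        table = torch.zeros(max_len, d_model)
        table[:, 0::2] = torch.sin(pos * freq)
        table[:, 1::2] = torch.cos(pos * freq)
        return table

    def forward(self, token_ids):
        seq_len = token_ids.size(1)
        tok = self.tok_embed(token_ids)              # (B, T, d)
        tok = tok * math.sqrt(self.d_model)          # SCALE!
        pos = self.pos_embed[:seq_len].unsqueeze(0)  # (1, T, d)
        return tok + pos                             # balanced addition
```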
The single line tok * math.sqrt(self.d_model) makes the difference between a model that can barely distinguish tokens and one where token identity and position are balanced from the start.
Everything we've learned — token IDs mapped to vectors, embeddings trained by gradient descent, parameter sharing, magnitude balancing — converges into one central question: what does the learned embedding space actually look like?
Random initialization scatters tokens across the space with no structure. Training sculpts that space: synonyms cluster together, antonyms push apart, and semantic relationships emerge as geometric patterns. The explorer below lets you watch this process unfold in real time.
The canvas shows a 2D projection of a simulated embedding space. Each dot is a token. At step 0 (random initialization), the dots are scattered with no pattern. As training progresses, semantic clusters emerge. Animals drift toward animals. Colors group with colors. The structure you see is the model discovering meaning.
Play with the controls:
Watch random vectors organize into semantic clusters as training progresses. Click a token to inspect its embedding.
In a real language model, embedding training happens implicitly through the language modeling objective: predict the next token. Tokens that appear in similar contexts develop similar embeddings because the gradients push them in similar directions.
Our simulation mimics this with a simplified attraction/repulsion model. Tokens in the same semantic category attract each other (their embeddings move closer). Tokens in different categories repel slightly. The result approximates what real language model training produces.
Three things to notice as training progresses:
The word "transformerization" might appear 3 times in your entire training set. Its embedding gets 3 gradient updates — barely better than random. But break it into subwords: "transform," "er," "ization." Each piece appears thousands of times. The composed representation from well-trained subwords is far better than a single undertrained whole-word embedding.
This is the core insight of subword tokenization, and it's why every modern language model uses it.
Suppose we give every word its own embedding. English has roughly 170,000 words in common use. Add technical terms, names, misspellings, and multilingual support, and you easily hit a million. That's a million rows in the embedding matrix, most of them barely trained.
Worse: a new word the model has never seen has no row in the table at all. Older word-level systems typically mapped every such word to a single generic unknown token, discarding its identity. This is the out-of-vocabulary (OOV) problem, and it plagued NLP for decades.
Byte Pair Encoding (Sennrich et al., 2016) solves both problems with a simple algorithm. Start with a character-level vocabulary (every letter is a token). Then repeatedly merge the most frequent adjacent pair into a new token.
Let's run BPE on a tiny corpus. Our training text:
```text
"the cat sat on the mat the cat"
```
Step 0: Character tokens
['t','h','e',' ','c','a','t',' ','s','a','t',' ','o','n',' ','t','h','e',' ','m','a','t',' ','t','h','e',' ','c','a','t']
Count all adjacent pairs:
| Pair | Count |
|---|---|
| t, h | 3 |
| h, e | 3 |
| a, t | 4 |
| c, a | 2 |
| e, ␣ | 3 |
| t, ␣ | 3 |
| ␣, t | 2 |
| ␣, c | 2 |
| ␣, m | 1 |
| s, a | 1 |
| others | 1 each |
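
The tally can be cross-checked in a few lines. This is a sketch using Python's `collections.Counter` on the same training text; the variable names are illustrative.

```python
from collections import Counter

text = "the cat sat on the mat the cat"
tokens = list(text)  # character-level start, spaces included

# Pair every token with its right-hand neighbor and count each distinct pair.
pair_counts = Counter(zip(tokens, tokens[1:]))

for (left, right), count in pair_counts.most_common(5):
    print(f"{left!r} + {right!r}: {count}")
```

`zip(tokens, tokens[1:])` walks the sequence one step at a time, so overlapping pairs are all counted, which is what BPE needs when it looks for the most frequent adjacent pair.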
Merge 1: Most frequent pair is "a"+"t" (count 4). Create new token "at".
Result: ['t','h','e',' ','c','at',' ','s','at',' ','o','n',' ','t','h','e',' ','m','at',' ','t','h','e',' ','c','at']
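Merge 2: Four pairs are now tied at 3: "t"+"h", "h"+"e", "e"+" ", and "at"+" ". Breaking ties by first occurrence (the rule used for the rest of this walkthrough), we merge "t"+"h". Create "th".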
Result: ['th','e',' ','c','at',' ','s','at',' ','o','n',' ','th','e',' ','m','at',' ','th','e',' ','c','at']
Merge 3: "th"+"e" appears 3 times. Create "the".
Result: ['the',' ','c','at',' ','s','at',' ','o','n',' ','the',' ','m','at',' ','the',' ','c','at']
After just 3 merges, "the" is a single token. High-frequency words collapse quickly. Rare words stay decomposed into common pieces.
Consider the word "unhappiness" (appears rarely) versus its subwords:
| Piece | Frequency | Gradient Updates | Embedding Quality |
|---|---|---|---|
| "unhappiness" (whole word) | ~50 | ~50 | Poor (barely trained) |
| "un" (prefix) | ~500,000 | ~500,000 | Excellent |
| "happi" (stem) | ~100,000 | ~100,000 | Excellent |
| "ness" (suffix) | ~300,000 | ~300,000 | Excellent |
The subword approach gives us three well-trained embeddings instead of one barely-trained one. The transformer's attention layers then compose these subword embeddings into a word-level representation. Attention learns that "un" means negation, "happi" carries the core meaning, and "ness" marks a noun. This compositional understanding transfers to every word with these subwords — "unhelpfulness," "happiness," "sadness" all benefit.
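To make the counting argument concrete, here is a toy tally. It assumes one embedding update per occurrence, which is a simplification since real training updates once per batch. The splits in `segmentation` and the seven-word `corpus` are invented for illustration and are not the output of a trained tokenizer.

```python
from collections import Counter

# Hypothetical hand-written segmentations; a real BPE vocabulary
# would choose its own split points.
segmentation = {
    "unhappiness": ["un", "happi", "ness"],
    "unhelpfulness": ["un", "helpful", "ness"],
    "happiness": ["happi", "ness"],
    "sadness": ["sad", "ness"],
    "unkind": ["un", "kind"],
}

corpus = [
    "happiness", "sadness", "unkind", "unhappiness",
    "happiness", "unhelpfulness", "sadness",
]

# Whole-word vocabulary: each word only trains its own row.
word_updates = Counter(corpus)

# Subword vocabulary: every word that contains a piece trains that piece's row.
subword_updates = Counter(
    piece for word in corpus for piece in segmentation[word]
)

print("whole word 'unhappiness':", word_updates["unhappiness"])
for piece in segmentation["unhappiness"]:
    print(f"subword {piece!r}:", subword_updates[piece])
```

Even in this tiny corpus, the pieces of "unhappiness" collect updates from every other word that shares them, while the whole-word row is touched only when that exact word appears.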
The number of BPE merges determines the vocabulary size. This is a critical hyperparameter with opposing forces:
| | Small V (8K) | Medium V (32K) | Large V (100K) |
|---|---|---|---|
| Tokens per word | 3-5 (many subwords) | 1-2 (fewer subwords) | 1 (often whole word) |
| Sequence length | Long (more tokens per sentence) | Moderate | Short (fewer tokens) |
| Subword frequency | Very high (well-trained) | High | Low for rare tokens |
| Embedding table size | 8K × d (small) | 32K × d (moderate) | 100K × d (large) |
| OOV handling | Never (can spell anything) | Never | Never (but rare tokens = poor) |
| Compute cost | High (attention is O(n²)) | Balanced | Lower per-token |
Most modern LLMs settle on V = 32K-128K as the sweet spot. Llama 1 and Llama 2 use 32K. GPT-4 uses ~100K. Llama 3 moved to 128K to improve multilingual coverage (more languages means more unique subwords needed).
Type any word below to see how it gets tokenized under different vocabulary sizes. The simulation shows which subwords are well-trained (green, high frequency) versus barely trained (red, low frequency). Notice how smaller vocabularies split words into more pieces, but each piece is individually better trained.
Type a word and see how BPE splits it at different vocabulary sizes. Green subwords are well-trained; red ones are rare.
```python
# Simplified BPE from scratch

def get_pairs(tokens):
    """Count all adjacent pairs."""
    pairs = {}
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        pairs[pair] = pairs.get(pair, 0) + 1
    return pairs


def merge(tokens, pair, new_token):
    """Replace every occurrence of pair with new_token."""
    result = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i + 1] == pair[1]:
            result.append(new_token)
            i += 2
        else:
            result.append(tokens[i])
            i += 1
    return result


# Run BPE
text = "the cat sat on the mat"
tokens = list(text)  # start with characters
num_merges = 10
for _ in range(num_merges):
    pairs = get_pairs(tokens)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    new_tok = best[0] + best[1]
    tokens = merge(tokens, best, new_tok)
    print(f"Merge: '{best[0]}' + '{best[1]}' -> '{new_tok}' Tokens: {tokens}")
```
```python
# Production tokenizers: tiktoken (GPT) and sentencepiece (Llama)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("transformerization")
print([enc.decode([t]) for t in tokens])  # ['transform', 'er', 'ization']
print(len(tokens))                         # 3 subword tokens

# Compare: a common word is a single token
print(enc.encode("the"))    # [1820]: one token
print(enc.encode("hello"))  # [15339]: one token
```
Not all embeddings are created equal. The number of gradient updates a token receives during training directly determines how well its embedding represents its meaning:
| Token Type | Training Frequency | Embedding Quality | Example |
|---|---|---|---|
| Stop words | Billions | Excellent | "the", "is", "and" |
| Common subwords | Millions | Very good | "ing", "tion", "pre" |
| Common words | Hundreds of thousands | Good | "transformer", "network" |
| Rare subwords | Thousands | Adequate | "zyg", "qph" |
| Single characters | Millions (as fallback) | Good (but carry less meaning) | "x", "z", "7" |
This is why the worst case for subword tokenization is still decent: even a word split into individual characters uses embeddings that have each seen millions of training examples. The model can still compose meaning from well-trained character embeddings — it just takes more layers of attention to do so.
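One way to see why frequency matters is to look at which rows of the table actually receive gradient. The sketch below uses PyTorch's `nn.Embedding` with an assumed 6-row table and a stand-in loss (the sum of the looked-up vectors); only the routing of gradient is the point, not the loss itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(num_embeddings=6, embedding_dim=4)

# Token 0 appears three times, token 2 once; tokens 1, 3, 4, 5 never appear.
ids = torch.tensor([0, 0, 0, 2])

# Stand-in loss: the lookup's backward pass only routes gradient
# to the rows that were indexed, whatever the loss is.
loss = emb(ids).sum()
loss.backward()

# Total gradient received by each row of the table.
grad_per_row = emb.weight.grad.sum(dim=1)
print(grad_per_row.tolist())  # [12.0, 0.0, 4.0, 0.0, 0.0, 0.0]
```

Row 0 accumulates three contributions, row 2 accumulates one, and the rows for tokens that never appeared get exactly zero gradient from this batch. A token that is rare in the corpus is in that last group for almost every step.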
You've learned six different ways to build embedding layers: small vocabularies with tied weights, large vocabularies with untied weights, subword BPE, character-level, and two patch sizes for vision. Each has tradeoffs in parameter count, embedding quality, sequence length, and compute cost. But which one wins on a real task?
That depends entirely on the task. The arena below lets you race all six strategies head-to-head across four different challenges. No single strategy dominates everywhere — that's the whole point of understanding the tradeoffs.
| Strategy | V | d | Params | Tokens/Input |
|---|---|---|---|---|
| Small V + Tied | 8,000 | 768 | 6.1M (shared) | ~40 (long sequences) |
| Large V + Untied | 128,000 | 4,096 | 1,049M (two matrices) | ~15 (short sequences) |
| Subword BPE (32K) | 32,000 | 768 | 24.6M | ~25 (balanced) |
| Character-Level | 256 | 768 | 0.2M | ~120 (very long) |
| Patch 16 (ViT) | N/A | 768 | 0.6M (projection) | 196 patches |
| Patch 32 (ViT) | N/A | 768 | 2.4M (projection) | 49 patches |
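
The parameter and patch counts in the table follow from simple arithmetic. The sketch below reproduces them; the function names are illustrative, the patch projection ignores its bias term, and the 224×224 input resolution is an assumption (it is the size that yields 196 and 49 patches).

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Entries in a V x d table; doubled when the output head keeps its own matrix."""
    table = vocab_size * d_model
    return table if tied else 2 * table


def patch_projection_params(patch_size, d_model, channels=3):
    """Weights of a linear map from one flattened patch to d_model (bias ignored)."""
    return patch_size * patch_size * channels * d_model


def num_patches(image_size, patch_size):
    """Non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2


IMAGE_SIZE = 224  # assumption: input resolution implied by the patch counts

counts = {
    "Small V + Tied": embedding_params(8_000, 768, tied=True),
    "Large V + Untied": embedding_params(128_000, 4_096, tied=False),
    "Subword BPE (32K)": embedding_params(32_000, 768),
    "Character-Level": embedding_params(256, 768),
    "Patch 16 (ViT)": patch_projection_params(16, 768),
    "Patch 32 (ViT)": patch_projection_params(32, 768),
}

for name, n in counts.items():
    print(f"{name:<18} {n / 1e6:8.1f}M")

print(num_patches(IMAGE_SIZE, 16), num_patches(IMAGE_SIZE, 32))
```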
Each task favors different strengths. Watch which strategies rise and fall:
Select a task and watch the six strategies compete. Bars show simulated performance scores (higher = better). Click a bar for details on why each strategy scores the way it does.
No strategy wins every task. That's the deepest lesson here:
Everything from this lesson, compressed into a reference you can come back to.
| Component | What It Is | Shape | Key Fact |
|---|---|---|---|
| Token Embedding | Lookup table: integer → vector | (V, d) | No multiplication — just indexing |
| Position Embedding | Lookup table: position → vector | (max_len, d) | Added to token embedding (not concatenated) |
| Segment Embedding | Lookup table: segment ID → vector | (2, d) | BERT-specific; GPT and Llama don't use it |
| Patch Embedding | Conv2d or Linear projection | (patch_dim, d) | Equivalent to Conv2d(3, d, kernel_size=p, stride=p) |
| Weight Tying | Share E between input and output | Saves V×d params | lm_head.weight = embed.weight (same tensor) |
| Embedding Scaling | Multiply token emb by √d | Scalar multiply | Only needed with fixed sinusoidal positions |
| [CLS] Token | Learnable classification token | (1, d) | Prepended to patch sequence in ViT |
```python
import math

import torch
import torch.nn as nn

# 1. Basic token embedding
emb = nn.Embedding(num_embeddings=32000, embedding_dim=768)
vectors = emb(torch.tensor([42, 1337]))  # shape: (2, 768)

# 2. Weight tying
lm_head = nn.Linear(768, 32000, bias=False)
lm_head.weight = emb.weight  # same tensor in memory

# 3. Embedding scaling (original Transformer)
ids = torch.tensor([[42, 1337, 7]])  # example batch, shape (1, 3)
scaled = emb(ids) * math.sqrt(768)   # match sinusoidal magnitude

# 4. Patch embedding (ViT)
img = torch.randn(1, 3, 224, 224)    # example batch: one RGB image
patch_emb = nn.Conv2d(3, 768, kernel_size=16, stride=16)
patches = patch_emb(img).flatten(2).transpose(1, 2)  # (B, 196, 768)

# 5. BERT-style combined embedding
pos_emb = nn.Embedding(512, 768)     # learned positions, max_len = 512
seg_emb = nn.Embedding(2, 768)       # segment A / segment B
token_ids = ids
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # (1, seq_len)
segment_ids = torch.zeros_like(token_ids)                 # all segment A
x = emb(token_ids) + pos_emb(positions) + seg_emb(segment_ids)

# 6. Subword tokenization (production)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("transformerization")  # list of integer token IDs
```
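The patch-embedding row in the reference table says a strided convolution and a linear projection are the same operation. Here is a hedged check of that claim in PyTorch, assuming a 224×224 RGB input and patch size 16; the variable names are illustrative. It cuts the image into patches by hand, copies the convolution's weights into an `nn.Linear`, and compares the two outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p, d = 16, 768
img = torch.randn(2, 3, 224, 224)

# Route 1: strided convolution, one output position per patch.
conv = nn.Conv2d(3, d, kernel_size=p, stride=p)
via_conv = conv(img).flatten(2).transpose(1, 2)   # (2, 196, 768)

# Route 2: cut non-overlapping p x p patches and flatten each one in
# (channel, row, col) order, matching the layout of the conv kernel.
B, C, H, W = img.shape
patches = img.unfold(2, p, p).unfold(3, p, p)     # (B, C, H/p, W/p, p, p)
patches = patches.permute(0, 2, 3, 1, 4, 5)       # (B, H/p, W/p, C, p, p)
patches = patches.reshape(B, -1, C * p * p)       # (B, 196, 768)

# Reuse the conv parameters as a plain linear layer.
linear = nn.Linear(C * p * p, d)
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(d, -1))
    linear.bias.copy_(conv.bias)
via_linear = linear(patches)

# Compare with a loose tolerance to allow for float32 summation order.
print(torch.allclose(via_conv, via_linear, atol=1e-3))
```

The convolution is usually preferred in practice because it avoids materializing the patch tensor, but the learned parameters are the same `patch_dim × d` matrix either way.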
| Model | V | d | Emb Params | Tied? | Position |
|---|---|---|---|---|---|
| GPT-2 Small | 50,257 | 768 | 38.6M | Yes | Learned |
| BERT-Base | 30,522 | 768 | 23.4M | Yes | Learned |
| Llama 2 7B | 32,000 | 4,096 | 131M | No | RoPE |
| Llama 3 8B | 128,256 | 4,096 | 525M | No | RoPE |
| ViT-B/16 | N/A | 768 | 0.6M | N/A | Learned |
Embedding layers are the first thing that happens in every neural network that processes discrete inputs. Understanding them unlocks everything downstream: