Embedding Layers — From Integers to Meaning

Chapter 0: From Integers to Meaning

The word "cat" enters a transformer as the integer 3782. The word "dog" enters as 5291. A neural network needs to do math with these — add them, multiply them, compute distances. But 3782 + 5291 = 9073, which means nothing. How do you turn arbitrary integers into vectors that capture meaning?

This isn't just a formatting problem. It's a representation problem. Integer IDs are categorical labels — they have no magnitude, no direction, no distance. Token 3782 isn't "closer" to token 3783 than to token 5291. The numbers were assigned arbitrarily when the tokenizer was built. They carry zero semantic information.

But neural networks live in continuous space. They multiply, add, and differentiate. They need inputs where "nearby" means "similar." We need a mapping from the discrete world of token IDs into a continuous vector space where similar meanings live at similar coordinates.

That mapping is called an embedding layer. And it's deceptively simple — it's just a table lookup.

The Lookup Table Idea

Imagine a spreadsheet. Each row is a word in your vocabulary. Each row has d columns — say, 8 numbers. The word "cat" is row 3782. When you need a vector for "cat," you just go to row 3782 and read off the 8 numbers. No multiplication. No activation function. Just indexing.

Those 8 numbers are the embedding vector for "cat." They're learned parameters — they start random and get adjusted by gradient descent during training. After training, words with similar meanings end up with similar vectors. "cat" and "dog" are close. "cat" and "parliamentary" are far apart.

Let's see this in action.

Embedding Lookup

Click a word to see its embedding vector (d=8 dimensions). Similar words have similar bar patterns. Below: a 2D projection showing how words cluster by meaning.

Click "cat" and then "dog." Notice how their bar charts look similar — both have high values in some of the same dimensions and low values in others. Now click "the." The pattern is completely different. That's because "cat" and "dog" are both animals, while "the" is a function word with entirely different grammatical and semantic properties.

The 2D projection at the bottom makes this even clearer. "cat," "dog," and "kitten" cluster together. "house" and "building" form another cluster. "the" sits alone. These clusters emerge naturally from training — the model learned that words used in similar contexts should have similar vectors.

Hand Calculation: Embedding Lookup

Let's trace this by hand with a tiny vocabulary. We have 6 words, and each embedding has 3 dimensions (d=3).

Setup. Vocabulary size V=6, embedding dimension d=3. The embedding matrix E has shape (6, 3) — one row per token, 3 numbers per row. That's V × d = 6 × 3 = 18 learnable parameters.

Here's our embedding matrix (values learned from training):

Token ID	Word	E[id][0]	E[id][1]	E[id][2]
0	the	-0.12	0.05	0.88
1	cat	0.72	-0.41	0.15
2	dog	0.68	-0.38	0.22
3	sat	-0.55	0.62	-0.03
4	house	0.31	0.15	-0.72
5	on	-0.08	0.11	0.79

Step 1: Token "cat" has ID 1. The embedding is row 1 of the matrix:

E[1] = [0.72, -0.41, 0.15]

That's it. No multiplication. No bias. No activation function. Just: go to row 1, read off 3 numbers. This is the entire operation.

Step 2: Token "dog" has ID 2. Its embedding is row 2:

E[2] = [0.68, -0.38, 0.22]

Step 3: How similar are "cat" and "dog"? Compute the Euclidean distance:

d₀ = (0.72 - 0.68)² = 0.04² = 0.0016
d₁ = (-0.41 - (-0.38))² = (-0.03)² = 0.0009
d₂ = (0.15 - 0.22)² = (-0.07)² = 0.0049
Distance = √(0.0016 + 0.0009 + 0.0049) = √(0.0074) = 0.086

Now compare "cat" to "the":

d₀ = (0.72 - (-0.12))² = 0.84² = 0.7056
d₁ = (-0.41 - 0.05)² = (-0.46)² = 0.2116
d₂ = (0.15 - 0.88)² = (-0.73)² = 0.5329
Distance = √(0.7056 + 0.2116 + 0.5329) = √(1.4501) = 1.204
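
The two distances above can be checked in a few lines. This is a minimal sketch that assumes NumPy and copies three rows from the table; the variable names are illustrative only.

```python
import numpy as np

# Rows copied from the hand-calculation table (d = 3)
E = np.array([
    [-0.12,  0.05,  0.88],   # 0: the
    [ 0.72, -0.41,  0.15],   # 1: cat
    [ 0.68, -0.38,  0.22],   # 2: dog
])

the, cat, dog = E[0], E[1], E[2]

# Euclidean distance = square root of the summed squared differences
dist_cat_dog = float(np.linalg.norm(cat - dog))
dist_cat_the = float(np.linalg.norm(cat - the))

print(round(dist_cat_dog, 3))                 # 0.086
print(round(dist_cat_the, 3))                 # 1.204
print(round(dist_cat_the / dist_cat_dog, 1))  # 14.0
```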

"cat" is 14× closer to "dog" than to "the." The embedding space has organized itself so that semantic similarity maps to geometric proximity. This is the entire point of embeddings.

From Scratch, Then PyTorch

python
# Embedding from scratch — it's literally a matrix with indexed rows
import numpy as np

class Embedding:
    def __init__(self, vocab_size, embed_dim):
        # Random initialization — training will adjust these
        self.weight = np.random.randn(vocab_size, embed_dim) * 0.02

    def forward(self, token_ids):
        # That's it. Index into rows. No multiplication.
        return self.weight[token_ids]

# Usage
emb = Embedding(vocab_size=6, embed_dim=3)
ids = np.array([1, 2, 0])  # "cat", "dog", "the"
vectors = emb.forward(ids)     # shape: (3, 3)
# vectors[0] = emb.weight[1]  → cat's embedding
# vectors[1] = emb.weight[2]  → dog's embedding
# vectors[2] = emb.weight[0]  → the's embedding

python
# PyTorch equivalent — nn.Embedding wraps the same idea
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=6, embedding_dim=3)
print(emb.weight.shape)  # torch.Size([6, 3])

ids = torch.tensor([1, 2, 0])
vectors = emb(ids)  # shape: (3, 3) — same as weight[ids]

# Proof: embedding IS just indexing
assert torch.allclose(emb(ids), emb.weight[ids])  # True

An embedding layer is NOT a linear layer. A linear layer computes W × x + b for a continuous input x. An embedding layer takes an integer index and returns the corresponding row of a lookup table. No multiplication — just indexing. Under the hood, nn.Embedding(V, d) stores a matrix of shape (V, d) and returns matrix[token_id]. Mathematically, it's equivalent to multiplying a one-hot vector by the weight matrix, but no implementation actually does that — it would waste memory and compute on a vector that's all zeros except for a single 1.
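
The one-hot equivalence mentioned above is easy to confirm numerically. The sketch below assumes NumPy and uses a random matrix as a stand-in for a learned table.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3
E = rng.standard_normal((V, d))   # stand-in for a learned embedding table

token_id = 2
one_hot = np.zeros(V)
one_hot[token_id] = 1.0

via_matmul = one_hot @ E   # V multiply-adds per output value, almost all against zeros
via_lookup = E[token_id]   # reads d numbers directly

assert np.allclose(via_matmul, via_lookup)
```

Both routes give the same vector; the lookup simply skips the V - 1 multiplications by zero.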

What mathematical operation does an embedding layer perform?

A matrix multiplication: W × x + b
A convolution over the input tokens
A table lookup: it indexes into a matrix of learned parameters and returns the row for the token ID
A softmax normalization of the token ID

Chapter 1: Token Embeddings in Detail

The embedding table is deceptively simple — it's just a big matrix. But it's also one of the largest parameter blocks in the entire model. For Llama 2 7B, the embedding matrix has 131 million parameters. That's more parameters than all of GPT-2 Small (124M).

Let's understand why it's so big, how it trains, and what can go wrong.

The Size of the Table

The token embedding matrix E has shape (V, d), where V is the vocabulary size (number of unique tokens) and d is the embedding dimension (length of each vector). Every entry is a learnable floating-point number.

Parameter count = V × d. Memory = V × d × bytes_per_param.

Memory arithmetic matters. In FP16 (2 bytes per number), each embedding vector of dimension d costs 2d bytes. Multiply by V tokens and you get the total embedding memory. This is a fixed cost: it doesn't depend on sequence length or batch size. It's just the weight of the table sitting in GPU memory. The MB and GB figures in this chapter use binary units (1 MB = 2^20 bytes).

Let's compute this for real models:

Model	V	d	Params	FP16 Memory
GPT-2 Small	50,257	768	38.6M	73 MB
BERT-Base	30,522	768	23.4M	45 MB
Llama 2 7B	32,000	4,096	131.1M	250 MB
Llama 3 8B	128,256	4,096	525.3M	1,002 MB
GPT-4 (est.)	100,000	8,192	819.2M	1,562 MB

Look at Llama 3 vs Llama 2. Same embedding dimension, but Llama 3 quadrupled the vocabulary (from 32K to 128K tokens). That quadrupled the embedding table from 250 MB to about 1 GB. With d held fixed, table size grows linearly with vocabulary size.

Why would anyone want a bigger vocabulary? Because larger vocabularies encode text more efficiently — fewer tokens per sentence, shorter sequences, lower inference cost. But the embedding table gets proportionally larger. This is a fundamental tradeoff in model design.
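
The size arithmetic can be reproduced with a small helper. This is a sketch: the function name `embedding_table_size` is made up for illustration, and it reports binary megabytes (2^20 bytes), the unit the table's figures correspond to.

```python
def embedding_table_size(vocab_size, embed_dim, bytes_per_param=2):
    """Return (parameter count, size in MiB) for a (V, d) embedding table."""
    params = vocab_size * embed_dim
    mib = params * bytes_per_param / 2**20
    return params, mib

# (V, d) pairs taken from the table above; FP16 means 2 bytes per parameter
models = [
    ("GPT-2 Small", 50_257, 768),
    ("BERT-Base", 30_522, 768),
    ("Llama 2 7B", 32_000, 4_096),
    ("Llama 3 8B", 128_256, 4_096),
]

for name, V, d in models:
    params, mib = embedding_table_size(V, d)
    print(f"{name:12s} {params / 1e6:7.1f}M params  {mib:7.1f} MiB")
```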

Embedding Table Size Explorer

Adjust vocabulary size and embedding dimension. The heatmap shows the embedding matrix (each cell is a parameter). Below: total parameter count and memory.

How Training Works

At initialization, every row of the embedding matrix is filled with small random numbers (typically drawn from a normal distribution with std = 0.02). At this point, "cat" and "dog" have random, unrelated vectors. The model can't tell them apart any more than it can tell them from "parliament."

During training, each forward pass selects a subset of rows, one for each token in the batch. The loss gradient flows backward through the model and eventually reaches the embedding layer. But here's the key: only the rows that were selected receive a nonzero gradient. If "cat" (ID 3782) appeared in the batch, row 3782 gets updated. If "dog" (ID 5291) did not appear, row 5291 is untouched.

This is fundamentally different from a linear layer, where every weight participates in every forward pass. In an embedding layer, each row trains independently — only when its token appears in a batch.

Forward Pass

Token IDs [4, 1, 0, 2] → index into E → vectors (4, d)

↓

Loss Computed

Cross-entropy or other loss → single scalar

↓

Backward Pass

Gradient ∂L/∂E[id] computed for ONLY the 4 selected rows

↓

Update

E[4] -= lr · grad[4], E[1] -= lr · grad[1], ... (only 4 rows touched)
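
The four steps above can be sketched in NumPy. This assumes plain SGD with no weight decay or momentum, and uses a random `upstream` array as a stand-in for the gradient a real model would send back to the embedding layer.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, lr = 6, 3, 0.1
E = rng.standard_normal((V, d)) * 0.02
E_before = E.copy()

token_ids = np.array([4, 1, 0, 2])                   # the batch from the diagram
upstream = rng.standard_normal((len(token_ids), d))  # stand-in for dL/d(vectors)

# Scatter the upstream gradient into the rows that were looked up.
# np.add.at accumulates correctly if a token ID appears more than once.
grad_E = np.zeros_like(E)
np.add.at(grad_E, token_ids, upstream)

E -= lr * grad_E   # plain SGD step

changed = np.any(E != E_before, axis=1)
print(changed)     # rows 0, 1, 2, 4 changed; rows 3 and 5 did not
```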

The Rare Token Problem

This selective updating creates a problem. Tokens that appear frequently — "the," "is," "a" — get updated thousands of times per epoch. Their embeddings are well-trained, nuanced, and stable. They live in exactly the right part of the vector space.

But rare tokens — "pneumonoultramicroscopicsilicovolcanoconiosis," someone's unusual name, a niche technical term — might appear 5 times in the entire training set. Five gradient updates. Their embeddings stay close to their random initialization, carrying almost no learned information.

Rare tokens DON'T get good embeddings. If a token appears 5 times in the training set, its embedding row gets 5 gradient updates. A token appearing 5 million times gets 5 million updates. This is why subword tokenization (BPE, SentencePiece) matters: it keeps most tokens from being too rare by breaking rare words into common subwords. The word "unhappiness" might become ["un", "happiness"] (the exact split depends on the tokenizer), and both subwords appear frequently enough to have good embeddings.
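
To make the imbalance concrete, here is a small sketch that counts how often each token occurs in a toy corpus. The corpus is hypothetical and split on whitespace rather than by a real tokenizer; the occurrence count is only a rough proxy for how much gradient signal each embedding row receives.

```python
from collections import Counter

# Hypothetical toy corpus, split into whole words for simplicity
corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Each occurrence of a token sends one gradient signal to its embedding row
updates_per_token = Counter(corpus)

for token, count in updates_per_token.most_common():
    print(f"{token:4s} {count}")
```

Even in thirteen words, "the" receives four times as many updates as "rug"; in a real corpus the gap spans many orders of magnitude.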

Gradient Flow Through a Lookup

How does gradient descent work on a lookup table? Think of it this way: mathematically, looking up row i of matrix E is equivalent to computing the product e_i^T · E, where e_i is a one-hot vector (all zeros except position i). No implementation actually materializes the one-hot vector — that would waste memory — but the gradient math works out the same way.

The gradient of the loss with respect to E[i] is simply the gradient that flowed back to this layer, summed over every position where token i appeared. For rows that weren't selected, the gradient is zero (they weren't used, so changing them can't change the loss). Because optimizers like Adam keep per-parameter statistics, rare tokens end up with very different update patterns from frequent ones.

python
# Demonstrating gradient flow through embedding lookup
import torch
import torch.nn as nn

V, d = 6, 3
emb = nn.Embedding(V, d)

# Forward: look up tokens 1 ("cat") and 2 ("dog")
ids = torch.tensor([1, 2])
vecs = emb(ids)           # shape: (2, 3)

# Fake loss: sum all embedding values
loss = vecs.sum()
loss.backward()

# Only rows 1 and 2 got gradients
print(emb.weight.grad[0])  # tensor([0., 0., 0.])  ← "the" untouched
print(emb.weight.grad[1])  # tensor([1., 1., 1.])  ← "cat" got gradient
print(emb.weight.grad[2])  # tensor([1., 1., 1.])  ← "dog" got gradient
print(emb.weight.grad[3])  # tensor([0., 0., 0.])  ← "sat" untouched
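
The same gradient pattern falls out of the one-hot formulation. The NumPy sketch below builds the one-hot matrix explicitly (something real implementations avoid) and applies the chain rule for Y = X @ E. It also repeats a token ID to show that gradients accumulate when a token appears more than once.

```python
import numpy as np

V, d = 6, 3
token_ids = np.array([1, 2, 1])      # "cat" appears twice

# One-hot matrix X of shape (n_tokens, V); the lookup equals X @ E
X = np.zeros((len(token_ids), V))
X[np.arange(len(token_ids)), token_ids] = 1.0

# For loss = sum of all looked-up values, the upstream gradient is all ones
upstream = np.ones((len(token_ids), d))

# Chain rule for Y = X @ E gives dL/dE = X.T @ dL/dY
grad_E = X.T @ upstream

print(grad_E)
# Row 1 is [2, 2, 2] (two occurrences), row 2 is [1, 1, 1], all other rows are zero
```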

Weight tying. Many models (GPT-2, the original T5, Gemma) share the embedding matrix with the output projection layer; others, such as Llama 2 7B, keep the two separate. With tying, the same (V, d) matrix is used to convert token IDs to vectors at the input AND to convert hidden states back to vocabulary logits at the output. This halves the combined size of the input embedding and output projection and acts as a regularizer: the model must find embeddings that work well in both directions. This is called weight tying or shared embeddings.

For a model with vocabulary V=50,000 and embedding dimension d=2,048 stored in FP16, how much memory does the embedding table use?

~50 MB
~195 MB (50,000 × 2,048 × 2 bytes = 204,800,000 bytes)
~1.6 GB
~10 GB

Chapter 2: Combining Embeddings — The Input Recipe

A single token needs more than just its word meaning. It also needs to know where it is in the sequence (position 0? position 47?) and sometimes which segment it belongs to (sentence A? sentence B?). A transformer doesn't process tokens one at a time like an RNN — it sees the whole sequence at once. Without explicit position information, it can't tell "dog bites man" from "man bites dog."

The solution: add multiple embedding vectors together. Each embedding encodes a different aspect of the input. They all live in the same d-dimensional space, and the model learns to disentangle them.

BERT's Three-Part Recipe

BERT computes the input to its transformer stack as:

input = E_token[token_id] + E_position[pos] + E_segment[seg_id]

Three separate lookup tables. Three separate row indices. Three vectors of the same dimension d. Added element-wise.

Table	Shape	What it encodes	How many rows?
E_token	(V, d)	Word identity ("cat", "sat", "on")	V = 30,522 (BERT)
E_position	(max_len, d)	Position in sequence (0th, 1st, 47th)	max_len = 512 (BERT)
E_segment	(2, d)	Which sentence (A or B)	2 (just two segments)

Total embedding parameters for BERT-Base (d=768): 30,522 × 768 + 512 × 768 + 2 × 768 = 23,440,896 + 393,216 + 1,536 = 23,835,648. The token table dominates — it's 98.3% of the embedding parameters.
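
The parameter total is simple enough to verify directly. A short sketch in plain Python, using the row counts from the table above:

```python
d = 768  # BERT-Base embedding dimension

# Rows in each lookup table
tables = {"token": 30_522, "position": 512, "segment": 2}

params = {name: rows * d for name, rows in tables.items()}
total = sum(params.values())
token_share = params["token"] / total

print(params)                 # per-table parameter counts
print(total)                  # 23835648
print(f"{token_share:.1%}")   # 98.3%
```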

GPT's Two-Part Recipe

GPT-2 and GPT-3 use a simpler recipe — no segment embedding, because they don't do sentence-pair tasks:

input = E_token[token_id] + E_position[pos]

Modern models like Llama and Mistral go even further. They use Rotary Position Embeddings (RoPE) instead of a learned position table. RoPE injects position information inside the attention mechanism, not at the input. This means the only embedding table is E_token; the position information is handled differently (covered in the Positional Encoding lesson).

Hand Calculation: Adding Embeddings

Let's trace BERT's recipe by hand with d=4.

Setup. Token "cat" has ID 2. It sits at position 3 in the sequence. It belongs to segment 0 (sentence A). Each embedding table has d=4 columns.

Step 1: Look up each embedding.

E_token[2] = [0.31, -0.72, 0.15, 0.88]
E_position[3] = [0.05, 0.12, -0.33, 0.01]
E_segment[0] = [-0.02, 0.08, 0.11, -0.05]

Step 2: Add element-wise.

dim 0: 0.31 + 0.05 + (-0.02) = 0.34
dim 1: -0.72 + 0.12 + 0.08 = -0.52
dim 2: 0.15 + (-0.33) + 0.11 = -0.07
dim 3: 0.88 + 0.01 + (-0.05) = 0.84

input = [0.34, -0.52, -0.07, 0.84]

This single vector now encodes three things: the word is "cat," it's at position 3, and it belongs to sentence A. All compressed into 4 numbers. The transformer layers that follow will learn to disentangle these signals.
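
The element-wise sum from the hand calculation, as a NumPy sketch with the three vectors copied from Step 1:

```python
import numpy as np

tok = np.array([0.31, -0.72, 0.15, 0.88])    # E_token[2]
pos = np.array([0.05, 0.12, -0.33, 0.01])    # E_position[3]
seg = np.array([-0.02, 0.08, 0.11, -0.05])   # E_segment[0]

# Same dimension d = 4, so the vectors add element-wise
x = tok + pos + seg

print(np.round(x, 2))   # [ 0.34 -0.52 -0.07  0.84]
```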

Embedding Addition

Three embedding vectors for a single token. Toggle each component on/off to see its contribution. Change the position slider to watch the position embedding shift while the token embedding stays fixed.

Drag the position slider and watch. The token embedding (orange bars) stays fixed — "cat" is "cat" regardless of where it appears. The position embedding (teal bars) changes — position 0 has a different pattern than position 7. The sum below shifts accordingly.

Toggle off the position component. Now the sum looks almost identical to just the token embedding. Toggle off the token component and leave only position. Now the sum looks nothing like "cat" — it's pure positional information. The model learns to use different dimensions for different types of information, making addition work despite all three signals occupying the same vector space.

Why Add Instead of Concatenate?

The obvious alternative to addition is concatenation. Instead of adding three d-dimensional vectors, concatenate them into one 3d-dimensional vector. This keeps the information perfectly separated — no interference between token meaning and position.

So why doesn't anyone do this?

Why ADD instead of CONCATENATE? Concatenation would triple the dimension (d + d + d = 3d for BERT). Every subsequent layer (every attention head, every feed-forward network) would need to operate on 3d-dimensional inputs instead of d. If the model kept that width throughout, activations would triple and each d × d weight matrix would grow about 9×, for the entire model and not just the embedding layer. Addition keeps the dimension at d, which means zero changes to the rest of the architecture. Through training, the model learns to use different subsets of dimensions for different information types: some dimensions primarily encode position, others primarily encode token identity.
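
A rough way to see the cost of widening the model is to count the weights in a single square projection. This sketch assumes the concatenated model would keep width 3d in every layer; `linear_cost` is an illustrative helper, not part of any library.

```python
def linear_cost(d_in, d_out):
    """Weights (and multiply-adds per token) in a bias-free linear layer."""
    return d_in * d_out

d = 768
added = linear_cost(d, d)            # addition keeps the width at d
concat = linear_cost(3 * d, 3 * d)   # concatenation widens every layer to 3d

print(added, concat, concat / added)   # 589824 5308416 9.0
```

Per-token activations grow 3× under that assumption, while each square weight matrix grows 9×.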

There's a deeper reason too. Addition is information-lossy in theory but sufficient in practice. With d=768 or d=4096, there are enough dimensions for the model to allocate different "channels" to different signal types. Empirical analyses of trained models suggest that position embeddings and token embeddings tend to be nearly orthogonal: they learn to use different directions in the high-dimensional space, minimizing interference.

From Scratch: BERT's Embedding Layer

python
import torch
import torch.nn as nn

class BERTEmbedding(nn.Module):
    def __init__(self, vocab_size, max_len, d_model, n_segments=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb   = nn.Embedding(max_len, d_model)
        self.seg_emb   = nn.Embedding(n_segments, d_model)
        self.norm       = nn.LayerNorm(d_model)  # BERT normalizes after adding
        self.drop       = nn.Dropout(0.1)

    def forward(self, token_ids, segment_ids):
        # token_ids: (batch, seq_len)  — integer token IDs
        # segment_ids: (batch, seq_len) — 0 or 1
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)

        # Three lookups, one addition
        x = self.token_emb(token_ids)    # (batch, seq, d)
        x = x + self.pos_emb(positions)  # broadcast: (seq, d) → (batch, seq, d)
        x = x + self.seg_emb(segment_ids)
        return self.drop(self.norm(x))

# Usage
emb = BERTEmbedding(vocab_size=30522, max_len=512, d_model=768)
tokens = torch.randint(0, 30522, (2, 128))  # batch=2, seq=128
segs   = torch.zeros(2, 128, dtype=torch.long)
out    = emb(tokens, segs)  # (2, 128, 768)

python
# HuggingFace: it's already inside the model
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-uncased")

# The three embedding tables live here:
print(model.embeddings.word_embeddings.weight.shape)      # (30522, 768)
print(model.embeddings.position_embeddings.weight.shape)  # (512, 768)
print(model.embeddings.token_type_embeddings.weight.shape)# (2, 768)

In BERT's input embedding, what three types of information are combined?

Query, Key, and Value vectors
Token embedding (word identity), position embedding (where in the sequence), and segment embedding (sentence A or B)
Input embedding, hidden state, and output logits
Encoder embedding, decoder embedding, and cross-attention embedding

Chapter 3: Patch Embeddings — Images as Token Sequences

Transformers process sequences. Text is a sequence of tokens — that's natural. But images aren't sequences. They're 2D grids of pixels. How do you feed an image into a transformer?

You could treat every pixel as a token. A 224×224 RGB image has 50,176 pixels (150,528 values across the three color channels). But attention is O(n²) in sequence length, so computing attention over 50K tokens is impractical: that's about 2.5 billion attention scores per head, per layer.

The Vision Transformer (ViT) solved this with a simple trick: chop the image into a grid of non-overlapping patches, treat each patch as a "word," and run the same transformer architecture. A 224×224 image with 16×16 patches gives 196 "tokens" — a manageable sequence that attention can process.

The Mechanics

Here's exactly what happens, step by step:

Image

(3, 224, 224) — RGB, 224 pixels per side

↓

Split into Patches

224 ÷ 16 = 14 → 14 × 14 = 196 patches

↓

Flatten Each Patch

Each patch: 16 × 16 × 3 = 768 pixel values

↓

Linear Projection

768 → d_model via learned weight matrix

↓

Sequence of Embeddings

(196, d_model) — same format as text tokens

Notice the coincidence: for ViT-Base with 16×16 patches, each flattened patch has 16 × 16 × 3 = 768 values, exactly the embedding dimension d=768. So the linear projection is a square matrix (768×768). This holds only for that configuration: ViT-Large keeps 16×16 patches but uses d=1024, so its projection is 768×1024.

Hand Calculation: A Tiny Image

Let's work through a tiny example. Grayscale image (1 channel), 6×6 pixels, patch size 3×3.

Setup. Image: 6×6, 1 channel. Patch size: 3×3. That gives 6/3 = 2 rows and 2 columns of patches = 4 patches total. Each patch has 3 × 3 × 1 = 9 pixel values. We'll project to d=4.

Here's our 6×6 image (pixel values 0-9):

1	2	3	7	8	9
4	5	0	6	5	4
7	8	1	3	2	1
2	3	4	8	7	6
5	6	7	5	4	3
8	9	0	2	1	0

Patch 0 (top-left 3×3 block):

Pixels: [[1, 2, 3], [4, 5, 0], [7, 8, 1]]
Flattened: [1, 2, 3, 4, 5, 0, 7, 8, 1] — a 9-dimensional vector

Patch 1 (top-right):

Pixels: [[7, 8, 9], [6, 5, 4], [3, 2, 1]]
Flattened: [7, 8, 9, 6, 5, 4, 3, 2, 1]

Patch 2 (bottom-left):

Flattened: [2, 3, 4, 5, 6, 7, 8, 9, 0]

Patch 3 (bottom-right):

Flattened: [8, 7, 6, 5, 4, 3, 2, 1, 0]

Linear projection: Weight matrix W is (9, 4). Let's project Patch 0:

For simplicity, suppose the projection of [1, 2, 3, 4, 5, 0, 7, 8, 1] through W gives us [0.42, -0.18, 0.73, 0.05]. That 4D vector is Patch 0's patch embedding. It's now a "token" that attention can process, just like the word "cat" in a language model.

After projecting all 4 patches, we have a sequence of length 4, each element a d=4 vector. The transformer processes this exactly like a 4-token text sequence.
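
The patch extraction above can be reproduced with a reshape and a transpose. This is a NumPy sketch; the projection matrix `W` is random and purely hypothetical, so the projected values will not match the illustrative [0.42, -0.18, 0.73, 0.05] from the text.

```python
import numpy as np

img = np.array([
    [1, 2, 3, 7, 8, 9],
    [4, 5, 0, 6, 5, 4],
    [7, 8, 1, 3, 2, 1],
    [2, 3, 4, 8, 7, 6],
    [5, 6, 7, 5, 4, 3],
    [8, 9, 0, 2, 1, 0],
])
p = 3                     # patch size
n = img.shape[0] // p     # patches per side: 2

# (6, 6) -> (patch_row, row_in_patch, patch_col, col_in_patch)
#        -> (patch_row, patch_col, row_in_patch, col_in_patch)
#        -> (n_patches, p * p)
patches = img.reshape(n, p, n, p).transpose(0, 2, 1, 3).reshape(n * n, p * p)
print(patches)

# Hypothetical projection weights of shape (9, 4), mapping each patch to d = 4
rng = np.random.default_rng(0)
W = rng.standard_normal((p * p, 4)) * 0.1
tokens = patches @ W
print(tokens.shape)   # (4, 4): four patch "tokens", each a 4-dimensional vector
```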

The Conv2d Trick

The flatten-then-project operation can be implemented as a single convolution with kernel_size = patch_size and stride = patch_size. This is not just clever engineering — it's mathematically identical.

A convolution with kernel_size=16 and stride=16 slides a 16×16 window across the image, moving exactly 16 pixels each step (no overlap). At each position, it computes a weighted sum of the 768 pixel values (16×16×3) to produce one output value per filter. With d_model filters, each position produces a d-dimensional vector. Each position corresponds to one patch.

python
# Patch embedding from scratch: reshape + linear
import torch
import torch.nn as nn

class PatchEmbedScratch(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, d_model=768):
        super().__init__()
        self.patch_size = patch_size
        self.n_patches = (img_size // patch_size) ** 2  # 196
        patch_dim = in_channels * patch_size * patch_size  # 768
        self.proj = nn.Linear(patch_dim, d_model)

    def forward(self, x):
        # x: (batch, channels, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # Reshape: (B, C, H, W) → (B, n_patches, patch_dim)
        x = x.unfold(2, p, p).unfold(3, p, p)  # (B, C, nH, nW, p, p)
        x = x.contiguous().view(B, C, -1, p * p)  # (B, C, n_patches, p²)
        x = x.permute(0, 2, 1, 3).reshape(B, -1, C * p * p)  # (B, n_patches, patch_dim)
        return self.proj(x)  # (B, n_patches, d_model)

python
# The Conv2d trick: mathematically identical, faster on GPU
class PatchEmbedConv(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, d_model=768):
        super().__init__()
        # One conv: kernel=16, stride=16 → no overlap, each position = one patch
        self.proj = nn.Conv2d(in_channels, d_model,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, 3, 224, 224)
        x = self.proj(x)  # (B, d_model, 14, 14)
        x = x.flatten(2)   # (B, d_model, 196)
        x = x.transpose(1, 2)  # (B, 196, d_model) — sequence of patch embeddings
        return x

# Both produce identical shapes:
img = torch.randn(1, 3, 224, 224)
scratch = PatchEmbedScratch()
conv = PatchEmbedConv()
print(scratch(img).shape)  # (1, 196, 768)
print(conv(img).shape)     # (1, 196, 768)
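
Matching shapes alone do not prove the two routes compute the same thing, since the two modules above are initialized with different random weights. The NumPy sketch below shares one set of weights between both routes on a small hypothetical image (3 channels, 8×8 pixels, 4×4 patches) and checks that the values agree. The strided convolution is written out as an explicit weighted sum over each window rather than calling a library convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W_img, p, d = 3, 8, 8, 4, 5
img = rng.standard_normal((C, H, W_img))
kernel = rng.standard_normal((d, C, p, p))   # same layout as a conv weight
bias = rng.standard_normal(d)
nH, nW = H // p, W_img // p

# View the image as non-overlapping blocks: indices [c, i, a, j, b]
# where (i, j) picks the patch and (a, b) the pixel inside it
blocks = img.reshape(C, nH, p, nW, p)

# Route 1: flatten each patch in (channel, row, col) order, then apply a linear layer
patches = blocks.transpose(1, 3, 0, 2, 4).reshape(nH * nW, C * p * p)
linear_out = patches @ kernel.reshape(d, -1).T + bias   # (n_patches, d)

# Route 2: convolution with kernel_size = stride = p, as a weighted sum per window
conv_out = np.einsum("ciajb,kcab->ijk", blocks, kernel) + bias   # (nH, nW, d)
conv_out = conv_out.reshape(nH * nW, d)

assert np.allclose(linear_out, conv_out)
```

The only requirement is that the flattening order of each patch matches the layout of the kernel, here (channel, row, column).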

Patch Size: The Resolution-Speed Tradeoff

Patch size is one of the most impactful hyperparameters in ViT. It controls both the resolution of the representation and the computational cost.

Patch Size	Patches (224×224)	Attention Cost (n²)	Detail Level
32 × 32	49	2,401	Low — misses fine details
16 × 16	196	38,416	Standard — good balance
8 × 8	784	614,656	High — captures fine detail
4 × 4	3,136	9,834,496	Very high — costly

Going from patch size 16 to 8 quadruples the number of patches, which increases the attention-score cost by 16× (since attention is quadratic) and the cost of the per-token layers by 4×. This is why ViT-B/16 (patch size 16) is far more common than ViT-B/8 in practice: the accuracy gain from smaller patches rarely justifies the extra compute.
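
The table's numbers follow from two lines of arithmetic. A sketch in plain Python, where `patch_stats` is an illustrative helper name:

```python
def patch_stats(img_size, patch_size):
    """Return (number of patches, pairwise attention scores) for a square image."""
    n_patches = (img_size // patch_size) ** 2
    return n_patches, n_patches ** 2

for p in (32, 16, 8, 4):
    n, pairs = patch_stats(224, p)
    print(f"{p:2d}x{p:<2d} {n:5d} patches {pairs:10,d} attention scores")

# Halving the patch size: 4x the patches, 16x the attention scores
n16, pairs16 = patch_stats(224, 16)
n8, pairs8 = patch_stats(224, 8)
print(n8 // n16, pairs8 // pairs16)   # 4 16
```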

Patch Embedding Visualizer

An image split into patches. Click any patch to see its flattened pixel vector and projected embedding. Adjust patch size to see the resolution-speed tradeoff.

Click different patches and watch the flattened vector change. Patches showing sky have low-contrast, similar pixel values — their vectors are "boring." Patches with edges or objects have high-variance pixels — their vectors are more "interesting." The linear projection learns to extract the patterns that matter for the downstream task.

Patch embeddings are NOT the same as CNN features. A standard CNN builds features hierarchically over many layers: edges in layer 1, textures in layer 2, parts in layer 3, objects in layer 4. ViT's patch embedding is a single linear projection with no nonlinearity, no stacking, and no receptive field growth. All the feature learning happens in the transformer layers afterward, not in the patch embedding. This is one reason ViT typically needs more data than CNNs: without the inductive biases of convolution (locality, translation equivariance), the transformer must learn spatial structure from scratch.

The [CLS] Token

ViT adds one extra token at the beginning of the sequence: the [CLS] token. This is a learnable embedding (not from any patch) that aggregates information from the entire image through attention. After the transformer, the [CLS] token's final hidden state is used for classification.

So the final input sequence is actually 197 tokens for a 224×224 image with 16×16 patches: 1 [CLS] + 196 patch embeddings. Each token also gets a learnable position embedding added (similar to BERT's recipe from Chapter 2), so the model knows which patch is top-left vs. bottom-right.

python
# ViT's full embedding: patch projection + [CLS] + position
class ViTEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, d_model=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2  # 196

        # Patch embedding via Conv2d
        self.patch_emb = nn.Conv2d(3, d_model,
                                   kernel_size=patch_size, stride=patch_size)

        # Learnable [CLS] token — shape (1, 1, d)
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)

        # Position embeddings for CLS + all patches
        self.pos_emb = nn.Parameter(torch.randn(1, n_patches + 1, d_model) * 0.02)

    def forward(self, x):
        B = x.shape[0]
        # Patch embeddings: (B, 3, 224, 224) → (B, 196, 768)
        patches = self.patch_emb(x).flatten(2).transpose(1, 2)

        # Prepend [CLS] token
        cls = self.cls_token.expand(B, -1, -1)  # (B, 1, 768)
        x = torch.cat([cls, patches], dim=1)     # (B, 197, 768)

        # Add position embeddings
        x = x + self.pos_emb  # (B, 197, 768)
        return x

How does ViT convert a 224×224 image with 16×16 patches into a sequence for the transformer?

Resize the image to a 1D vector and feed it directly
Run a CNN backbone first, then feed the final feature map
Split into 14×14 = 196 non-overlapping patches, flatten each to 768 values, and project to d_model, creating 196 token-like embeddings
Treat each pixel as a separate token (50,176 tokens per image)

Chapter 4: Tied Embeddings

The input embedding maps token IDs to vectors. The output layer maps vectors back to token scores. Both are matrices of shape vocabulary × dimension. What if they were the same matrix? Using it forward to embed, using its transpose to predict?

That's weight tying, and it cuts your embedding parameters in half while often making the model better.

The Output Layer Problem

A language model's final job is to predict the next token. It has a hidden state h — a vector of dimension d that summarizes everything the model has read so far. It needs to turn h into a score for every token in the vocabulary. Token 0 gets a score, token 1 gets a score, all the way up to token V-1.

The standard approach: multiply h by a weight matrix W_out of shape (V, d). The result is a vector of V logits — one score per vocabulary token. The highest logit wins.

logits = h · W_out^T

But wait. W_out has shape (V, d). The input embedding matrix E also has shape (V, d). Both map between a vocabulary-sized space and a d-dimensional space. They're doing mirror-image jobs.

How Tying Works

Weight tying sets W_out = E. The same matrix serves double duty:

Forward pass (embedding): look up row i of E to get token i's embedding vector.
Output pass (prediction): compute logits = h · E^T. The dot product of h with each row of E gives the score for that token.

The dot product h · E[i] measures how similar the hidden state is to token i's embedding. High similarity = high score = model predicts that token. This creates a beautiful symmetry: tokens whose embeddings are close in the embedding space produce similar predictions at the output.

Why does this make sense? Think about what each matrix row means. Row i of the input embedding E is "what does token i look like as a vector?" Row i of the output W is "what should the hidden state look like when predicting token i?" Weight tying says these should be the same representation. If a token's embedding points in a certain direction, the hidden state should point in that same direction when the model wants to predict that token.

Hand Calculation: Tied Output Projection

Let's trace through a concrete example. V = 4 tokens, d = 3 dimensions.

Embedding matrix E (4×3):

Token	d₀	d₁	d₂
0 ("the")	0.5	0.3	-0.1
1 ("cat")	0.8	-0.2	0.4
2 ("sat")	0.1	0.9	0.3
3 ("on")	-0.3	0.5	0.6

The model has processed the sentence and produced a hidden state h = [0.6, 0.1, 0.3]. We now compute logits = h · E^T, which means taking the dot product of h with each row of E:

Token 0 ("the"):

0.6 × 0.5 + 0.1 × 0.3 + 0.3 × (-0.1)
= 0.30 + 0.03 - 0.03 = 0.30

Token 1 ("cat"):

0.6 × 0.8 + 0.1 × (-0.2) + 0.3 × 0.4
= 0.48 - 0.02 + 0.12 = 0.58

Token 2 ("sat"):

0.6 × 0.1 + 0.1 × 0.9 + 0.3 × 0.3
= 0.06 + 0.09 + 0.09 = 0.24

Token 3 ("on"):

0.6 × (-0.3) + 0.1 × 0.5 + 0.3 × 0.6
= -0.18 + 0.05 + 0.18 = 0.05

Logits: [0.30, 0.58, 0.24, 0.05]. Token 1 ("cat") has the highest score. The hidden state h is most similar to the embedding for "cat," so the model predicts "cat" as the next token.
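
The four dot products collapse into a single matrix product. The NumPy sketch below uses the values from the table; the softmax at the end is an extra step not shown in the hand calculation, added to show how logits would become probabilities.

```python
import numpy as np

E = np.array([
    [ 0.5,  0.3, -0.1],   # 0: the
    [ 0.8, -0.2,  0.4],   # 1: cat
    [ 0.1,  0.9,  0.3],   # 2: sat
    [-0.3,  0.5,  0.6],   # 3: on
])
h = np.array([0.6, 0.1, 0.3])

# Tied output projection: one dot product of h with every row of E
logits = h @ E.T
print(np.round(logits, 2))       # [0.3  0.58 0.24 0.05]
print(int(np.argmax(logits)))    # 1, i.e. "cat"

# Softmax (numerically stabilized by subtracting the max)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```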

The Parameter Savings

Let's count parameters for a real model. Llama 2 7B: V = 32,000 tokens, d = 4,096 dimensions.

	Without Tying	With Tying
Input embedding E	32,000 × 4,096 = 131M	32,000 × 4,096 = 131M
Output projection W	32,000 × 4,096 = 131M	Shared with E = 0
Total embedding params	262M	131M
Memory (FP16)	~500 MB	~250 MB

That's 131 million fewer parameters, about 250 MB of memory saved in FP16 precision. For larger vocabularies the potential savings grow in proportion: with Llama 3's 128,256 tokens, a tied matrix would save 128,256 × 4,096 ≈ 525M parameters.

But tying isn't only about saving memory. The shared representation acts as a constraint.

Tying is regularization, not just compression. Without tying, the input and output matrices can learn completely different representations. The input embedding might place "cat" and "kitten" close together, while the output matrix puts them far apart. With tying, the model is forced to use a single representation that works for both embedding tokens and predicting them. This constraint acts as an inductive bias and often improves generalization — the model can't "cheat" by learning separate, inconsistent spaces.

Who Ties and Who Doesn't

Weight tying is common but not universal:

Model	Ties Embeddings?	Notes
GPT-2	Yes	One of the earliest popular tied LMs
BERT	Yes	Ties input embeddings with MLM head
T5	Yes	Enc/dec share the same embedding matrix
Llama 2 (7B)	No	Separate input and output matrices
Llama 3 (8B)	No	Separate matrices, even with a 128K vocabulary
Gemma	Yes	Ties plus scales input embeddings
Mistral	No	Separate matrices

The trend: smaller models benefit more from tying (the embedding parameters are a bigger fraction of total). Larger models can afford separate matrices and sometimes achieve marginally better performance without tying.

The Simulation

Below, we visualize the relationship between input embeddings and output projections. Toggle "tied" to see the matrices link together. The hidden state projects through the embedding matrix to produce scores — tokens closer to h in embedding space get higher logits.

Tied vs Untied Embeddings

Toggle tying on/off. Watch how the output logits change and the parameter count updates. Click a token to see its dot product with the hidden state.

In Code

python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Output projection shares the embedding weight
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # THE TIE

    def forward(self, token_ids):
        # Input: embed tokens
        x = self.embed(token_ids)       # (B, T) -> (B, T, d)
        # ... transformer layers would process x into hidden ...
        hidden = x                      # placeholder: no layers in this sketch
        # Output: logits = hidden @ E.T
        logits = self.lm_head(hidden)   # (B, T, d) -> (B, T, V)
        return logits

# Verify they share memory
model = TiedLM(32000, 4096)
print(model.embed.weight.data_ptr() == model.lm_head.weight.data_ptr())
# True — same tensor in memory

The critical line is self.lm_head.weight = self.embed.weight. After this assignment, both layers reference the exact same tensor in memory. Gradient updates from the output loss flow into the same parameters that define the input embeddings. One gradient update improves both embedding and prediction simultaneously.

What does weight tying between input and output embeddings achieve?

It doubles the embedding parameters to increase model capacity
It forces the model to use separate representations for input and output
It halves the embedding parameter count and provides an inductive bias that the same representation works for both embedding and prediction
It only reduces memory usage without any effect on model quality

Chapter 5: Embedding Scaling

You've combined token and position embeddings by addition. But what if one is much larger than the other? If position embeddings have values around 1.0 and token embeddings have values around 0.02, the position signal dominates. The model knows where every token is but barely knows what it is.

This is exactly what happens with the original Transformer's initialization. And the fix is a single multiplication.

The Scale Mismatch

When we initialize an embedding matrix, each entry is drawn from a distribution with standard deviation roughly 1/√d. For a model with d = 512, that means each element is about ±0.044.

How big is an entire embedding vector? Each of its d elements has variance 1/d, so the vector's squared magnitude (sum of squared elements) is approximately d × (1/d) = 1. The magnitude of a token embedding is roughly √1 = 1.0.

Now consider sinusoidal position embeddings. Each element is sin(…) or cos(…), so values range from -1 to +1. The squared magnitude is approximately d × 0.5 = d/2. The magnitude is roughly √(d/2).

For d = 512: position magnitude ≈ √256 = 16.0. Token magnitude ≈ 1.0. When we add them, position contributes 16× more than token identity. The model is almost entirely position, with a tiny whisper of "what token is this?"

The analogy. Imagine mixing a whisper with a shout. The whisper carries critical information (the token's identity), and the shout carries less critical information (where in the sequence we are). Without amplifying the whisper, it gets drowned out. Scaling by √d is turning up the volume on the whisper until it matches the shout.

The Fix: Scale by √d

E_scaled = E[token_id] × √d_model

After multiplying by √d, each element goes from std ≈ 1/√d to std ≈ 1. The vector's magnitude goes from ≈ 1 to ≈ √d, which is the same order of magnitude as the positional embedding's √(d/2).

Hand Calculation: Magnitude Comparison

Let d = 512. A token embedding vector e has d elements, each with std = 1/√512 ≈ 0.0442.

Without scaling:

Expected element magnitude: 0.0442
Expected vector magnitude: √(d × (1/√d)²) = √(512 × 1/512) = √1 = 1.0

Sinusoidal position embedding:

Element values: sin/cos in [-1, +1], expected squared value ≈ 0.5
Expected vector magnitude: √(d × 0.5) = √256 = 16.0

Ratio without scaling: position/token = 16.0/1.0 = 16:1. Position dominates.

With scaling (multiply e by √512 ≈ 22.6):

Each element: 0.0442 × 22.6 ≈ 1.0
Vector magnitude: 1.0 × 22.6 = 22.6
Ratio: 16.0/22.6 ≈ 0.7:1. Balanced!

Now both signals contribute comparably. The model can distinguish both what a token is and where it is from the very first layer.
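
The same comparison can be checked numerically. The sketch below draws one random token vector with std 1/√d and builds one sinusoidal position vector with the sin/cos formula from the original Transformer. The seed, the position index 7, and the helper names are arbitrary choices made for this example.

```python
import math
import random

d = 512
random.seed(0)  # arbitrary seed so the run is repeatable

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

# Token embedding: d elements drawn with std = 1/sqrt(d).
token = [random.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]

def sinusoidal(pos, d_model):
    """Sinusoidal position vector: sin on even indices, cos on odd indices."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec

position = sinusoidal(pos=7, d_model=d)
scaled_token = [x * math.sqrt(d) for x in token]

print(f"token norm:        {norm(token):.2f}")         # close to 1
print(f"position norm:     {norm(position):.2f}")      # 16.00, since sin^2 + cos^2 = 1 per pair
print(f"scaled token norm: {norm(scaled_token):.2f}")  # close to sqrt(512), about 22.6
```

The position norm is exactly √(d/2) at every position because each sin/cos pair contributes 1 to the squared length; the token norm only hovers around 1 and changes with the random draw.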

A Concrete Example

Let's use d = 4 for a tiny example we can trace completely.

Token embedding (initialized with std = 1/√4 = 0.5):

e = [0.3, -0.5, 0.2, 0.4]

Magnitude: √(0.09 + 0.25 + 0.04 + 0.16) = √0.54 ≈ 0.735

Positional embedding (sinusoidal):

p = [0.84, 0.54, 0.91, -0.42]

Magnitude: √(0.71 + 0.29 + 0.83 + 0.18) = √2.01 ≈ 1.418

Without scaling: combined = e + p

[0.3+0.84, -0.5+0.54, 0.2+0.91, 0.4-0.42] = [1.14, 0.04, 1.11, -0.02]

The result is dominated by the position values (0.84, 0.54, 0.91, -0.42). The token's contribution is barely visible — the 0.3 gets swamped by 0.84, the -0.5 nearly cancels with 0.54.

With scaling: e_scaled = e × √4 = e × 2

e_scaled = [0.6, -1.0, 0.4, 0.8]

combined = e_scaled + p

[0.6+0.84, -1.0+0.54, 0.4+0.91, 0.8-0.42] = [1.44, -0.46, 1.31, 0.38]

Now both signals are clearly visible. The -1.0 from the token is preserved alongside the 0.54 from position, giving -0.46 instead of a washed-out 0.04.
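
The d = 4 example is small enough to check directly. A sketch in plain Python, with the values copied from above and results rounded to two decimals (the `add` helper is introduced only for this example):

```python
import math

e = [0.3, -0.5, 0.2, 0.4]      # token embedding from the example
p = [0.84, 0.54, 0.91, -0.42]  # positional embedding from the example
d = len(e)

def add(u, v):
    """Element-wise sum, rounded to two decimals for display."""
    return [round(a + b, 2) for a, b in zip(u, v)]

unscaled = add(e, p)
e_scaled = [x * math.sqrt(d) for x in e]   # multiply by sqrt(4) = 2
scaled = add(e_scaled, p)

print(unscaled)  # [1.14, 0.04, 1.11, -0.02]
print(scaled)    # [1.44, -0.46, 1.31, 0.38]
```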

When Scaling Is and Isn't Used

Architecture	Scales?	Position Type	Why
Original Transformer	Yes (√d)	Sinusoidal (fixed)	Fixed position values need token magnitudes to match
BERT	No	Learned	Learned positions adapt their scale during training
GPT-2	No	Learned	Same reason as BERT
Llama	No	RoPE (rotation)	RoPE rotates Q/K, doesn't add to embedding
Gemma	Yes (√d)	RoPE	Google's choice for tied+scaled embeddings
T5	No	Relative bias	Position is in attention bias, not added to embedding

The pattern: scaling is mainly needed when fixed (sinusoidal) position embeddings are added to small-initialized token embeddings. Modern LLMs using RoPE don't add position to the embedding at all — they rotate the query and key vectors. No addition means no magnitude mismatch.

Not all transformers need embedding scaling. It was in the original "Attention Is All You Need" paper because sinusoidal position embeddings have fixed magnitude ≈√(d/2), while token embeddings initialize small. Modern LLMs using RoPE don't need it — RoPE rotates Q and K instead of adding to the embedding. If you see √d scaling in a codebase, check whether the architecture actually requires it or if it's cargo-culted from the original paper.

The Simulation

Adjust the scaling factor below to see how token and position signals combine. At low scale, position dominates (all tokens look similar in the combined representation). At √d, they balance. At very high scale, token dominates and position information is lost.

Embedding Scale Balance

Drag the scale factor. Watch the token-to-position ratio and the combined vectors. The sweet spot is where both bars are roughly equal.

In Code

python
import torch
import torch.nn as nn
import math

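class ScaledEmbedding(nn.Module):
    """Original Transformer embedding with sqrt(d) scaling."""
    def __init__(self, vocab_size, d_model, max_len=5000):
        super().__init__()
        self.d_model = d_model
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        # Match the text: token entries start with std ~ 1/sqrt(d)
        # (nn.Embedding defaults to std = 1).
        nn.init.normal_(self.tok_embed.weight, std=d_model ** -0.5)
        # Fixed (non-learned) table, stored as a buffer so it moves with the model
        self.register_buffer("pos_embed", self._sinusoidal(max_len, d_model))

    @staticmethod
    def _sinusoidal(max_len, d_model):
        """Build the (max_len, d) sin/cos table; assumes d_model is even."""
        pos = torch.arange(max_len).unsqueeze(1)     # (max_len, 1)
        freq = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        table = torch.zeros(max_len, d_model)
        table[:, 0::2] = torch.sin(pos * freq)       # even dimensions
        table[:, 1::2] = torch.cos(pos * freq)       # odd dimensions
        return table

    def forward(self, token_ids):
        seq_len = token_ids.size(1)
        tok = self.tok_embed(token_ids)              # (B, T, d)
        tok = tok * math.sqrt(self.d_model)          # SCALE!
        pos = self.pos_embed[:seq_len].unsqueeze(0)  # (1, T, d)
        return tok + pos                              # balanced addition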

The single line tok * math.sqrt(self.d_model) makes the difference between a model that can barely distinguish tokens and one where token identity and position are balanced from the start.

Why does the original Transformer multiply token embeddings by √d_model?

To make the embedding matrix invertible for weight tying
To match the magnitude of token embeddings (initialized small) with sinusoidal position embeddings (magnitude ~1 per element), so neither signal dominates when added
To normalize embeddings to unit length for cosine similarity
To prevent gradient explosion in the first layer

Chapter 6: The Embedding Space Explorer

Everything we've learned — token IDs mapped to vectors, embeddings trained by gradient descent, parameter sharing, magnitude balancing — converges into one central question: what does the learned embedding space actually look like?

Random initialization scatters tokens across the space with no structure. Training sculpts that space: synonyms cluster together, antonyms push apart, and semantic relationships emerge as geometric patterns. The explorer below lets you watch this process unfold in real time.

What You'll See

The canvas shows a 2D projection of a simulated embedding space. Each dot is a token. At step 0 (random initialization), the dots are scattered with no pattern. As training progresses, semantic clusters emerge. Animals drift toward animals. Colors group with colors. The structure you see is the model discovering meaning.

Play with the controls:

Vocabulary theme: Choose which tokens populate the space. Each theme has natural clusters the training will discover.
Embedding dimension: Higher dimensions give the space more room. Clusters tighten because there are more directions to separate them. Lower dimensions force compromises — clusters overlap.
Training steps: Slide or animate to watch the embeddings evolve. The first few steps show dramatic reorganization. Later steps refine the structure.
Show similarity lines: Toggle lines between tokens with high cosine similarity. After training, these connect semantically related tokens.
Show clusters: Draw convex hulls around discovered clusters.

What to try. Start with "Animals & Colors" theme, dimension 8, and press Play. Watch the clusters separate. Then switch to dimension 2 and replay — see how the low-dimensional space struggles to separate them cleanly. Switch to "Mixed" theme for a harder clustering problem. Click any token to see its embedding vector as a bar chart on the right.

Embedding Space Explorer

Watch random vectors organize into semantic clusters as training progresses. Click a token to inspect its embedding.

What the Training Is Doing

In a real language model, embedding training happens implicitly through the language modeling objective: predict the next token. Tokens that appear in similar contexts develop similar embeddings because the gradients push them in similar directions.

Our simulation mimics this with a simplified attraction/repulsion model. Tokens in the same semantic category attract each other (their embeddings move closer). Tokens in different categories repel slightly. The result approximates what real language model training produces.

Three things to notice as training progresses:

Cluster formation (steps 0-100): Initially random tokens rapidly separate into groups. This is the "big picture" phase where the model learns major categories.
Cluster tightening (steps 100-300): Within-cluster tokens move closer together. The model refines distinctions: "cat" and "dog" aren't just "animals" — they're specific animals.
Fine structure (steps 300-500): Subtle sub-clusters appear. Among animals, pets cluster separately from wildlife. Among colors, warm colors separate from cool colors.

The dimension slider reveals the information bottleneck. At dim=2, the space literally doesn't have enough directions to separate all clusters cleanly. Some groups must overlap. At dim=32, there's room for nuance — every token gets its own corner. This is why real models use d=768 or d=4096: the embedding space needs to represent the entire vocabulary with all its semantic relationships, and that requires a high-dimensional canvas.
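
The attraction/repulsion model described above can be sketched in a few lines. This is not the simulation's actual code: the toy vocabulary, the update rates, and the step count are assumptions chosen for illustration, and real language-model training reaches similar structure through next-token gradients rather than explicit category labels.

```python
import math
import random

random.seed(0)  # arbitrary seed for a repeatable run

# Hypothetical toy vocabulary: token -> semantic category.
categories = {
    "cat": "animal", "dog": "animal", "horse": "animal",
    "red": "color", "blue": "color", "green": "color",
}
dim = 8
emb = {tok: [random.gauss(0.0, 1.0) for _ in range(dim)] for tok in categories}

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean_distance(same_category):
    """Average pairwise distance over same-category or cross-category pairs."""
    toks = list(categories)
    dists = [distance(emb[a], emb[b])
             for i, a in enumerate(toks) for b in toks[i + 1:]
             if (categories[a] == categories[b]) == same_category]
    return sum(dists) / len(dists)

def train_step(attract=0.01, repel=0.001):
    """Pull same-category tokens together; push other categories apart slightly."""
    updates = {tok: [0.0] * dim for tok in emb}
    for a in emb:
        for b in emb:
            if a == b:
                continue
            rate = attract if categories[a] == categories[b] else -repel
            for k in range(dim):
                updates[a][k] += rate * (emb[b][k] - emb[a][k])
    # Apply all updates at once so every token sees the same starting state.
    for tok, delta in updates.items():
        emb[tok] = [x + dx for x, dx in zip(emb[tok], delta)]

before = (mean_distance(True), mean_distance(False))
for _ in range(100):
    train_step()
after = (mean_distance(True), mean_distance(False))

print(f"step 0:   within={before[0]:.2f}  across={before[1]:.2f}")
print(f"step 100: within={after[0]:.2f}  across={after[1]:.2f}")
```

After 100 steps the within-category distance shrinks sharply while the across-category distance does not, which is the cluster formation the explorer animates.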

Chapter 7: Subword Embeddings

The word "transformerization" might appear 3 times in your entire training set. Its embedding gets 3 gradient updates — barely better than random. But break it into subwords: "transform," "er," "ization." Each piece appears thousands of times. The composed representation from well-trained subwords is far better than a single undertrained whole-word embedding.

This is the core insight of subword tokenization, and it's why every modern language model uses it.

The Whole-Word Problem

Suppose we give every word its own embedding. English has roughly 170,000 words in common use. Add technical terms, names, misspellings, and multilingual support, and you easily hit a million. That's a million rows in the embedding matrix, most of them barely trained.

Worse: a new word the model has never seen gets no embedding at all. The entire model breaks on a single unfamiliar word. This is the out-of-vocabulary (OOV) problem, and it plagued NLP for decades.

BPE: Byte Pair Encoding

Byte Pair Encoding (Sennrich et al., 2016) solves both problems with a simple algorithm. Start with a character-level vocabulary (every letter is a token). Then repeatedly merge the most frequent adjacent pair into a new token.

Start

Vocabulary = all characters: a, b, c, ..., z, space, etc.

↓

Scan

Find the most frequent adjacent pair in the training text.

↓

Merge

Replace every occurrence of that pair with a new token.

↓

Repeat

Continue until vocabulary reaches target size (e.g., 32,000).

↻ iterate

Hand Calculation: BPE Merges

Let's run BPE on a tiny corpus. Our training text:

text
"the cat sat on the mat the cat"

Step 0: Character tokens

['t','h','e',' ','c','a','t',' ','s','a','t',' ','o','n',' ','t','h','e',' ','m','a','t',' ','t','h','e',' ','c','a','t']

Count all adjacent pairs:

Pair	Count
t, h	3
h, e	3
a, t	4
c, a	2
e, ␣	3
t, ␣	3
␣, t	2
␣, c	2
␣, m	1
s, a	1
others	1 each

Merge 1: Most frequent pair is "a"+"t" (count 4). Create new token "at".

Result: ['t','h','e',' ','c','at',' ','s','at',' ','o','n',' ','t','h','e',' ','m','at',' ','t','h','e',' ','c','at']

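Merge 2: Four pairs are now tied at 3 occurrences: "t"+"h", "h"+"e", "e"+"␣", and "at"+"␣". Breaking the tie by first occurrence, merge "t"+"h". Create "th".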

Result: ['th','e',' ','c','at',' ','s','at',' ','o','n',' ','th','e',' ','m','at',' ','th','e',' ','c','at']

Merge 3: "th"+"e" appears 3 times (tied with "e"+"␣" and "at"+"␣"; we take the first occurrence). Create "the".

Result: ['the',' ','c','at',' ','s','at',' ','o','n',' ','the',' ','m','at',' ','the',' ','c','at']

After just 3 merges, "the" is a single token. High-frequency words collapse quickly. Rare words stay decomposed into common pieces.

Why Subword Embeddings Are Better

Consider the word "unhappiness" (appears rarely) versus its subwords:

Piece	Frequency	Gradient Updates	Embedding Quality
"unhappiness" (whole word)	~50	~50	Poor (barely trained)
"un" (prefix)	~500,000	~500,000	Excellent
"happi" (stem)	~100,000	~100,000	Excellent
"ness" (suffix)	~300,000	~300,000	Excellent

The subword approach gives us three well-trained embeddings instead of one barely-trained one. The transformer's attention layers then compose these subword embeddings into a word-level representation. Attention learns that "un" means negation, "happi" carries the core meaning, and "ness" marks a noun. This compositional understanding transfers to every word with these subwords — "unhelpfulness," "happiness," "sadness" all benefit.

Subword embeddings are compositional, not concatenated. Each subword gets its own embedding vector (looked up from the embedding matrix just like any token). These vectors are then processed by the transformer's attention layers, which learn to compose subword meanings into word meanings. The embedding layer provides the raw ingredients; attention does the cooking.

The Vocabulary Size Tradeoff

The number of BPE merges determines the vocabulary size. This is a critical hyperparameter with opposing forces:

	Small V (8K)	Medium V (32K)	Large V (100K)
Tokens per word	3-5 (many subwords)	1-2 (fewer subwords)	1 (often whole word)
Sequence length	Long (more tokens per sentence)	Moderate	Short (fewer tokens)
Subword frequency	Very high (well-trained)	High	Low for rare tokens
Embedding table size	8K × d (small)	32K × d (moderate)	100K × d (large)
OOV handling	Never (can spell anything)	Never	Never (but rare tokens = poor)
Compute cost	High (attention is O(n²))	Balanced	Lower per-token

Most modern LLMs settle on V = 32K-128K as the sweet spot. Llama 1 and Llama 2 use 32K. GPT-4 uses ~100K. Llama 3 moved to 128K to improve multilingual coverage (more languages means more unique subwords needed).
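
To make the opposing forces concrete, the sketch below compares embedding table size against attention cost for one input. Every number in it is an assumption for illustration: d = 768, a 100-word input, and tokens-per-word values picked from inside the ranges in the table above.

```python
d_model = 768          # assumed embedding width
words = 100            # assumed input length in words

# (vocab size, assumed average tokens per word), illustrative values only
settings = [(8_000, 4.0), (32_000, 1.5), (100_000, 1.0)]

rows = []
for vocab_size, tokens_per_word in settings:
    seq_len = int(words * tokens_per_word)
    table_params = vocab_size * d_model       # embedding table grows linearly with V
    attention_pairs = seq_len ** 2            # self-attention compares every token pair
    rows.append((vocab_size, seq_len, table_params, attention_pairs))
    print(f"V={vocab_size:>7,}  seq_len={seq_len:>4}  "
          f"table={table_params / 1e6:5.1f}M  attention pairs={attention_pairs:,}")
```

Under these assumptions, going from 8K to 100K makes the table 12.5× larger but cuts the number of attention pairs by 16×, which is the tension the table describes.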

The Simulation

Type any word below to see how it gets tokenized under different vocabulary sizes. The simulation shows which subwords are well-trained (green, high frequency) versus barely trained (red, low frequency). Notice how smaller vocabularies split words into more pieces, but each piece is individually better trained.

Subword Tokenizer Explorer

Type a word and see how BPE splits it at different vocabulary sizes. Green subwords are well-trained; red ones are rare.

In Code

python
# Simplified BPE from scratch
def get_pairs(tokens):
    """Count all adjacent pairs."""
    pairs = {}
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i+1])
        pairs[pair] = pairs.get(pair, 0) + 1
    return pairs

def merge(tokens, pair, new_token):
    """Replace every occurrence of pair with new_token."""
    result = []
    i = 0
    while i < len(tokens):
        if i < len(tokens)-1 and tokens[i]==pair[0] and tokens[i+1]==pair[1]:
            result.append(new_token)
            i += 2
        else:
            result.append(tokens[i])
            i += 1
    return result

# Run BPE on the same corpus as the hand calculation
text = "the cat sat on the mat the cat"
tokens = list(text)            # start with characters
num_merges = 10

for _ in range(num_merges):
    pairs = get_pairs(tokens)
    if not pairs:
        break
    # max() returns the first pair that reaches the top count, so ties are
    # broken by first occurrence
    best = max(pairs, key=pairs.get)
    new_tok = best[0] + best[1]
    tokens = merge(tokens, best, new_tok)
    print(f"Merge: '{best[0]}' + '{best[1]}' -> '{new_tok}'  Tokens: {tokens}")

python
# Production tokenizers: tiktoken (GPT) and sentencepiece (Llama)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("transformerization")
print([enc.decode([t]) for t in tokens])
# prints the subword pieces, e.g. something like ['transform', 'er', 'ization'];
# the exact split depends on the encoding

print(len(tokens))  # number of subword tokens

# Compare: a common word is a single token
print(enc.encode("the"))   # [1820] — one token
print(enc.encode("hello")) # [15339] — one token

Subword embeddings are NOT just concatenated. Each subword gets its own embedding vector, looked up independently from the embedding table. These separate vectors enter the transformer as separate tokens in the sequence. The attention mechanism then learns to compose subword meanings into word-level representations across layers. This is why "un" + "happi" + "ness" doesn't just mean their embeddings glued together — it means the transformer has multiple layers to recognize the negation prefix, the emotional stem, and the noun suffix, and to build a coherent whole-word representation from these pieces.

The Embedding Quality Hierarchy

Not all embeddings are created equal. The number of gradient updates a token receives during training directly determines how well its embedding represents its meaning:

Token Type	Training Frequency	Embedding Quality	Example
Stop words	Billions	Excellent	"the", "is", "and"
Common subwords	Millions	Very good	"ing", "tion", "pre"
Common words	Hundreds of thousands	Good	"transformer", "network"
Rare subwords	Thousands	Adequate	"zyg", "qph"
Single characters	Millions (as fallback)	Good (but carry less meaning)	"x", "z", "7"

This is why the worst case for subword tokenization is still decent: even a word split into individual characters uses embeddings that have each seen millions of training examples. The model can still compose meaning from well-trained character embeddings — it just takes more layers of attention to do so.

Why does subword tokenization produce better embeddings for rare words than whole-word tokenization?

Subword tokens are always shorter, so they process faster
Rare words are split into common subwords that each have well-trained embeddings from frequent exposure, so the composed representation is better than a single undertrained whole-word embedding
Subword tokenization uses a larger embedding dimension per token
Whole-word tokenization doesn't support gradient descent

Chapter 8: Embedding Strategy Arena

You've learned six different ways to build embedding layers: small vocabularies with tied weights, large vocabularies with untied weights, subword BPE, character-level, and two patch sizes for vision. Each has tradeoffs in parameter count, embedding quality, sequence length, and compute cost. But which one wins on a real task?

That depends entirely on the task. The arena below lets you race all six strategies head-to-head across four different challenges. No single strategy dominates everywhere — that's the whole point of understanding the tradeoffs.

The Six Strategies

Strategy	V	d	Params	Tokens/Input
Small V + Tied	8,000	768	6.1M (shared)	~40 (long sequences)
Large V + Untied	128,000	4,096	1,049M (two matrices)	~15 (short sequences)
Subword BPE (32K)	32,000	768	24.6M	~25 (balanced)
Character-Level	256	768	0.2M	~120 (very long)
Patch 16 (ViT)	N/A	768	0.6M (projection)	196 patches
Patch 32 (ViT)	N/A	768	2.4M (projection)	49 patches
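
The Params column can be recomputed from V, d, and the patch size. The sketch below uses hypothetical helper names and makes two assumptions worth flagging: the BPE and character-level rows are counted as a single matrix (which is what the table's figures imply), and the patch projection ignores its bias term and assumes 3-channel input.

```python
def token_table_params(vocab_size, d_model, tied):
    """Input embedding plus output projection (shared when tied)."""
    return vocab_size * d_model * (1 if tied else 2)

def patch_projection_params(patch, d_model, channels=3):
    """Weights of the projection from a flattened patch to d_model (bias ignored)."""
    return patch * patch * channels * d_model

def num_patches(image_size, patch):
    """Number of non-overlapping patches in a square image."""
    return (image_size // patch) ** 2

strategies = {
    "Small V + Tied":    token_table_params(8_000, 768, tied=True),
    "Large V + Untied":  token_table_params(128_000, 4_096, tied=False),
    "Subword BPE (32K)": token_table_params(32_000, 768, tied=True),
    "Character-Level":   token_table_params(256, 768, tied=True),
    "Patch 16 (ViT)":    patch_projection_params(16, 768),
    "Patch 32 (ViT)":    patch_projection_params(32, 768),
}
for name, params in strategies.items():
    print(f"{name:<18} {params / 1e6:8.1f}M")

print(num_patches(224, 16), num_patches(224, 32))  # 196 49
```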

The Four Tasks

Each task favors different strengths. Watch which strategies rise and fall:

Language Modeling: Predict the next token. Large vocabularies and subword BPE shine — fewer tokens per sentence means less attention cost and better long-range dependencies. Character-level struggles with sequence length.
Sentence Embedding: Produce a single vector summarizing a sentence. Tied weights help because the shared representation creates consistent semantics. Character-level pays a heavy attention cost for long sequences.
Image Classification: Classify a 224×224 image. Only patch strategies apply here. Patch 16 gives more detail (196 tokens) but costs more. Patch 32 is faster (49 tokens) but misses fine features.
Multilingual: Handle text in 100+ languages. Large vocabularies cover more scripts. Character-level handles any script but makes very long sequences. Small vocabularies fail on non-Latin scripts.

Embedding Strategy Arena

Select a task and watch the six strategies compete. Bars show simulated performance scores (higher = better). Click a bar for details on why each strategy scores the way it does.

Reading the Results

No strategy wins every task. That's the deepest lesson here:

There is no universal embedding strategy. The right choice depends on your task, your compute budget, your language coverage, and your sequence length constraints. Subword BPE at V=32K is the most common default because it balances all factors reasonably. But if you need multilingual coverage, you push V higher. If you need vision, you use patches. If you need extreme efficiency, you consider character-level with an efficient attention mechanism. Understanding the tradeoffs — not memorizing a recipe — is what matters.

Chapter 9: Cheat Sheet

Everything from this lesson, compressed into a reference you can come back to.

Decision Flowchart

What is your input?

Text → go to token embeddings. Images → go to patch embeddings.

↓

Text: How many languages?

One language → V=32K is fine. Many languages → V=100K+ for script coverage.

↓

Text: Model size?

Small (<1B params) → tie weights, it's free regularization. Large (>7B) → untying often wins.

↓

Text: Position encoding?

Sinusoidal (fixed) → scale embeddings by √d. Learned or RoPE → no scaling needed.

↓

Images: Resolution needed?

Standard → patch 16 (196 tokens). Speed-critical → patch 32 (49 tokens). Fine detail → patch 8 (784 tokens).
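
The flowchart can also be read as a small function. The sketch below is a hypothetical helper, not part of any library: the argument names are invented for this example, the image branch assumes a 224×224 input, and model sizes between 1B and 7B return None because the flowchart gives no rule for that range.

```python
def recommend(input_type, many_languages=False, params_billions=1.0,
              position="rope", need="standard"):
    """Walk the flowchart above and return a dict of suggestions."""
    if input_type == "image":
        patch = {"standard": 16, "speed": 32, "detail": 8}[need]
        return {"embedding": "patch", "patch_size": patch,
                "tokens": (224 // patch) ** 2}   # assumes a 224x224 input

    advice = {"embedding": "token",
              "vocab_size": "100K+" if many_languages else "32K"}
    if params_billions < 1:
        advice["tie_weights"] = True             # small model: tie
    elif params_billions > 7:
        advice["tie_weights"] = False            # large model: untying often wins
    else:
        advice["tie_weights"] = None             # the flowchart gives no rule here
    # Only fixed sinusoidal positions call for sqrt(d) scaling.
    advice["scale_by_sqrt_d"] = (position == "sinusoidal")
    return advice

print(recommend("text", params_billions=0.5, position="sinusoidal"))
print(recommend("image", need="detail"))
```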

Component Catalog

Component	What It Is	Shape	Key Fact
Token Embedding	Lookup table: integer → vector	(V, d)	No multiplication — just indexing
Position Embedding	Lookup table: position → vector	(max_len, d)	Added to token embedding (not concatenated)
Segment Embedding	Lookup table: segment ID → vector	(2, d)	BERT-specific; GPT and Llama don't use it
Patch Embedding	Conv2d or Linear projection	(patch_dim, d)	Equivalent to Conv2d(3, d, kernel=p, stride=p)
Weight Tying	Share E between input and output	Saves V×d params	lm_head.weight = embed.weight (same tensor)
Embedding Scaling	Multiply token emb by √d	Scalar multiply	Only needed with fixed sinusoidal positions
[CLS] Token	Learnable classification token	(1, d)	Prepended to patch sequence in ViT

PyTorch Quick Reference

python
import math
import torch
import torch.nn as nn

# 1. Basic token embedding
emb = nn.Embedding(num_embeddings=32000, embedding_dim=768)
vectors = emb(torch.tensor([42, 1337]))  # shape: (2, 768)

# 2. Weight tying
lm_head = nn.Linear(768, 32000, bias=False)
lm_head.weight = emb.weight  # same tensor in memory

# 3. Embedding scaling (original Transformer)
ids = torch.tensor([[42, 1337, 7]])        # example batch: (B=1, T=3)
scaled = emb(ids) * math.sqrt(768)         # match sinusoidal magnitude

# 4. Patch embedding (ViT)
img = torch.randn(1, 3, 224, 224)          # example image batch
patch_emb = nn.Conv2d(3, 768, kernel_size=16, stride=16)
patches = patch_emb(img).flatten(2).transpose(1, 2)  # (B, 196, 768)

# 5. BERT-style combined embedding
pos_emb = nn.Embedding(512, 768)           # learned positions
seg_emb = nn.Embedding(2, 768)             # segment A / segment B
positions = torch.arange(ids.size(1)).unsqueeze(0)   # (1, T)
segment_ids = torch.zeros_like(ids)        # every token in segment A
x = emb(ids) + pos_emb(positions) + seg_emb(segment_ids)

# 6. Subword tokenization (production)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("transformerization")    # list of integer token IDs
pieces = [enc.decode([t]) for t in tokens]   # the subword strings

Key Numbers to Remember

Model	V	d	Emb Params	Tied?	Position
GPT-2 Small	50,257	768	38.6M	Yes	Learned
BERT-Base	30,522	768	23.4M	Yes	Learned
Llama 2 7B	32,000	4,096	131M	No	RoPE
Llama 3 8B	128,256	4,096	525M	No	RoPE
ViT-B/16	N/A	768	0.6M	N/A	Learned

Connections

Embedding layers are the first thing that happens in every neural network that processes discrete inputs. Understanding them unlocks everything downstream:

Normalization — LayerNorm is typically applied right after the embedding addition. Why? Because adding three vectors of different scales can create magnitude issues that normalization fixes.
Optimizers — Embedding layers have unique optimization challenges: sparse gradients (only selected rows update), rare tokens (few gradient steps), and the weight tying constraint. Adam's per-parameter statistics handle these naturally.
Transformers — The embedding output is what enters the transformer stack. Every concept in the transformer lesson — attention, feed-forward networks, residual connections — operates on the vectors that embedding layers produce.
Vision Transformer — Patch embeddings are how ViT converts images into the sequence format that transformers require. The tradeoffs of patch size directly affect model quality and efficiency.

"What I cannot create, I do not understand." — Richard Feynman. You now know how to build an embedding layer from scratch: the lookup table, the addition of position and segment signals, the scaling trick, the weight tying optimization, the Conv2d patch projection, and the subword tokenization that makes it all work for real language. Every transformer starts here.

A model uses V=100,000, d=2,048, tied weights, and learned position embeddings (max_len=4,096). How many total embedding parameters does it have?

200M (two separate 100K×2048 matrices)
100K×2048 + 4096×2048 = 204.8M + 8.4M = ~213M (tied token + learned position)
100K×2048 = 204.8M (just the token table, positions are free)
413M (two token matrices + position matrix)