Training Foundations

Embedding Layers

How discrete tokens become dense vectors: the lookup tables, tied weights, and patch projections at the input of every transformer.

Prerequisites: What a matrix is + How neural networks process inputs. That's it.
10 chapters · 12+ simulations · 0 assumed knowledge

Chapter 0: From Integers to Meaning

The word "cat" enters a transformer as the integer 3782. The word "dog" enters as 5291. A neural network needs to do math with these — add them, multiply them, compute distances. But 3782 + 5291 = 9073, which means nothing. How do you turn arbitrary integers into vectors that capture meaning?

This isn't just a formatting problem. It's a representation problem. Integer IDs are categorical labels — they have no magnitude, no direction, no distance. Token 3782 isn't "closer" to token 3783 than to token 5291. The numbers were assigned arbitrarily when the tokenizer was built. They carry zero semantic information.

But neural networks live in continuous space. They multiply, add, and differentiate. They need inputs where "nearby" means "similar." We need a mapping from the discrete world of token IDs into a continuous vector space where similar meanings live at similar coordinates.

That mapping is called an embedding layer. And it's deceptively simple — it's just a table lookup.

The Lookup Table Idea

Imagine a spreadsheet. Each row is a word in your vocabulary. Each row has d columns — say, 8 numbers. The word "cat" is row 3782. When you need a vector for "cat," you just go to row 3782 and read off the 8 numbers. No multiplication. No activation function. Just indexing.

Those 8 numbers are the embedding vector for "cat." They're learned parameters — they start random and get adjusted by gradient descent during training. After training, words with similar meanings end up with similar vectors. "cat" and "dog" are close. "cat" and "parliamentary" are far apart.

Let's see this in action.

Embedding Lookup

Click a word to see its embedding vector (d=8 dimensions). Similar words have similar bar patterns. Below: a 2D projection showing how words cluster by meaning.

Click "cat" and then "dog." Notice how their bar charts look similar — both have high values in some of the same dimensions and low values in others. Now click "the." The pattern is completely different. That's because "cat" and "dog" are both animals, while "the" is a function word with entirely different grammatical and semantic properties.

The 2D projection at the bottom makes this even clearer. "cat," "dog," and "kitten" cluster together. "house" and "building" form another cluster. "the" sits alone. These clusters emerge naturally from training — the model learned that words used in similar contexts should have similar vectors.

Hand Calculation: Embedding Lookup

Let's trace this by hand with a tiny vocabulary. We have 6 words, and each embedding has 3 dimensions (d=3).

Setup. Vocabulary size V=6, embedding dimension d=3. The embedding matrix E has shape (6, 3) — one row per token, 3 numbers per row. That's V × d = 6 × 3 = 18 learnable parameters.

Here's our embedding matrix (values learned from training):

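| Token ID | Word | E[id][0] | E[id][1] | E[id][2] |
|---|---|---|---|---|
| 0 | the | -0.12 | 0.05 | 0.88 |
| 1 | cat | 0.72 | -0.41 | 0.15 |
| 2 | dog | 0.68 | -0.38 | 0.22 |
| 3 | sat | -0.55 | 0.62 | -0.03 |
| 4 | house | 0.31 | 0.15 | -0.72 |
| 5 | on | -0.08 | 0.11 | 0.79 |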

Step 1: Token "cat" has ID 1. The embedding is row 1 of the matrix:

E[1] = [0.72, -0.41, 0.15]

That's it. No multiplication. No bias. No activation function. Just: go to row 1, read off 3 numbers. This is the entire operation.

Step 2: Token "dog" has ID 2. Its embedding is row 2:

E[2] = [0.68, -0.38, 0.22]

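Step 3: How similar are "cat" and "dog"? Compute the Euclidean distance:

dist(cat, dog) = √((0.72 − 0.68)² + (−0.41 − (−0.38))² + (0.15 − 0.22)²) = √(0.0016 + 0.0009 + 0.0049) = √0.0074 ≈ 0.086

Now compare "cat" to "the":

dist(cat, the) = √((0.72 − (−0.12))² + (−0.41 − 0.05)² + (0.15 − 0.88)²) = √(0.7056 + 0.2116 + 0.5329) = √1.4501 ≈ 1.204

"cat" is about 14× closer to "dog" than to "the" (1.204 / 0.086 ≈ 14). The embedding space has organized itself so that semantic similarity maps to geometric proximity. This is the entire point of embeddings.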
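
The two distances can be checked numerically. This is a minimal sketch using NumPy, with the three rows copied from the toy table above; the variable names are illustrative only.

```python
import numpy as np

# Rows copied from the toy embedding matrix above
the = np.array([-0.12, 0.05, 0.88])
cat = np.array([0.72, -0.41, 0.15])
dog = np.array([0.68, -0.38, 0.22])

d_cat_dog = np.linalg.norm(cat - dog)   # Euclidean distance cat <-> dog
d_cat_the = np.linalg.norm(cat - the)   # Euclidean distance cat <-> the
ratio = d_cat_the / d_cat_dog           # how many times closer "dog" is

print(f"{d_cat_dog:.3f} {d_cat_the:.3f} {ratio:.1f}")
```

Euclidean distance is only one choice here; cosine similarity is also commonly used to compare embeddings.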

From Scratch, Then PyTorch

python
# Embedding from scratch — it's literally a matrix with indexed rows
import numpy as np

class Embedding:
    def __init__(self, vocab_size, embed_dim):
        # Random initialization — training will adjust these
        self.weight = np.random.randn(vocab_size, embed_dim) * 0.02

    def forward(self, token_ids):
        # That's it. Index into rows. No multiplication.
        return self.weight[token_ids]

# Usage
emb = Embedding(vocab_size=6, embed_dim=3)
ids = np.array([1, 2, 0])  # "cat", "dog", "the"
vectors = emb.forward(ids)     # shape: (3, 3)
# vectors[0] = emb.weight[1]  → cat's embedding
# vectors[1] = emb.weight[2]  → dog's embedding
# vectors[2] = emb.weight[0]  → the's embedding
python
# PyTorch equivalent — nn.Embedding wraps the same idea
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=6, embedding_dim=3)
print(emb.weight.shape)  # torch.Size([6, 3])

ids = torch.tensor([1, 2, 0])
vectors = emb(ids)  # shape: (3, 3) — same as weight[ids]

# Proof: embedding IS just indexing
assert torch.allclose(emb(ids), emb.weight[ids])  # True
An embedding layer is NOT a linear layer. A linear layer computes W × x + b for a continuous input x. An embedding layer takes an integer index and returns the corresponding row of a lookup table. No multiplication, just indexing. Under the hood, nn.Embedding(V, d) stores a matrix of shape (V, d) and returns matrix[token_id]. Mathematically, it's equivalent to multiplying a one-hot vector by the weight matrix, but practical implementations don't do that: it would waste memory and compute on a vector that's all zeros except for a single 1.
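
The one-hot equivalence mentioned above can be verified directly. The sketch below uses NumPy with a random stand-in table (the values are not the trained ones from the hand calculation) and compares the two routes to the same vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3
E = rng.normal(size=(V, d))     # stand-in embedding table, random values
ids = np.array([1, 2, 0])

one_hot = np.eye(V)[ids]        # (3, V): each row has a single 1
via_matmul = one_hot @ E        # (3, d): multiplies mostly zeros
via_lookup = E[ids]             # (3, d): plain row indexing

same = np.allclose(via_matmul, via_lookup)
print(same)
```

The matmul route touches all V × d entries per token, while the lookup reads only d of them, which is why indexing is the practical choice.
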
What mathematical operation does an embedding layer perform?

Chapter 1: Token Embeddings in Detail

The embedding table is deceptively simple — it's just a big matrix. But it's also one of the largest parameter blocks in the entire model. For Llama 2 7B, the embedding matrix has 131 million parameters. That's more parameters than all of GPT-2 Small (124M).

Let's understand why it's so big, how it trains, and what can go wrong.

The Size of the Table

The token embedding matrix E has shape (V, d), where V is the vocabulary size (number of unique tokens) and d is the embedding dimension (length of each vector). Every entry is a learnable floating-point number.

Parameter count = V × d. Memory = V × d × bytes_per_param.

Memory arithmetic matters. In FP16 (2 bytes per number), each embedding vector of dimension d costs 2d bytes. Multiply by V tokens and you get the total embedding memory. This is a fixed cost — it doesn't depend on sequence length or batch size. It's just the weight of the table sitting in GPU memory.

Let's compute this for real models:

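| Model | V | d | Params | FP16 Memory (1 MB = 2²⁰ bytes) |
|---|---|---|---|---|
| GPT-2 Small | 50,257 | 768 | 38.6M | 74 MB |
| BERT-Base | 30,522 | 768 | 23.4M | 45 MB |
| Llama 2 7B | 32,000 | 4,096 | 131.1M | 250 MB |
| Llama 3 8B | 128,256 | 4,096 | 525.3M | 1,002 MB |
| GPT-4 (est.) | 100,000 | 8,192 | 819.2M | 1,562 MB |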

Look at Llama 3 vs Llama 2. Same embedding dimension, but Llama 3 quadrupled the vocabulary (from 32K to 128K tokens). That quadrupled the embedding table from 250 MB to about 1 GB. With d held fixed, the table grows linearly with vocabulary size.
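
The arithmetic behind the table is short enough to script. In this sketch the helper name `embedding_table_cost` is hypothetical, and sizes are reported in MiB (2²⁰ bytes), which is the convention under which Llama 2's table comes out at 250.

```python
def embedding_table_cost(vocab_size, embed_dim, bytes_per_param=2):
    """Return (parameter count, size in MiB) of a (V, d) embedding table."""
    params = vocab_size * embed_dim
    mib = params * bytes_per_param / 2**20
    return params, mib

models = {
    "GPT-2 Small": (50_257, 768),
    "BERT-Base": (30_522, 768),
    "Llama 2 7B": (32_000, 4_096),
    "Llama 3 8B": (128_256, 4_096),
}
for name, (V, d) in models.items():
    params, mib = embedding_table_cost(V, d)
    print(f"{name:12s} {params / 1e6:7.1f}M params {mib:8.1f} MiB")
```

Passing `bytes_per_param=4` gives the FP32 footprint, which is exactly double.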

Why would anyone want a bigger vocabulary? Because larger vocabularies encode text more efficiently — fewer tokens per sentence, shorter sequences, lower inference cost. But the embedding table gets proportionally larger. This is a fundamental tradeoff in model design.

Embedding Table Size Explorer

Adjust vocabulary size and embedding dimension. The heatmap shows the embedding matrix (each cell is a parameter). Below: total parameter count and memory.


How Training Works

At initialization, every row of the embedding matrix is filled with small random numbers (typically drawn from a normal distribution with std = 0.02). At this point, "cat" and "dog" have random, unrelated vectors. The model can't tell them apart any more than it can tell them from "parliament."

During training, each forward pass selects a subset of rows, one for each token in the batch. The loss gradient flows backward through the model and eventually reaches the embedding layer. But here's the key: only the rows that were selected get a gradient update. If "cat" (ID 3782) appeared in the batch, row 3782 gets updated. If "dog" (ID 5291) did not appear, row 5291 is untouched.

This is fundamentally different from a linear layer, where every weight participates in every forward pass. In an embedding layer, each row trains independently — only when its token appears in a batch.

Forward Pass: Token IDs [4, 1, 0, 2] → index into E → vectors (4, d)
Loss Computed: Cross-entropy or other loss → single scalar
Backward Pass: Gradient ∂L/∂E[id] computed for ONLY the 4 selected rows
Update: E[4] -= lr · grad[4], E[1] -= lr · grad[1], ... (only 4 rows touched)

The Rare Token Problem

This selective updating creates a problem. Tokens that appear frequently — "the," "is," "a" — get updated thousands of times per epoch. Their embeddings are well-trained, nuanced, and stable. They live in exactly the right part of the vector space.

But rare tokens — "pneumonoultramicroscopicsilicovolcanoconiosis," someone's unusual name, a niche technical term — might appear 5 times in the entire training set. Five gradient updates. Their embeddings stay close to their random initialization, carrying almost no learned information.

Rare tokens DON'T get good embeddings. If a token appears 5 times in the training set, its embedding row gets at most 5 gradient updates. A token appearing 5 million times gets millions. This is why subword tokenization (BPE, SentencePiece) matters: it reduces the number of very rare tokens by breaking rare words into common subwords. The word "unhappiness" might become ["un", "happiness"] (the exact split depends on the tokenizer), and both subwords appear frequently enough to have good embeddings.
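
To make the imbalance concrete, the sketch below counts how many training batches touch each row of a 6-token table. The corpus is a hypothetical toy stream with made-up frequencies, and the count is per batch because a row receives one accumulated gradient per optimizer step in which its token appears.

```python
import numpy as np

V = 6
# Hypothetical toy corpus of token IDs: ID 0 is very common, ID 5 is rare
corpus = np.array([0] * 500 + [1] * 200 + [2] * 200 + [3] * 80 + [4] * 15 + [5] * 5)
rng = np.random.default_rng(0)
rng.shuffle(corpus)

batch_size = 50
updates = np.zeros(V, dtype=int)        # batches in which each row got a gradient
for start in range(0, len(corpus), batch_size):
    batch = corpus[start:start + batch_size]
    updates[np.unique(batch)] += 1      # only rows present in the batch are touched

occurrences = np.bincount(corpus, minlength=V)
print("occurrences:", occurrences.tolist())
print("updates:    ", updates.tolist())
```

Row 5 can never receive more updates than its 5 occurrences, while row 0 is updated in essentially every batch.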

Gradient Flow Through a Lookup

How does gradient descent work on a lookup table? Think of it this way: mathematically, looking up row i of matrix E is equivalent to computing the product e_iᵀ · E, where e_i is a one-hot vector (all zeros except position i). No implementation actually materializes the one-hot vector, since that would waste memory, but the gradient math works out the same way.

The gradient of the loss with respect to E[i] is simply the gradient that flowed back to this layer, summed over every position where token i appeared. For rows that weren't selected, the gradient is zero (they weren't used, so changing them can't change the loss). This is one place where Adam's per-parameter statistics matter: rare tokens get very different update patterns from frequent ones.

python
# Demonstrating gradient flow through embedding lookup
import torch
import torch.nn as nn

V, d = 6, 3
emb = nn.Embedding(V, d)

# Forward: look up tokens 1 ("cat") and 2 ("dog")
ids = torch.tensor([1, 2])
vecs = emb(ids)           # shape: (2, 3)

# Fake loss: sum all embedding values
loss = vecs.sum()
loss.backward()

# Only rows 1 and 2 got gradients
print(emb.weight.grad[0])  # tensor([0., 0., 0.])  ← "the" untouched
print(emb.weight.grad[1])  # tensor([1., 1., 1.])  ← "cat" got gradient
print(emb.weight.grad[2])  # tensor([1., 1., 1.])  ← "dog" got gradient
print(emb.weight.grad[3])  # tensor([0., 0., 0.])  ← "sat" untouched
Weight tying. Many models (GPT-2, BERT, T5, Gemma) share the embedding matrix with the output projection layer. The same (V, d) matrix is used to convert token IDs to vectors at the input AND to convert hidden states back to vocabulary logits at the output. This halves the embedding memory and acts as a regularizer: the model must find embeddings that work well in both directions. This is called weight tying or shared embeddings. Not every model does it; Llama 2, for example, keeps separate matrices.
For a model with vocabulary V=50,000 and embedding dimension d=2,048 stored in FP16, how much memory does the embedding table use?

Chapter 2: Combining Embeddings — The Input Recipe

A single token needs more than just its word meaning. It also needs to know where it is in the sequence (position 0? position 47?) and sometimes which segment it belongs to (sentence A? sentence B?). A transformer doesn't process tokens one at a time like an RNN — it sees the whole sequence at once. Without explicit position information, it can't tell "dog bites man" from "man bites dog."

The solution: add multiple embedding vectors together. Each embedding encodes a different aspect of the input. They all live in the same d-dimensional space, and the model learns to disentangle them.

BERT's Three-Part Recipe

BERT computes the input to its transformer stack as:

input = E_token[token_id] + E_position[pos] + E_segment[seg_id]

Three separate lookup tables. Three separate row indices. Three vectors of the same dimension d. Added element-wise.

| Table | Shape | What it encodes | How many rows? |
|---|---|---|---|
| E_token | (V, d) | Word identity ("cat", "sat", "on") | V = 30,522 (BERT) |
| E_position | (max_len, d) | Position in sequence (0th, 1st, 47th) | max_len = 512 (BERT) |
| E_segment | (2, d) | Which sentence (A or B) | 2 (just two segments) |

Total embedding parameters for BERT-Base (d=768): 30,522 × 768 + 512 × 768 + 2 × 768 = 23,440,896 + 393,216 + 1,536 = 23,835,648. The token table dominates — it's 98.3% of the embedding parameters.
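
The same count in a few lines of Python, as a quick check. It covers only the three lookup tables; the LayerNorm that BERT applies after the sum has its own small set of parameters, which are left out here.

```python
V, max_len, n_segments, d = 30_522, 512, 2, 768

token_params = V * d                 # E_token
position_params = max_len * d        # E_position
segment_params = n_segments * d      # E_segment
total = token_params + position_params + segment_params
token_share = token_params / total

print(f"{total:,} total; token table share {token_share:.1%}")
# 23,835,648 total; token table share 98.3%
```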

GPT's Two-Part Recipe

GPT-2 and GPT-3 use a simpler recipe — no segment embedding, because they don't do sentence-pair tasks:

input = E_token[token_id] + E_position[pos]

Modern models like Llama and Mistral go even further. They use Rotary Position Embeddings (RoPE) instead of a learned position table. RoPE injects position information inside the attention mechanism, not at the input. This means the only embedding table is E_token; position information is handled differently (covered in the Positional Encoding lesson).

Hand Calculation: Adding Embeddings

Let's trace BERT's recipe by hand with d=4.

Setup. Token "cat" has ID 2. It sits at position 3 in the sequence. It belongs to segment 0 (sentence A). Each embedding table has d=4 columns.

Step 1: Look up each embedding.

Step 2: Add element-wise.

input = [0.34, -0.52, -0.07, 0.84]

This single vector now encodes three things: the word is "cat," it's at position 3, and it belongs to sentence A. All compressed into 4 numbers. The transformer layers that follow will learn to disentangle these signals.

Embedding Addition

Three embedding vectors for a single token. Toggle each component on/off to see its contribution. Change the position slider to watch the position embedding shift while the token embedding stays fixed.


Drag the position slider and watch. The token embedding (orange bars) stays fixed — "cat" is "cat" regardless of where it appears. The position embedding (teal bars) changes — position 0 has a different pattern than position 7. The sum below shifts accordingly.

Toggle off the position component. Now the sum looks almost identical to just the token embedding. Toggle off the token component and leave only position. Now the sum looks nothing like "cat" — it's pure positional information. The model learns to use different dimensions for different types of information, making addition work despite all three signals occupying the same vector space.

Why Add Instead of Concatenate?

The obvious alternative to addition is concatenation. Instead of adding three d-dimensional vectors, concatenate them into one 3d-dimensional vector. This keeps the information perfectly separated — no interference between token meaning and position.

So why doesn't anyone do this?

Why ADD instead of CONCATENATE? Concatenation would triple the dimension (d + d + d = 3d for BERT). Every subsequent layer, every attention head, every feed-forward network, would need to operate on 3d-dimensional inputs instead of d. Activations would triple in size, and the weight matrices, whose size scales with d², would grow by about 9×, for the entire model and not just the embedding layer. Addition keeps the dimension at d, which means zero changes to the rest of the architecture. Through training, the model learns to use different subsets of dimensions for different information types: some dimensions primarily encode position, others primarily encode token identity.

There's a deeper reason too. Addition is information-lossy in theory but sufficient in practice. With d=768 or d=4096, there are enough dimensions for the model to allocate different "channels" to different signal types. Analyses of trained models suggest that position embeddings and token embeddings tend to be nearly orthogonal: they learn to use different directions in the high-dimensional space, minimizing interference.
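
A rough way to see why high-dimensional spaces leave room for several signals: even random directions are almost orthogonal when d is large. The sketch below uses random Gaussian vectors, not trained embeddings, so it illustrates only the geometry and says nothing about what any specific model learns. The helper name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(d, n_pairs=1000):
    """Average |cosine similarity| between pairs of random Gaussian vectors."""
    a = rng.normal(size=(n_pairs, d))
    b = rng.normal(size=(n_pairs, d))
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return np.abs(cos).mean()

low_dim = mean_abs_cosine(4)      # few dimensions: vectors overlap noticeably
high_dim = mean_abs_cosine(768)   # BERT-sized: overlap is close to zero
print(f"d=4: {low_dim:.3f}   d=768: {high_dim:.3f}")
```

When two signals sit in nearly orthogonal directions, adding them loses little, because a later linear layer can project each one back out.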

From Scratch: BERT's Embedding Layer

python
import torch
import torch.nn as nn

class BERTEmbedding(nn.Module):
    def __init__(self, vocab_size, max_len, d_model, n_segments=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb   = nn.Embedding(max_len, d_model)
        self.seg_emb   = nn.Embedding(n_segments, d_model)
        self.norm       = nn.LayerNorm(d_model)  # BERT normalizes after adding
        self.drop       = nn.Dropout(0.1)

    def forward(self, token_ids, segment_ids):
        # token_ids: (batch, seq_len)  — integer token IDs
        # segment_ids: (batch, seq_len) — 0 or 1
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)

        # Three lookups, one addition
        x = self.token_emb(token_ids)    # (batch, seq, d)
        x = x + self.pos_emb(positions)  # broadcast: (seq, d) → (batch, seq, d)
        x = x + self.seg_emb(segment_ids)
        return self.drop(self.norm(x))

# Usage
emb = BERTEmbedding(vocab_size=30522, max_len=512, d_model=768)
tokens = torch.randint(0, 30522, (2, 128))  # batch=2, seq=128
segs   = torch.zeros(2, 128, dtype=torch.long)
out    = emb(tokens, segs)  # (2, 128, 768)
python
# HuggingFace: it's already inside the model
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-uncased")

# The three embedding tables live here:
print(model.embeddings.word_embeddings.weight.shape)      # (30522, 768)
print(model.embeddings.position_embeddings.weight.shape)  # (512, 768)
print(model.embeddings.token_type_embeddings.weight.shape)  # (2, 768)
In BERT's input embedding, what three types of information are combined?

Chapter 3: Patch Embeddings — Images as Token Sequences

Transformers process sequences. Text is a sequence of tokens — that's natural. But images aren't sequences. They're 2D grids of pixels. How do you feed an image into a transformer?

You could treat every pixel value as a token. A 224×224 RGB image has 224 × 224 × 3 = 150,528 of them. But attention is O(n²) in sequence length, so computing attention over 150K tokens is impractical: that's about 22.7 billion attention scores per head, per layer.

The Vision Transformer (ViT) solved this with a simple trick: chop the image into a grid of non-overlapping patches, treat each patch as a "word," and run the same transformer architecture. A 224×224 image with 16×16 patches gives 196 "tokens" — a manageable sequence that attention can process.

The Mechanics

Here's exactly what happens, step by step:

Image: (3, 224, 224), RGB, 224 pixels per side
Split into Patches: 224 ÷ 16 = 14 → 14 × 14 = 196 patches
Flatten Each Patch: each patch has 16 × 16 × 3 = 768 pixel values
Linear Projection: 768 → d_model via learned weight matrix
Sequence of Embeddings: (196, d_model), same format as text tokens

Notice that for ViT-Base with 16×16 patches, each flattened patch has 16 × 16 × 3 = 768 values, exactly the embedding dimension d=768. So the linear projection happens to be a square matrix (768×768). The equality is specific to this configuration: ViT-Large keeps the 16×16 patches but uses d=1024, so its projection maps 768 → 1024.

Hand Calculation: A Tiny Image

Let's work through a tiny example. Grayscale image (1 channel), 6×6 pixels, patch size 3×3.

Setup. Image: 6×6, 1 channel. Patch size: 3×3. That gives 6/3 = 2 rows and 2 columns of patches = 4 patches total. Each patch has 3 × 3 × 1 = 9 pixel values. We'll project to d=4.

Here's our 6×6 image (pixel values 0-9):

1 2 3 7 8 9
4 5 0 6 5 4
7 8 1 3 2 1
2 3 4 8 7 6
5 6 7 5 4 3
8 9 0 2 1 0

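Patch 0 (top-left 3×3 block): rows [1 2 3], [4 5 0], [7 8 1] → flattened [1, 2, 3, 4, 5, 0, 7, 8, 1]

Patch 1 (top-right): rows [7 8 9], [6 5 4], [3 2 1] → flattened [7, 8, 9, 6, 5, 4, 3, 2, 1]

Patch 2 (bottom-left): rows [2 3 4], [5 6 7], [8 9 0] → flattened [2, 3, 4, 5, 6, 7, 8, 9, 0]

Patch 3 (bottom-right): rows [8 7 6], [5 4 3], [2 1 0] → flattened [8, 7, 6, 5, 4, 3, 2, 1, 0]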

Linear projection: Weight matrix W is (9, 4). Let's project Patch 0:

For simplicity, suppose the projection of [1, 2, 3, 4, 5, 0, 7, 8, 1] through W gives us [0.42, -0.18, 0.73, 0.05]. That 4D vector is Patch 0's patch embedding. It's now a "token" that attention can process, just like the word "cat" in a language model.

After projecting all 4 patches, we have a sequence of length 4, each element a d=4 vector. The transformer processes this exactly like a 4-token text sequence.
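
The patch extraction for this 6×6 example can be reproduced with a reshape and a transpose. This is a sketch in NumPy; the projection matrix W is random and hypothetical, so the resulting embeddings will not match the illustrative [0.42, -0.18, 0.73, 0.05] quoted above.

```python
import numpy as np

image = np.array([
    [1, 2, 3, 7, 8, 9],
    [4, 5, 0, 6, 5, 4],
    [7, 8, 1, 3, 2, 1],
    [2, 3, 4, 8, 7, 6],
    [5, 6, 7, 5, 4, 3],
    [8, 9, 0, 2, 1, 0],
])
p = 3                        # patch size
n = image.shape[0] // p      # 2 patches per side

# (6, 6) -> (2, 3, 2, 3) -> (2, 2, 3, 3) -> (4, 9): one flattened patch per row
patches = image.reshape(n, p, n, p).transpose(0, 2, 1, 3).reshape(n * n, p * p)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(p * p, 4))   # hypothetical projection weights
embeddings = patches @ W                     # (4, 4): four "tokens" of dimension 4

print(patches[0].tolist())   # [1, 2, 3, 4, 5, 0, 7, 8, 1]
```

The transpose is the step that matters: without it, the reshape would slice the image into horizontal strips instead of square patches.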

The Conv2d Trick

The flatten-then-project operation can be implemented as a single convolution with kernel_size = patch_size and stride = patch_size. This is not just clever engineering — it's mathematically identical.

A convolution with kernel_size=16 and stride=16 slides a 16×16 window across the image, moving exactly 16 pixels each step (no overlap). At each position, it computes a weighted sum of the 768 pixel values (16×16×3) to produce one output value per filter. With d_model filters, each position produces a d-dimensional vector. Each position corresponds to one patch.

python
# Patch embedding from scratch: reshape + linear
import torch
import torch.nn as nn

class PatchEmbedScratch(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, d_model=768):
        super().__init__()
        self.patch_size = patch_size
        self.n_patches = (img_size // patch_size) ** 2  # 196
        patch_dim = in_channels * patch_size * patch_size  # 768
        self.proj = nn.Linear(patch_dim, d_model)

    def forward(self, x):
        # x: (batch, channels, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # Reshape: (B, C, H, W) → (B, n_patches, patch_dim)
        x = x.unfold(2, p, p).unfold(3, p, p)  # (B, C, nH, nW, p, p)
        x = x.contiguous().view(B, C, -1, p * p)  # (B, C, n_patches, p²)
        x = x.permute(0, 2, 1, 3).reshape(B, -1, C * p * p)  # (B, n_patches, patch_dim)
        return self.proj(x)  # (B, n_patches, d_model)
python
# The Conv2d trick: mathematically identical, faster on GPU
class PatchEmbedConv(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, d_model=768):
        super().__init__()
        # One conv: kernel=16, stride=16 → no overlap, each position = one patch
        self.proj = nn.Conv2d(in_channels, d_model,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, 3, 224, 224)
        x = self.proj(x)  # (B, d_model, 14, 14)
        x = x.flatten(2)   # (B, d_model, 196)
        x = x.transpose(1, 2)  # (B, 196, d_model) — sequence of patch embeddings
        return x

# Both produce identical shapes:
img = torch.randn(1, 3, 224, 224)
scratch = PatchEmbedScratch()
conv = PatchEmbedConv()
print(scratch(img).shape)  # (1, 196, 768)
print(conv(img).shape)     # (1, 196, 768)
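
Matching shapes alone do not show that the two layers compute the same thing, since each is initialized with its own random weights. The sketch below copies a Conv2d's weights into a Linear layer and compares outputs numerically. It uses deliberately tiny, hypothetical sizes (8×8 image, 4×4 patches, d=5) so it runs instantly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C, p, d = 3, 4, 5                      # channels, patch size, embedding dim
conv = nn.Conv2d(C, d, kernel_size=p, stride=p)
linear = nn.Linear(C * p * p, d)

# Conv weight (d, C, p, p) flattens to the Linear weight (d, C*p*p)
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(d, C * p * p))
    linear.bias.copy_(conv.bias)

img = torch.randn(2, C, 8, 8)
B = img.shape[0]
patches = img.unfold(2, p, p).unfold(3, p, p)    # (B, C, nH, nW, p, p)
patches = patches.permute(0, 2, 3, 1, 4, 5)      # (B, nH, nW, C, p, p)
patches = patches.reshape(B, -1, C * p * p)      # (B, n_patches, C*p*p)

out_linear = linear(patches)                          # (B, 4, d)
out_conv = conv(img).flatten(2).transpose(1, 2)       # (B, 4, d)
max_diff = (out_linear - out_conv).abs().max().item()
print(out_linear.shape, max_diff)
```

The flattening order has to be (channel, row, column) to line up with how Conv2d stores its kernels; any other order gives a different, though equally learnable, projection.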

Patch Size: The Resolution-Speed Tradeoff

Patch size is the single most impactful hyperparameter in ViT. It controls both the resolution of the representation and the computational cost.

| Patch Size | Patches (224×224) | Attention Cost (n²) | Detail Level |
|---|---|---|---|
| 32 × 32 | 49 | 2,401 | Low: misses fine details |
| 16 × 16 | 196 | 38,416 | Standard: good balance |
| 8 × 8 | 784 | 614,656 | High: captures fine detail |
| 4 × 4 | 3,136 | 9,834,496 | Very high: costly |

Going from patch size 16 to 8 quadruples the number of patches, which increases the attention cost by 16× (since attention is quadratic). This is why ViT-B/16 (patch size 16) is far more common than ViT-B/8 in practice — the accuracy gain from smaller patches rarely justifies the 16× compute increase.
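
The table rows follow from two lines of arithmetic. In this sketch the helper `patch_stats` is a hypothetical name, and "attention cost" means the number of pairwise scores per head per layer, ignoring the extra [CLS] token.

```python
def patch_stats(img_size, patch_size):
    """Return (number of patches, number of pairwise attention scores)."""
    assert img_size % patch_size == 0, "image side must be divisible by patch size"
    n = (img_size // patch_size) ** 2
    return n, n * n

for p in (32, 16, 8, 4):
    n, scores = patch_stats(224, p)
    print(f"{p:2d}x{p:<2d} {n:5,d} patches {scores:10,d} attention scores")
```

Halving the patch side always multiplies the patch count by 4 and the score count by 16, regardless of image size.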

Patch Embedding Visualizer

An image split into patches. Click any patch to see its flattened pixel vector and projected embedding. Adjust patch size to see the resolution-speed tradeoff.


Click different patches and watch the flattened vector change. Patches showing sky have low-contrast, similar pixel values — their vectors are "boring." Patches with edges or objects have high-variance pixels — their vectors are more "interesting." The linear projection learns to extract the patterns that matter for the downstream task.

Patch embeddings are NOT the same as CNN features. A standard CNN builds features hierarchically over many layers — edges in layer 1, textures in layer 2, parts in layer 3, objects in layer 4. ViT's patch embedding is a single linear projection — no nonlinearity, no stacking, no receptive field growth. All the feature learning happens in the transformer layers afterward, not in the patch embedding. This is why ViT needs more data than CNNs — without the inductive biases of convolution (locality, translation equivariance), the transformer must learn spatial structure from scratch.

The [CLS] Token

ViT adds one extra token at the beginning of the sequence: the [CLS] token. This is a learnable embedding (not from any patch) that aggregates information from the entire image through attention. After the transformer, the [CLS] token's final hidden state is used for classification.

So the final input sequence is actually 197 tokens for a 224×224 image with 16×16 patches: 1 [CLS] + 196 patch embeddings. Each token also gets a learnable position embedding added (similar to BERT's recipe from Chapter 2), so the model knows which patch is top-left vs. bottom-right.

python
# ViT's full embedding: patch projection + [CLS] + position
import torch
import torch.nn as nn
class ViTEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, d_model=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2  # 196

        # Patch embedding via Conv2d
        self.patch_emb = nn.Conv2d(3, d_model,
                                   kernel_size=patch_size, stride=patch_size)

        # Learnable [CLS] token — shape (1, 1, d)
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)

        # Position embeddings for CLS + all patches
        self.pos_emb = nn.Parameter(torch.randn(1, n_patches + 1, d_model) * 0.02)

    def forward(self, x):
        B = x.shape[0]
        # Patch embeddings: (B, 3, 224, 224) → (B, 196, 768)
        patches = self.patch_emb(x).flatten(2).transpose(1, 2)

        # Prepend [CLS] token
        cls = self.cls_token.expand(B, -1, -1)  # (B, 1, 768)
        x = torch.cat([cls, patches], dim=1)     # (B, 197, 768)

        # Add position embeddings
        x = x + self.pos_emb  # (B, 197, 768)
        return x
How does ViT convert a 224×224 image with 16×16 patches into a sequence for the transformer?

Chapter 4: Tied Embeddings

The input embedding maps token IDs to vectors. The output layer maps vectors back to token scores. Both are matrices of shape vocabulary × dimension. What if they were the same matrix? Using it forward to embed, using its transpose to predict?

That's weight tying, and it cuts your embedding parameters in half while often making the model better.

The Output Layer Problem

A language model's final job is to predict the next token. It has a hidden state h — a vector of dimension d that summarizes everything the model has read so far. It needs to turn h into a score for every token in the vocabulary. Token 0 gets a score, token 1 gets a score, all the way up to token V-1.

The standard approach: multiply h by a weight matrix W_out of shape (V, d). The result is a vector of V logits, one score per vocabulary token. The highest logit wins.

logits = h · W_outᵀ

But wait. W_out has shape (V, d). The input embedding matrix E also has shape (V, d). Both map between a vocabulary-sized space and a d-dimensional space. They're doing mirror-image jobs.

How Tying Works

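Weight tying sets W_out = E. The same matrix serves double duty:

Input: x = E[token_id] (look up one row)
Output: logits = h · Eᵀ (dot product of h with every row)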

The dot product h · E[i] measures how similar the hidden state is to token i's embedding. High similarity = high score = model predicts that token. This creates a beautiful symmetry: tokens whose embeddings are close in the embedding space produce similar predictions at the output.

Why does this make sense? Think about what each matrix row means. Row i of the input embedding E is "what does token i look like as a vector?" Row i of the output W is "what should the hidden state look like when predicting token i?" Weight tying says these should be the same representation. If a token's embedding points in a certain direction, the hidden state should point in that same direction when the model wants to predict that token.

Hand Calculation: Tied Output Projection

Let's trace through a concrete example. V = 4 tokens, d = 3 dimensions.

Embedding matrix E (4×3):

| Token | d0 | d1 | d2 |
|---|---|---|---|
| 0 ("the") | 0.5 | 0.3 | -0.1 |
| 1 ("cat") | 0.8 | -0.2 | 0.4 |
| 2 ("sat") | 0.1 | 0.9 | 0.3 |
| 3 ("on") | -0.3 | 0.5 | 0.6 |

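The model has processed the sentence and produced a hidden state h = [0.6, 0.1, 0.3]. We now compute logits = h · Eᵀ, which means taking the dot product of h with each row of E:

Token 0 ("the"): 0.6 × 0.5 + 0.1 × 0.3 + 0.3 × (−0.1) = 0.30 + 0.03 − 0.03 = 0.30

Token 1 ("cat"): 0.6 × 0.8 + 0.1 × (−0.2) + 0.3 × 0.4 = 0.48 − 0.02 + 0.12 = 0.58

Token 2 ("sat"): 0.6 × 0.1 + 0.1 × 0.9 + 0.3 × 0.3 = 0.06 + 0.09 + 0.09 = 0.24

Token 3 ("on"): 0.6 × (−0.3) + 0.1 × 0.5 + 0.3 × 0.6 = −0.18 + 0.05 + 0.18 = 0.05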

Logits: [0.30, 0.58, 0.24, 0.05]. Token 1 ("cat") has the highest score. The hidden state h is most similar to the embedding for "cat," so the model predicts "cat" as the next token.
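
The same projection in NumPy, as a quick check of the four dot products. The softmax at the end is an extra step added for illustration; the hand calculation above stops at the logits.

```python
import numpy as np

E = np.array([
    [0.5, 0.3, -0.1],    # 0 "the"
    [0.8, -0.2, 0.4],    # 1 "cat"
    [0.1, 0.9, 0.3],     # 2 "sat"
    [-0.3, 0.5, 0.6],    # 3 "on"
])
h = np.array([0.6, 0.1, 0.3])

logits = h @ E.T                         # one dot product per vocabulary row
probs = np.exp(logits - logits.max())    # subtract max for numerical stability
probs /= probs.sum()                     # softmax over the vocabulary
predicted = int(np.argmax(logits))

print(np.round(logits, 2).tolist(), predicted)
```

Because the logits are close together, the softmax probabilities are fairly flat here; a trained model with larger-magnitude hidden states would be more decisive.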

The Parameter Savings

Let's count parameters for a real model. Llama 2 7B: V = 32,000 tokens, d = 4,096 dimensions.

| | Without Tying | With Tying |
|---|---|---|
| Input embedding E | 32,000 × 4,096 = 131M | 32,000 × 4,096 = 131M |
| Output projection W | 32,000 × 4,096 = 131M | Shared with E = 0 |
| Total embedding params | 262M | 131M |
| Memory (FP16) | ~500 MB | ~250 MB |

That's 131 million fewer parameters, about 250 MB of memory saved in FP16 precision. For larger vocabularies (Llama 3's 128K tokens), the savings would be even more dramatic: 128,256 × 4,096 ≈ 525M params.

But tying isn't only about saving memory. The shared representation acts as a constraint.

Tying is regularization, not just compression. Without tying, the input and output matrices can learn completely different representations. The input embedding might place "cat" and "kitten" close together, while the output matrix puts them far apart. With tying, the model is forced to use a single representation that works for both embedding tokens and predicting them. This constraint acts as an inductive bias and often improves generalization — the model can't "cheat" by learning separate, inconsistent spaces.

Who Ties and Who Doesn't

Weight tying is common but not universal:

| Model | Ties Embeddings? | Notes |
|---|---|---|
| GPT-2 | Yes | One of the earliest popular tied LMs |
| BERT | Yes | Ties input embeddings with MLM head |
| T5 | Yes | Enc/dec share the same embedding matrix |
| Llama 2 (7B) | No | Separate input and output matrices |
| Llama 3 (8B) | No | Separate input and output matrices |
| Gemma | Yes | Ties plus scales input embeddings |
| Mistral | No | Separate matrices |

The trend: smaller models benefit more from tying (the embedding parameters are a bigger fraction of total). Larger models can afford separate matrices and sometimes achieve marginally better performance without tying.

The Simulation

Below, we visualize the relationship between input embeddings and output projections. Toggle "tied" to see the matrices link together. The hidden state projects through the embedding matrix to produce scores — tokens closer to h in embedding space get higher logits.

Tied vs Untied Embeddings

Toggle tying on/off. Watch how the output logits change and the parameter count updates. Click a token to see its dot product with the hidden state.


In Code

python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Output projection shares the embedding weight
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # THE TIE

    def forward(self, token_ids, hidden):
        # Input: embed tokens
        x = self.embed(token_ids)       # (B, T) -> (B, T, d)
        # ... transformer layers process x into hidden ...
        # Output: logits = hidden @ E.T
        logits = self.lm_head(hidden)  # (B, T, d) -> (B, T, V)
        return logits

# Verify they share memory
model = TiedLM(32000, 4096)
print(model.embed.weight.data_ptr() == model.lm_head.weight.data_ptr())
# True — same tensor in memory

The critical line is self.lm_head.weight = self.embed.weight. After this assignment, both layers reference the exact same tensor in memory. Gradient updates from the output loss flow into the same parameters that define the input embeddings. One gradient update improves both embedding and prediction simultaneously.
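The comment `logits = hidden @ E.T` can be traced by hand on a toy table. The sketch below uses plain Python lists and made-up 3-dimensional vectors (the tokens and values are illustrative assumptions, not trained embeddings) to show one matrix serving as both the input lookup table and the output projection.

```python
# Toy embedding matrix E: one row per token (values are made up).
vocab = ["cat", "dog", "car"]
E = [
    [0.9, 0.1, 0.0],   # cat
    [0.8, 0.3, 0.1],   # dog
    [0.0, 0.2, 0.9],   # car
]

def embed(token_id):
    """Input side: a row lookup, no arithmetic."""
    return E[token_id]

def tied_logits(hidden):
    """Output side: dot product of hidden with every row of the same E."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in E]

hidden = [1.0, 0.2, 0.0]   # pretend final hidden state, pointing toward "cat"
logits = tied_logits(hidden)
best = vocab[logits.index(max(logits))]
print(dict(zip(vocab, logits)), "->", best)
```

Rows that align with the hidden state score highest, so "cat" wins here and "dog" is a close second. Because the same rows are used for lookup and for scoring, any update to a row changes both roles at once.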

What does weight tying between input and output embeddings achieve?

Chapter 5: Embedding Scaling

You've combined token and position embeddings by addition. But what if one is much larger than the other? If position embeddings have values around 1.0 and token embeddings have values around 0.02, the position signal dominates. The model knows where every token is but barely knows what it is.

This is exactly what happens with the original Transformer's initialization. And the fix is a single multiplication.

The Scale Mismatch

When we initialize an embedding matrix, each entry is drawn from a distribution with standard deviation roughly 1/√d. For a model with d = 512, that means each element is about ±0.044.

How big is an entire embedding vector? Each of its d elements has variance 1/d, so the vector's squared magnitude (sum of squared elements) is approximately d × (1/d) = 1. The magnitude of a token embedding is roughly √1 = 1.0.

Now consider sinusoidal position embeddings. Each element is sin(…) or cos(…), so values range from -1 to +1. The squared magnitude is approximately d × 0.5 = d/2. The magnitude is roughly √(d/2).

For d = 512: position magnitude ≈ √256 = 16.0. Token magnitude ≈ 1.0. When we add them, position contributes 16× more than token identity. The model is almost entirely position, with a tiny whisper of "what token is this?"

The analogy. Imagine mixing a whisper with a shout. The whisper carries critical information (the token's identity), and the shout carries less critical information (where in the sequence we are). Without amplifying the whisper, it gets drowned out. Scaling by √d is turning up the volume on the whisper until it matches the shout.

The Fix: Scale by √d

E_scaled = E[token_id] × √d_model

After multiplying by √d, each element goes from std ≈ 1/√d to std ≈ 1. The vector's magnitude goes from ≈ 1 to ≈ √d, which is the same order as the positional embedding's magnitude of √(d/2).

Hand Calculation: Magnitude Comparison

Let d = 512. A token embedding vector e has d elements, each with std = 1/√512 ≈ 0.0442.

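Without scaling: ‖e‖ ≈ √(512 × 1/512) = √1 = 1.0

Sinusoidal position embedding: ‖p‖ ≈ √(512 × 0.5) = √256 = 16.0

Ratio without scaling: position/token = 16.0/1.0 = 16:1. Position dominates.

With scaling (multiply e by √512 ≈ 22.6): ‖e_scaled‖ ≈ 22.6 × 1.0 = 22.6

Ratio with scaling: position/token = 16.0/22.6 ≈ 0.7:1. Now both signals contribute comparably. The model can distinguish both what a token is and where it is from the very first layer.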
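The same magnitudes can be checked numerically. The sketch below is a minimal pure-Python simulation, assuming Gaussian initialization with std = 1/√d and the standard sin/cos position formula with base 10000; the seed and the position index are arbitrary choices.

```python
import math
import random

def sinusoidal(pos, d):
    """Sinusoidal position vector: sin/cos pairs at geometrically spaced frequencies."""
    vec = []
    for i in range(0, d, 2):
        angle = pos / (10000 ** (i / d))
        vec += [math.sin(angle), math.cos(angle)]
    return vec

def norm(v):
    return math.sqrt(sum(x * x for x in v))

d = 512
rng = random.Random(0)                                         # fixed seed, arbitrary
token = [rng.gauss(0.0, 1 / math.sqrt(d)) for _ in range(d)]   # std = 1/sqrt(d)
position = sinusoidal(pos=7, d=d)                              # any position has the same norm
scaled = [x * math.sqrt(d) for x in token]

print(f"token    norm: {norm(token):.2f}")      # close to 1
print(f"position norm: {norm(position):.2f}")   # sqrt(d/2) = 16.00
print(f"scaled   norm: {norm(scaled):.2f}")     # close to sqrt(d), about 22.6
```

The position norm is exactly √(d/2) at every position because each sin/cos pair contributes sin² + cos² = 1. The token norm varies slightly from draw to draw.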

A Concrete Example

Let's use d = 4 for a tiny example we can trace completely.

Token embedding (initialized with std = 1/√4 = 0.5):

e = [0.3, -0.5, 0.2, 0.4]

Magnitude: √(0.09 + 0.25 + 0.04 + 0.16) = √0.54 ≈ 0.735

Positional embedding (sinusoidal):

p = [0.84, 0.54, 0.91, -0.42]

Magnitude: √(0.71 + 0.29 + 0.83 + 0.18) = √2.01 ≈ 1.418

Without scaling: combined = e + p

[0.3+0.84, -0.5+0.54, 0.2+0.91, 0.4-0.42] = [1.14, 0.04, 1.11, -0.02]

The result is dominated by the position values (0.84, 0.54, 0.91, -0.42). The token's contribution is barely visible — the 0.3 gets swamped by 0.84, the -0.5 nearly cancels with 0.54.

With scaling: e_scaled = e × √4 = e × 2

e_scaled = [0.6, -1.0, 0.4, 0.8]

combined = e_scaled + p

[0.6+0.84, -1.0+0.54, 0.4+0.91, 0.8-0.42] = [1.44, -0.46, 1.31, 0.38]

Now both signals are clearly visible. The -1.0 from the token is preserved alongside the 0.54 from position, giving -0.46 instead of a washed-out 0.04.

When Scaling Is and Isn't Used

Architecture | Scales? | Position Type | Why
Original Transformer | Yes (√d) | Sinusoidal (fixed) | Fixed position values need token magnitudes to match
BERT | No | Learned | Learned positions adapt their scale during training
GPT-2 | No | Learned | Same reason as BERT
Llama | No | RoPE (rotation) | RoPE rotates Q/K, doesn't add to embedding
Gemma | Yes (√d) | RoPE | Google's choice for tied+scaled embeddings
T5 | No | Relative bias | Position is in attention bias, not added to embedding

The pattern: scaling is mainly needed when fixed (sinusoidal) position embeddings are added to small-initialized token embeddings. Modern LLMs using RoPE don't add position to the embedding at all — they rotate the query and key vectors. No addition means no magnitude mismatch.

Not all transformers need embedding scaling. It was in the original "Attention Is All You Need" paper because sinusoidal position embeddings have fixed magnitude ≈√(d/2), while token embeddings initialize small. Modern LLMs using RoPE don't need it — RoPE rotates Q and K instead of adding to the embedding. If you see √d scaling in a codebase, check whether the architecture actually requires it or if it's cargo-culted from the original paper.
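The claim that rotation sidesteps the magnitude mismatch can be checked directly: a rotation never changes a vector's length. The sketch below is a minimal illustration of the rotary idea on a single vector. It assumes the interleaved pairing convention (dimensions 2i and 2i+1 rotated together) and base 10000, and `rope_rotate` is a hypothetical helper. Real implementations differ in layout and apply the rotation to query and key tensors inside attention.

```python
import math

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[i], x[i+1]) by the angle pos * base**(-i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def norm(v):
    return math.sqrt(sum(t * t for t in v))

q = [0.3, -0.5, 0.2, 0.4]   # the small token vector from the d = 4 example
for pos in (0, 1, 50):
    print(pos, round(norm(rope_rotate(q, pos)), 6))   # same norm at every position
```

Position changes only the direction of each pair, so a small-initialized vector is not drowned out by the position signal.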

The Simulation

Adjust the scaling factor below to see how token and position signals combine. At low scale, position dominates (all tokens look similar in the combined representation). At √d, they balance. At very high scale, token dominates and position information is lost.

Embedding Scale Balance

Drag the scale factor. Watch the token-to-position ratio and the combined vectors. The sweet spot is where both bars are roughly equal.


In Code

python
import torch
import torch.nn as nn
import math

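class ScaledEmbedding(nn.Module):
    """Original Transformer embedding with sqrt(d) scaling."""
    def __init__(self, vocab_size, d_model, max_len=5000):
        super().__init__()
        self.d_model = d_model
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        # nn.Embedding defaults to N(0, 1); use std = 1/sqrt(d) as described above
        nn.init.normal_(self.tok_embed.weight, std=d_model ** -0.5)
        # Fixed table, stored as a buffer: moves with the module but is not trained
        self.register_buffer("pos_embed", self._sinusoidal(max_len, d_model))

    @staticmethod
    def _sinusoidal(max_len, d_model):
        """Build the (max_len, d_model) sin/cos table; assumes d_model is even."""
        pos = torch.arange(max_len).unsqueeze(1)     # (max_len, 1)
        freq = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        table = torch.zeros(max_len, d_model)
        table[:, 0::2] = torch.sin(pos * freq)       # even dimensions
        table[:, 1::2] = torch.cos(pos * freq)       # odd dimensions
        return table

    def forward(self, token_ids):
        seq_len = token_ids.size(1)
        tok = self.tok_embed(token_ids)              # (B, T, d)
        tok = tok * math.sqrt(self.d_model)          # SCALE!
        pos = self.pos_embed[:seq_len].unsqueeze(0)  # (1, T, d)
        return tok + pos                              # balanced addition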

The single line tok * math.sqrt(self.d_model) makes the difference between a model that can barely distinguish tokens and one where token identity and position are balanced from the start.

Why does the original Transformer multiply token embeddings by √d_model?

Chapter 6: The Embedding Space Explorer

Everything we've learned — token IDs mapped to vectors, embeddings trained by gradient descent, parameter sharing, magnitude balancing — converges into one central question: what does the learned embedding space actually look like?

Random initialization scatters tokens across the space with no structure. Training sculpts that space: synonyms cluster together, antonyms push apart, and semantic relationships emerge as geometric patterns. The explorer below lets you watch this process unfold in real time.

What You'll See

The canvas shows a 2D projection of a simulated embedding space. Each dot is a token. At step 0 (random initialization), the dots are scattered with no pattern. As training progresses, semantic clusters emerge. Animals drift toward animals. Colors group with colors. The structure you see is the model discovering meaning.

Play with the controls:

What to try. Start with "Animals & Colors" theme, dimension 8, and press Play. Watch the clusters separate. Then switch to dimension 2 and replay to see how the low-dimensional space struggles to separate them cleanly. Switch to "Mixed" theme for a harder clustering problem. Click any token to see its embedding vector as a bar chart on the right.

Embedding Space Explorer

Watch random vectors organize into semantic clusters as training progresses. Click a token to inspect its embedding.


What the Training Is Doing

In a real language model, embedding training happens implicitly through the language modeling objective: predict the next token. Tokens that appear in similar contexts develop similar embeddings because the gradients push them in similar directions.

Our simulation mimics this with a simplified attraction/repulsion model. Tokens in the same semantic category attract each other (their embeddings move closer). Tokens in different categories repel slightly. The result approximates what real language model training produces.

Three things to notice as training progresses:

The dimension slider reveals the information bottleneck. At dim=2, the space literally doesn't have enough directions to separate all clusters cleanly. Some groups must overlap. At dim=32, there's room for nuance — every token gets its own corner. This is why real models use d=768 or d=4096: the embedding space needs to represent the entire vocabulary with all its semantic relationships, and that requires a high-dimensional canvas.

Chapter 7: Subword Embeddings

The word "transformerization" might appear 3 times in your entire training set. Its embedding gets 3 gradient updates — barely better than random. But break it into subwords: "transform," "er," "ization." Each piece appears thousands of times. The composed representation from well-trained subwords is far better than a single undertrained whole-word embedding.

This is the core insight of subword tokenization, and it's why every modern language model uses it.

The Whole-Word Problem

Suppose we give every word its own embedding. English has roughly 170,000 words in common use. Add technical terms, names, misspellings, and multilingual support, and you easily hit a million. That's a million rows in the embedding matrix, most of them barely trained.

Worse: a new word the model has never seen gets no embedding at all. The entire model breaks on a single unfamiliar word. This is the out-of-vocabulary (OOV) problem, and it plagued NLP for decades.

BPE: Byte Pair Encoding

Byte Pair Encoding (Sennrich et al., 2016) solves both problems with a simple algorithm. Start with a character-level vocabulary (every letter is a token). Then repeatedly merge the most frequent adjacent pair into a new token.

Start: Vocabulary = all characters: a, b, c, ..., z, space, etc.
Scan: Find the most frequent adjacent pair in the training text.
Merge: Replace every occurrence of that pair with a new token.
Repeat: Go back to Scan and continue until vocabulary reaches target size (e.g., 32,000).

Hand Calculation: BPE Merges

Let's run BPE on a tiny corpus. Our training text:

text
"the cat sat on the mat the cat"

Step 0: Character tokens

['t','h','e',' ','c','a','t',' ','s','a','t',' ','o','n',' ','t','h','e',' ','m','a','t',' ','t','h','e',' ','c','a','t']

Count all adjacent pairs:

Pair | Count
t, h | 3
h, e | 3
a, t | 4
c, a | 2
e, ␣ | 3
t, ␣ | 3
␣, t | 2
␣, c | 2
␣, m | 1
s, a | 1
others | 1 each

Merge 1: Most frequent pair is "a"+"t" (count 4). Create new token "at".

Result: ['t','h','e',' ','c','at',' ','s','at',' ','o','n',' ','t','h','e',' ','m','at',' ','t','h','e',' ','c','at']

Merge 2: Four pairs are now tied at 3 ("t"+"h", "h"+"e", "e"+"␣", "at"+"␣"). Breaking the tie by first occurrence, we merge "t"+"h" and create "th".

Result: ['th','e',' ','c','at',' ','s','at',' ','o','n',' ','th','e',' ','m','at',' ','th','e',' ','c','at']

Merge 3: "th"+"e" appears 3 times. Create "the".

Result: ['the',' ','c','at',' ','s','at',' ','o','n',' ','the',' ','m','at',' ','the',' ','c','at']

After just 3 merges, "the" is a single token. High-frequency words collapse quickly. Rare words stay decomposed into common pieces.
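Once merges are learned, any new word is encoded by starting from its characters and replaying the merges in the order they were learned. The sketch below is a minimal illustration using only the three merges from the hand calculation; `apply_merges` is a hypothetical helper, not a production tokenizer.

```python
def apply_merges(word, merges):
    """Encode a word by replaying learned merges in the order they were learned."""
    tokens = list(word)                       # start from single characters
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)   # fuse the pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# The three merges from the hand calculation above.
merges = [("a", "t"), ("t", "h"), ("th", "e")]

for word in ["the", "that", "hat", "zebra"]:
    print(word, "->", apply_merges(word, merges))
```

A word never seen in training, such as "that", still gets a valid encoding (`['th', 'at']`), and a word with no learned pairs falls back to single characters rather than failing.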

Why Subword Embeddings Are Better

Consider the word "unhappiness" (appears rarely) versus its subwords:

Piece | Frequency | Gradient Updates | Embedding Quality
"unhappiness" (whole word) | ~50 | ~50 | Poor (barely trained)
"un" (prefix) | ~500,000 | ~500,000 | Excellent
"happi" (stem) | ~100,000 | ~100,000 | Excellent
"ness" (suffix) | ~300,000 | ~300,000 | Excellent

The subword approach gives us three well-trained embeddings instead of one barely-trained one. The transformer's attention layers then compose these subword embeddings into a word-level representation. Attention learns that "un" means negation, "happi" carries the core meaning, and "ness" marks a noun. This compositional understanding transfers to every word with these subwords — "unhelpfulness," "happiness," "sadness" all benefit.

Subword embeddings are compositional, not concatenated. Each subword gets its own embedding vector (looked up from the embedding matrix just like any token). These vectors are then processed by the transformer's attention layers, which learn to compose subword meanings into word meanings. The embedding layer provides the raw ingredients; attention does the cooking.

The Vocabulary Size Tradeoff

The number of BPE merges determines the vocabulary size. This is a critical hyperparameter with opposing forces:

Factor | Small V (8K) | Medium V (32K) | Large V (100K)
Tokens per word | 3-5 (many subwords) | 1-2 (fewer subwords) | 1 (often whole word)
Sequence length | Long (more tokens per sentence) | Moderate | Short (fewer tokens)
Subword frequency | Very high (well-trained) | High | Low for rare tokens
Embedding table size | 8K × d (small) | 32K × d (moderate) | 100K × d (large)
OOV handling | Never (can spell anything) | Never | Never (but rare tokens = poor)
Compute cost | High (attention is O(n²)) | Balanced | Lower per-token

Most modern LLMs settle on V = 32K-128K as the sweet spot. Llama 1 and Llama 2 use 32K. GPT-4 uses ~100K. Llama 3 moved to 128K to improve multilingual coverage (more languages means more unique subwords needed).
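The opposing forces can be put into rough numbers. The sketch below assumes d = 768, a hypothetical 20-word sentence, and tokens-per-word values taken as rough midpoints of the ranges in the table. None of these figures are measurements.

```python
D_MODEL = 768          # assumed embedding width
WORDS = 20             # hypothetical sentence length in words

# Tokens-per-word values are rough midpoints of the ranges in the table above.
configs = {
    "8K":   dict(vocab=8_000,   tokens_per_word=4.0),
    "32K":  dict(vocab=32_000,  tokens_per_word=1.5),
    "100K": dict(vocab=100_000, tokens_per_word=1.0),
}

rows = {}
for name, cfg in configs.items():
    table_params = cfg["vocab"] * D_MODEL            # rows x columns of the table
    seq_len = round(WORDS * cfg["tokens_per_word"])  # tokens in the sentence
    attn_pairs = seq_len ** 2                        # attention compares every pair
    rows[name] = (table_params, seq_len, attn_pairs)
    print(f"V={name:>4}: table {table_params / 1e6:5.1f}M params, "
          f"{seq_len:3d} tokens, {attn_pairs:5d} attention pairs")
```

Under these assumptions the smallest vocabulary has the smallest table but 16 times as many attention pairs as the largest, which is the tension the table describes.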

The Simulation

Type any word below to see how it gets tokenized under different vocabulary sizes. The simulation shows which subwords are well-trained (green, high frequency) versus barely trained (red, low frequency). Notice how smaller vocabularies split words into more pieces, but each piece is individually better trained.

Subword Tokenizer Explorer

Type a word and see how BPE splits it at different vocabulary sizes. Green subwords are well-trained; red ones are rare.


In Code

python
# Simplified BPE from scratch
def get_pairs(tokens):
    """Count all adjacent pairs."""
    pairs = {}
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i+1])
        pairs[pair] = pairs.get(pair, 0) + 1
    return pairs

def merge(tokens, pair, new_token):
    """Replace every occurrence of pair with new_token."""
    result = []
    i = 0
    while i < len(tokens):
        if i < len(tokens)-1 and tokens[i]==pair[0] and tokens[i+1]==pair[1]:
            result.append(new_token)
            i += 2
        else:
            result.append(tokens[i])
            i += 1
    return result

# Run BPE on the corpus from the hand calculation
text = "the cat sat on the mat the cat"
tokens = list(text)            # start with characters
num_merges = 10

for _ in range(num_merges):
    pairs = get_pairs(tokens)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # ties go to the pair seen first
    new_tok = best[0] + best[1]
    tokens = merge(tokens, best, new_tok)
    print(f"Merge: '{best[0]}' + '{best[1]}' -> '{new_tok}'  Tokens: {tokens}")

python
# Production tokenizers: tiktoken (GPT) and sentencepiece (Llama)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("transformerization")
print([enc.decode([t]) for t in tokens])
# e.g. ['transform', 'er', 'ization'] (the exact split depends on the encoding)

print(len(tokens))  # number of subword tokens

# Compare: a common word is a single token
print(enc.encode("the"))   # [1820], one token
print(enc.encode("hello")) # [15339], one token

Subword embeddings are NOT just concatenated. Each subword gets its own embedding vector, looked up independently from the embedding table. These separate vectors enter the transformer as separate tokens in the sequence. The attention mechanism then learns to compose subword meanings into word-level representations across layers. This is why "un" + "happi" + "ness" doesn't just mean their embeddings glued together: the transformer has multiple layers to recognize the negation prefix, the emotional stem, and the noun suffix, and to build a coherent whole-word representation from these pieces.

The Embedding Quality Hierarchy

Not all embeddings are created equal. The number of gradient updates a token receives during training directly determines how well its embedding represents its meaning:

Token Type | Training Frequency | Embedding Quality | Example
Stop words | Billions | Excellent | "the", "is", "and"
Common subwords | Millions | Very good | "ing", "tion", "pre"
Common words | Hundred thousands | Good | "transformer", "network"
Rare subwords | Thousands | Adequate | "zyg", "qph"
Single characters | Millions (as fallback) | Good (but carry less meaning) | "x", "z", "7"

This is why the worst case for subword tokenization is still decent: even a word split into individual characters uses embeddings that have each seen millions of training examples. The model can still compose meaning from well-trained character embeddings — it just takes more layers of attention to do so.

Why does subword tokenization produce better embeddings for rare words than whole-word tokenization?

Chapter 8: Embedding Strategy Arena

You've learned six different ways to build embedding layers: small vocabularies with tied weights, large vocabularies with untied weights, subword BPE, character-level, and two patch sizes for vision. Each has tradeoffs in parameter count, embedding quality, sequence length, and compute cost. But which one wins on a real task?

That depends entirely on the task. The arena below lets you race all six strategies head-to-head across four different challenges. No single strategy dominates everywhere — that's the whole point of understanding the tradeoffs.

The Six Strategies

Strategy | V | d | Params | Tokens/Input
Small V + Tied | 8,000 | 768 | 6.1M (shared) | ~40 (long sequences)
Large V + Untied | 128,000 | 4,096 | 1,049M (two matrices) | ~15 (short sequences)
Subword BPE (32K) | 32,000 | 768 | 24.6M | ~25 (balanced)
Character-Level | 256 | 768 | 0.2M | ~120 (very long)
Patch 16 (ViT) | N/A | 768 | 0.6M (projection) | 196 patches
Patch 32 (ViT) | N/A | 768 | 2.4M (projection) | 49 patches
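The Params column follows from two formulas: V × d per lookup table, and (p × p × 3) × d for a patch projection. The sketch below recomputes it. The helper names are hypothetical, the patch projection bias is omitted, and the 224-pixel image size is an assumption consistent with 196 patches at patch size 16.

```python
def lookup_params(vocab, d, matrices=1):
    """Parameters in `matrices` embedding tables of shape (vocab, d)."""
    return vocab * d * matrices

def patch_params(patch, d, channels=3):
    """Weights of a linear patch projection, bias omitted: (patch * patch * channels) x d."""
    return patch * patch * channels * d

def num_patches(image_size, patch):
    """Number of non-overlapping patches in a square image."""
    return (image_size // patch) ** 2

IMAGE_SIZE = 224  # assumed input resolution

params = {
    "Small V + Tied":    lookup_params(8_000, 768),                   # one shared matrix
    "Large V + Untied":  lookup_params(128_000, 4_096, matrices=2),   # input + output
    "Subword BPE (32K)": lookup_params(32_000, 768),
    "Character-Level":   lookup_params(256, 768),
    "Patch 16 (ViT)":    patch_params(16, 768),
    "Patch 32 (ViT)":    patch_params(32, 768),
}

for name, n in params.items():
    print(f"{name:<18} {n / 1e6:8.1f}M")
print(num_patches(IMAGE_SIZE, 16), num_patches(IMAGE_SIZE, 32))
```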

The Four Tasks

Each task favors different strengths. Watch which strategies rise and fall:

Embedding Strategy Arena

Select a task and watch the six strategies compete. Bars show simulated performance scores (higher = better). Click a bar for details on why each strategy scores the way it does.


Reading the Results

No strategy wins every task. That's the deepest lesson here:

There is no universal embedding strategy. The right choice depends on your task, your compute budget, your language coverage, and your sequence length constraints. Subword BPE at V=32K is the most common default because it balances all factors reasonably. But if you need multilingual coverage, you push V higher. If you need vision, you use patches. If you need extreme efficiency, you consider character-level with an efficient attention mechanism. Understanding the tradeoffs — not memorizing a recipe — is what matters.

Chapter 9: Cheat Sheet

Everything from this lesson, compressed into a reference you can come back to.

Decision Flowchart

What is your input?
Text → go to token embeddings. Images → go to patch embeddings.
Text: How many languages?
One language → V=32K is fine. Many languages → V=100K+ for script coverage.
Text: Model size?
Small (<1B params) → tie weights, it's free regularization. Large (>7B) → untying often wins.
Text: Position encoding?
Sinusoidal (fixed) → scale embeddings by √d. Learned or RoPE → no scaling needed.
Images: Resolution needed?
Standard → patch 16 (196 tokens). Speed-critical → patch 32 (49 tokens). Fine detail → patch 8 (784 tokens).

Component Catalog

Component | What It Is | Shape | Key Fact
Token Embedding | Lookup table: integer → vector | (V, d) | No multiplication, just indexing
Position Embedding | Lookup table: position → vector | (max_len, d) | Added to token embedding (not concatenated)
Segment Embedding | Lookup table: segment ID → vector | (2, d) | BERT-specific; GPT and Llama don't use it
Patch Embedding | Conv2d or Linear projection | (patch_dim, d) | Equivalent to Conv2d(3, d, kernel=p, stride=p)
Weight Tying | Share E between input and output | Saves V×d params | lm_head.weight = embed.weight (same tensor)
Embedding Scaling | Multiply token emb by √d | Scalar multiply | Only needed with fixed sinusoidal positions
[CLS] Token | Learnable classification token | (1, d) | Prepended to patch sequence in ViT

PyTorch Quick Reference

python
import torch
import torch.nn as nn
import math

# 1. Basic token embedding
emb = nn.Embedding(num_embeddings=32000, embedding_dim=768)
vectors = emb(torch.tensor([42, 1337]))  # shape: (2, 768)

# Example inputs for the snippets below (shapes chosen for illustration)
ids = torch.tensor([[42, 1337]])         # (B=1, T=2) token IDs
img = torch.randn(1, 3, 224, 224)        # (B, C, H, W) image

# 2. Weight tying
lm_head = nn.Linear(768, 32000, bias=False)
lm_head.weight = emb.weight  # same tensor in memory

# 3. Embedding scaling (original Transformer)
scaled = emb(ids) * math.sqrt(768)  # multiply by sqrt(d_model)

# 4. Patch embedding (ViT)
patch_emb = nn.Conv2d(3, 768, kernel_size=16, stride=16)
patches = patch_emb(img).flatten(2).transpose(1, 2)  # (B, 196, 768)

# 5. BERT-style combined embedding
pos_emb = nn.Embedding(512, 768)                     # learned positions
seg_emb = nn.Embedding(2, 768)                       # segment A / B
positions = torch.arange(ids.size(1)).unsqueeze(0)   # (1, T)
segment_ids = torch.zeros_like(ids)                  # all tokens in segment A
x = emb(ids) + pos_emb(positions) + seg_emb(segment_ids)

# 6. Subword tokenization (production)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("transformerization")  # list of integer token IDs

Key Numbers to Remember

Model | V | d | Emb Params | Tied? | Position
GPT-2 Small | 50,257 | 768 | 38.6M | Yes | Learned
BERT-Base | 30,522 | 768 | 23.4M | Yes | Learned
Llama 2 7B | 32,000 | 4,096 | 131M | No | RoPE
Llama 3 8B | 128,256 | 4,096 | 525M | No | RoPE
ViT-B/16 | N/A | 768 | 0.6M | N/A | Learned

Connections

Embedding layers are the first thing that happens in every neural network that processes discrete inputs. Understanding them unlocks everything downstream:

"What I cannot create, I do not understand." — Richard Feynman. You now know how to build an embedding layer from scratch: the lookup table, the addition of position and segment signals, the scaling trick, the weight tying optimization, the Conv2d patch projection, and the subword tokenization that makes it all work for real language. Every transformer starts here.

A model uses V=100,000, d=2,048, tied weights, and learned position embeddings (max_len=4,096). How many total embedding parameters does it have?