Introduction
Language models operate on numbers. Not words, not characters, not meaning — just sequences of integers fed through matrix multiplications. The translation layer between human text and this numerical reality is what we call tokenization and embedding, and it is simultaneously the simplest and most consequential part of the entire LLM stack.
Simple, because conceptually it's just a lookup table: each token maps to a position in a vocabulary, and each vocabulary position maps to a learned vector. Consequential, because every quirk of this translation — every artificial boundary, every merged subword, every dimension of every vector — echoes through every downstream capability and failure mode.
Why does GPT-4 sometimes struggle with simple arithmetic on numbers above 1,000? Tokenization. Why does BERT treat "tokenization" differently from "token" + "##ization"? Tokenization. Why do LLMs sometimes fail to count letters in words? Tokenization. Why does Claude handle code and multilingual text differently than earlier models? Tokenization.
The Tokenization Problem
A transformer model takes a fixed-length sequence of vectors as input. Before any learning can happen, we need to answer a deceptively hard question: what is the basic unit of text?
This question has no perfect answer. Language doesn't come pre-segmented. The gaps between words are a convention of written English, absent in many languages (Thai, Japanese, Chinese, and ancient texts all lack word boundaries). Even in English, is "don't" one unit or two? Is "New York" one or two? Is "C++" a single token or three?
Three broad approaches have emerged, each with distinct tradeoffs:
Character-level tokenization
The most granular approach: each character becomes a token. "Hello" becomes
['H','e','l','l','o']. The vocabulary is tiny — fewer than a hundred tokens cover printable
ASCII, 256 cover every possible byte value, and a few thousand cover Unicode's practical subset. This simplicity is appealing.
The problem: sequences become very long. A 1,000-word article might span 5,000–6,000 characters. Transformer attention complexity scales quadratically with sequence length, so character-level models are prohibitively expensive to train at scale. The model must also learn to assemble characters into meaningful units entirely from scratch — "machine" requires the model to figure out that m-a-c-h-i-n-e is a concept, without any prior signal.
Character-level models exist (CharRNN, some multilingual models), but no major production LLM uses pure character-level tokenization.
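The length blow-up is easy to quantify in two lines (the sentence here is arbitrary):

```python
text = "The transformer architecture processes text as tokens."
chars = list(text)    # one token per character
words = text.split()  # one token per word, for comparison
print(f"{len(words)} word tokens vs {len(chars)} character tokens "
      f"({len(chars) / len(words):.1f}x longer)")
```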
Word-level tokenization
Split on whitespace and punctuation. "The cat sat." becomes ['The', 'cat', 'sat', '.'].
This maps naturally to how humans think about language.
But it fails in several ways. English has over 170,000 distinct words in active use, and technical
writing, code, and proper nouns expand this dramatically. A vocabulary of 100,000 tokens requires a
100,000-dimensional embedding matrix — already enormous. Worse: any word not in the vocabulary
must be represented as a special <UNK> token, losing all meaning. "Tokenization"
and "tokenize" become two completely independent, unrelated tokens despite sharing obvious meaning.
Morphological richness kills word-level approaches. In Finnish or Turkish, the same root word can produce hundreds of grammatically distinct surface forms. A word-level tokenizer would need a vocabulary of millions just for one morphologically rich language — impossible for multilingual models.
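A word-level tokenizer and its <UNK> failure mode can be sketched in a few lines (the vocabulary here is deliberately tiny):

```python
def word_tokenize(text, vocab):
    """Whitespace split; anything outside the vocabulary collapses to <UNK>."""
    return [w if w in vocab else "<UNK>" for w in text.lower().split()]

vocab = {"the", "cat", "sat", "on", "mat"}
print(word_tokenize("The cat sat on the mat", vocab))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(word_tokenize("The cat sat on the tokenizer", vocab))
# ['the', 'cat', 'sat', 'on', 'the', '<UNK>']
```

Every out-of-vocabulary word — "tokenizer" above — loses all of its meaning, which is exactly the problem subword methods solve.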
Subword tokenization — the sweet spot
Subword tokenization decomposes words into frequently-occurring fragments. Common words like "the", "is", "a" get their own tokens. Rare or novel words split into subword pieces.
"Tokenization" might become ['token', 'ization']. "Unbelievable" might become
['un', 'believ', 'able']. A word never seen during training — say, "grokking" —
might become ['gro', 'kk', 'ing']. As a last resort, individual bytes handle any
possible input without ever needing an <UNK> token.
This gives us vocabularies of 32,000–100,000 tokens that can represent any text, with sequences roughly 3–5× shorter than character-level, and with meaningful subword units that help the model generalize across morphological variants.
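To make the idea concrete, here is a toy greedy longest-match-first segmenter (WordPiece-style inference, heavily simplified; the vocabulary is made up). Real tokenizers apply learned merge rules or likelihood scores, but the decompose-into-pieces behavior looks like this:

```python
def greedy_subword(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # character/byte fallback — never <UNK>
            i += 1
    return pieces

vocab = {"token", "ization", "un", "believ", "able", "ing"}
print(greedy_subword("tokenization", vocab))  # ['token', 'ization']
print(greedy_subword("unbelievable", vocab))  # ['un', 'believ', 'able']
```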
Byte Pair Encoding
In 1994, Philip Gage published a simple data compression trick: scan through a byte sequence, find the most frequently occurring pair of adjacent bytes, replace every occurrence with a new unused byte, and record the substitution. Repeat until you hit a size target. The file shrinks because common patterns get encoded with single symbols.
Two decades later, Rico Sennrich and colleagues adapted this algorithm for neural machine translation (Sennrich et al., 2016). The context changed from byte compression to vocabulary construction, but the insight was identical: common sequences deserve dedicated representations.
The Algorithm, Step by Step
BPE builds a vocabulary by iteratively merging the most frequent adjacent token pair in a training corpus. Starting from individual characters (or bytes), it grows the vocabulary by repeatedly combining common neighbors.
Here's the procedure formally:
1. Initialize the vocabulary with all unique characters in the corpus (plus a special end-of-word marker, commonly ▁ or </w>).
2. Represent each word in the corpus as a sequence of characters from this vocabulary.
3. Count all adjacent pair frequencies across the entire corpus (weighted by word frequency).
4. Merge the most frequent pair into a single new token, adding it to the vocabulary.
5. Update all corpus representations to use the new merged token.
6. Repeat steps 3–5 for N merge operations, where N is your target vocabulary size minus the initial character set.
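The steps above fit in a short function. A toy trainer, for illustration only (the corpus is the classic low/lower/newest/widest example from the BPE literature; ties are broken by taking the first maximal pair):

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer: learn merge rules from a whitespace-split corpus."""
    # Each word is a tuple of symbols ending in an end-of-word marker.
    vocab = Counter()
    for word in corpus.split():
        vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Step 3: count adjacent pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Step 4: pick the most frequent pair.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Step 5: rewrite every word using the new merged token.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = "low low low lower lower newest newest newest newest widest"
merges = train_bpe(corpus, 4)
print(merges)  # first merge is ('w', 'e') — the most frequent adjacent pair
```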
The result is a vocabulary where frequent words appear as single tokens, common morphemes and
subwords appear as merged tokens, and rare words decompose into smaller pieces. GPT-2 uses 50,000
merges; GPT-4's cl100k_base uses approximately 100,000.
Byte-level BPE
GPT-2 introduced an important refinement: instead of working with Unicode characters, it works with
raw UTF-8 bytes. Every possible byte value (0–255) is in the initial vocabulary. This eliminates the
need for an <UNK> token entirely — any possible text can be represented as a
sequence of bytes, which can then be merged upward through BPE.
Byte-level BPE also handles multilingual text gracefully. A Chinese character that's rare in training data doesn't get a dedicated token; it decomposes into 3 UTF-8 bytes which can be learned and merged independently.
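The byte decomposition is easy to inspect directly (the character chosen here is arbitrary):

```python
ch = "狐"  # a single CJK character ("fox")
raw = ch.encode("utf-8")
print(len(raw), list(raw))  # 3 [231, 139, 144]
# Byte-level BPE starts from these three byte tokens; frequent byte
# sequences can then be merged upward into larger tokens.
```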
GPT models often struggle with arithmetic on large numbers because of how BPE tokenizes them.
"12,345,678" might tokenize as ['12', ',', '345', ',', '678'] — splitting at arbitrary
digit boundaries. The model has no guarantee that adjacent digit-tokens represent adjacent powers of
ten. More reliable arithmetic requires either fine-tuning on arithmetic data or tool use to
offload calculations.
WordPiece & SentencePiece
BPE isn't the only subword algorithm. Two others deserve mention because they're used by major models you'll encounter in practice.
WordPiece (BERT, DistilBERT, ELECTRA)
WordPiece, developed at Google (Schuster & Nakajima, 2012; Devlin et al., 2018), is structurally
similar to BPE but with a different merge criterion. Instead of merging the most frequent
pair, it merges the pair that maximizes the language model likelihood of the training corpus.
Formally, it prefers merging A and B when the score
freq(AB) / (freq(A) × freq(B)) is highest — choosing pairs that are common together
but individually less common (i.e., collocations that deserve a merged representation).
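A minimal sketch of this scoring rule, with a made-up frequency table, shows how it can disagree with raw pair frequency:

```python
from collections import Counter

def wordpiece_best_pair(word_freqs):
    """Pick a merge by score = freq(AB) / (freq(A) * freq(B))."""
    unit_freq, pair_freq = Counter(), Counter()
    for symbols, freq in word_freqs.items():
        for s in symbols:
            unit_freq[s] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_freq[pair] += freq
    return max(pair_freq,
               key=lambda p: pair_freq[p] / (unit_freq[p[0]] * unit_freq[p[1]]))

# Made-up corpus: each word as a symbol tuple, with its count
word_freqs = {
    ("u", "n", "d", "o"): 6,
    ("u", "n", "i", "t"): 4,
    ("d", "o", "g"): 8,
}
print(wordpiece_best_pair(word_freqs))  # ('i', 't')
```

Plain BPE would merge ('d', 'o') here — the most frequent pair at 14 occurrences — but WordPiece prefers ('i', 't'), which always co-occur relative to how often 'i' and 't' appear individually.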
WordPiece uses a distinctive notation: continuation subwords are prefixed with ##.
So "tokenization" tokenizes as ['token', '##ization'] in BERT. The ##
signals that this piece attaches to a preceding token, not a word boundary.
SentencePiece (T5, LLaMA, Mistral, Gemma)
SentencePiece (Kudo & Richardson, 2018) takes a different architectural approach: it treats the
input as a raw stream of Unicode characters with no pre-tokenization step. Most tokenizers
first split on whitespace/punctuation, then apply BPE within words. SentencePiece operates on the
raw text, treating spaces as regular characters (representing them as ▁).
This makes SentencePiece truly language-agnostic — it works on Japanese, Thai, Arabic, and mixed
code/text without needing language-specific pre-processing rules. The ▁ prefix on
tokens indicates a word-initial position (e.g., ▁token, ization).
SentencePiece supports both BPE and a different algorithm called Unigram Language Model, which starts with a large vocabulary and prunes it down by removing tokens that decrease corpus likelihood least. LLaMA 2 uses SentencePiece BPE with 32,000 tokens; LLaMA 3 switched to tiktoken's BPE with 128,256 tokens to improve multilingual and code coverage.
| Tokenizer | Algorithm | Used by | Continuation marker | Pre-tokenization |
|---|---|---|---|---|
| tiktoken BPE | Byte-level BPE | GPT-2, GPT-3, GPT-4, LLaMA 3 | Ġ (space prefix) | Regex split |
| WordPiece | Likelihood-based BPE | BERT, DistilBERT, ELECTRA | ## (continuation) | Whitespace + punct |
| SentencePiece BPE | BPE on raw text | LLaMA 1/2, Mistral, T5 | ▁ (word start) | None (raw text) |
| SentencePiece Unigram | Unigram LM pruning | T5, mT5, XLNet | ▁ (word start) | None (raw text) |
Special Tokens
Beyond regular vocabulary, every tokenizer includes special tokens — reserved symbols that serve structural roles during training and inference. These are never produced by the BPE merge process; they're explicitly added to the vocabulary.
Common special tokens and their roles:
- <|endoftext|> — GPT's document boundary marker. Training data concatenates documents separated by this token, so the model learns to predict across natural end-of-document positions.
- [CLS], [SEP] — BERT's classification and separator tokens. [CLS] is prepended to every input; its final hidden state is used as a sequence-level representation. [SEP] separates two input sequences in tasks like sentence-pair classification.
- [PAD] — Padding token used to make all sequences in a batch the same length, required for efficient batch processing.
- [MASK] — BERT's masked language modeling token. During training, 15% of tokens are replaced with [MASK] and the model learns to predict the original.
- <|im_start|>, <|im_end|> — ChatML-format role markers used by OpenAI and others to structure conversations in instruction-tuned models.
- <s>, </s> — LLaMA/SentencePiece's beginning- and end-of-sequence tokens.
Chat models' special tokens are particularly important — they define the template that structures
system prompts, user messages, and assistant responses. This is why you can't just feed raw text
to an instruction-tuned model; the chat template (applied by apply_chat_template()
in HuggingFace) wraps your content in the right structural tokens before tokenization.
From Tokens to Vectors
Tokenization gives us a sequence of integer IDs — say, [9906, 11, 1917, 0] for
"Hello, world!". But matrix multiplications don't operate on integers. The model needs
continuous, dense vectors. The embedding layer is the bridge.
One-hot encoding: the naive baseline
The simplest representation: a vector with one dimension per vocabulary token, all zeros except a single 1 at the index of this token. For a vocabulary of 50,257 tokens, every token becomes a 50,257-dimensional sparse vector.
This is computationally wasteful (50,000+ dimensions almost entirely unused), but more fundamentally it encodes no relationships. The cosine similarity between any two distinct one-hot vectors is exactly zero. "cat" and "dog" are as orthogonal as "cat" and "photosynthesis". All semantic structure must be discovered by layers stacked on top, with no useful prior in the representation itself.
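A few lines make the orthogonality concrete:

```python
vocab = ["cat", "dog", "tree", "photosynthesis"]
# One-hot: a 1 at the word's index, 0 everywhere else.
onehot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
          for i, w in enumerate(vocab)}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

# Every distinct pair is exactly orthogonal — zero similarity structure.
print(cosine(onehot["cat"], onehot["dog"]))             # 0.0
print(cosine(onehot["cat"], onehot["photosynthesis"]))  # 0.0
```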
Dense embeddings: learned geometry
The solution is to learn a dense representation. Instead of mapping each token to a sparse identity vector, we map it to a dense vector in a much smaller space — typically 256 to 8,192 dimensions depending on model scale.
This mapping is stored in an embedding matrix E of shape
(vocab_size, d_model). For GPT-2: (50,257, 768), containing roughly 38
million learnable parameters. Looking up a token's embedding is simply indexing into this matrix:
e = E[token_id].
These embeddings aren't initialized with any semantic knowledge — they start as random noise. During training, gradient descent adjusts every embedding such that the model minimizes prediction loss. Tokens that appear in similar contexts end up with similar vectors, because the model learns to process them similarly. This is the distributional hypothesis in action: "You shall know a word by the company it keeps." (Firth, 1957)
Word2Vec: Where Embedding Geometry Began
Word2Vec (Mikolov et al., 2013) isn't used in modern transformer LLMs, but it remains the conceptual foundation for understanding why learned embeddings work. It demonstrated that a neural network trained on a simple prediction task would, as a side effect, learn vectors with remarkable geometric structure.
Word2Vec comes in two flavors with a key architectural difference:
CBOW — Continuous Bag of Words
Given the surrounding context words, predict the target word. For the sentence "The quick brown [fox] jumps over the lazy dog":
Input: ["The", "quick", "brown", "jumps", "over"] → Predict: "fox"
CBOW is faster to train and works better for frequent words. The "bag of words" part means context word order is ignored — only their averaged embeddings matter.
Skip-gram
The inverse: given a target word, predict each of the surrounding context words.
Input: "fox" → Predict: ["The", "quick", "brown", "jumps", "over"]
Skip-gram trains slower but produces better embeddings for rare words. By training to predict multiple contexts from a single center word, the model is forced to encode rich co-occurrence information into each vector.
Why the task creates meaningful geometry
Here's the insight: the model has no capacity to memorize each word's individual contexts — it must generalize. Two words that appear in nearly identical contexts ("cat" and "dog" both appear near "pet", "fur", "cute", "feeding", etc.) must end up with nearly identical embeddings, because they need to produce similar context predictions. The geometry of the embedding space directly reflects the distributional statistics of the corpus.
This produces Word2Vec's famous analogy property:
king − man + woman ≈ queen
Paris − France + Germany ≈ Berlin
walking − walk + swim ≈ swimming
The "king − man + woman" operation subtracts the "male" direction from "king" and adds the "female" direction, landing near "queen". This isn't explicitly programmed — it emerges from the distributional statistics of English text.
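The arithmetic itself is easy to demonstrate. The vectors below are hand-built in 3 dimensions purely for illustration — real Word2Vec embeddings are learned, hundreds of dimensions wide, and the analogy only holds approximately:

```python
import math

# Contrived axes for illustration: [royalty, maleness, fruitiness]
vecs = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# king − man + woman, then find the nearest word (excluding the query word)
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
best = max((w for w in vecs if w != "king"), key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```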
Simplified 2D projection of word embeddings: words cluster by semantic category, and the king − man + woman ≈ queen analogy appears as a consistent offset between clusters.
Transformer Token Embeddings
In a transformer, the embedding layer is a straightforward lookup table: an nn.Embedding
module in PyTorch that wraps a learnable matrix. The forward pass is just a gather operation —
index into the matrix by token ID to retrieve the corresponding row.
import torch
import torch.nn as nn
vocab_size = 50_257 # GPT-2 vocabulary size
d_model = 768 # GPT-2 base hidden dimension
# The embedding matrix: shape (50257, 768)
# ~38.6M learnable parameters
token_embedding = nn.Embedding(vocab_size, d_model)
# Lookup: shape (batch, seq_len) → (batch, seq_len, d_model)
token_ids = torch.tensor([[9906, 11, 1917, 0]]) # "Hello, world!"
x = token_embedding(token_ids) # [1, 4, 768]
# Equivalent to direct matrix indexing:
assert torch.allclose(x, token_embedding.weight[token_ids])
A critical implementation detail: most large language models use weight tying. The embedding matrix E (input) and the unembedding matrix U (the final linear layer that projects back to vocabulary logits) share the same weights.
This has two motivations. First, it halves the parameter count at the vocabulary interface — for GPT-3, this saves ~617M parameters. Second, there's a conceptual justification: if token A should predict token B, then A and B should be similar in embedding space. Using the same matrix for both encoding and decoding enforces this consistency.
With weight tying, the output logits are computed as logits = h · Eᵀ, where h is the final hidden state and E is the same matrix used for token lookup. The predicted probability of next token j is then softmax(logits)[j].
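A minimal PyTorch sketch of weight tying (the sizes are made up; HuggingFace models expose the same behavior via the tie_word_embeddings config flag):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
token_emb = nn.Embedding(vocab_size, d_model)

# The unembedding projection reuses the embedding matrix: one shared
# (vocab_size, d_model) parameter tensor, not a copy.
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = token_emb.weight

hidden = torch.randn(2, 8, d_model)  # final hidden states (batch, seq, d_model)
logits = lm_head(hidden)             # (batch, seq, vocab_size)
print(logits.shape)
assert lm_head.weight is token_emb.weight  # same tensor
```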
The Position Problem
Token embeddings encode what each token is, but not where it appears in the sequence. This is a fundamental problem for transformers.
Self-attention is permutation-equivariant: if you shuffle the input tokens, the attention outputs shuffle in exactly the same way. The model has no intrinsic sense of order. "The cat sat on the mat" and "mat the on sat cat The" would, without positional information, produce identical attention patterns and identical intermediate representations for each token.
The original transformer paper (Vaswani et al., 2017) solved this by adding a positional encoding vector to each token embedding before feeding into the first attention layer. Different positions get different positional vectors, and the model learns to use this positional information to understand sequence order.
Sinusoidal Positional Encoding
Vaswani et al.'s original solution uses a fixed (non-learned) sinusoidal formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where pos is the token's position in the sequence (0, 1, 2, …) and i
is the dimension index. Even dimensions use sine, odd dimensions use cosine.
The intuition: think of position as a number written in a mixed-frequency numeral system. Each
dimension pair (2i, 2i+1) operates at a different frequency — low dimensions change
slowly (long wavelength, capturing coarse position), high dimensions change quickly (short
wavelength, capturing fine position). Together, they form a unique "fingerprint" for every
position up to sequence lengths far beyond training data.
Key properties of sinusoidal PE:
- No learned parameters — entirely determined by the formula.
- Bounded — all values in [-1, 1], same scale as embeddings.
- Relative positions have a consistent linear relationship — there exists a linear transformation T such that PE(pos+k) = T · PE(pos) for any offset k.
- Extrapolates beyond training length — any position can be encoded, even positions never seen during training.
Each column is a position (0–127), each row is a dimension. Notice how low dimensions oscillate slowly (capturing coarse position) while high dimensions oscillate quickly (capturing fine position).
Rotary Positional Embeddings (RoPE)
The problem with additive positional encodings (both sinusoidal and learned) is subtle but consequential: after adding position to the token embedding, the two signals are entangled. The attention mechanism computes dot products between query and key vectors:
qm · kn, where qm = WQ(xm + pm) and kn = WK(xn + pn)
When position is added to the embedding, the query-key dot product at positions m and n mixes
absolute position signals in a way that makes relative position reasoning non-trivial. Ideally,
what we want is for the dot product qm · kn to depend only
on the relative position m - n, not on the absolute values of m and n.
RoPE (Su et al., 2021, "RoFormer") achieves exactly this by encoding position as a rotation applied to query and key vectors, rather than an addition to the embedding.
The key insight: if you rotate a query vector by angle θm and a key vector by angle θn, their dot product depends only on the difference θm − θn (relative position), because rotations preserve dot products up to the relative angle:
(R(θm) q) · (R(θn) k) = q · (R(θn − θm) k)
The rotation is applied in pairs of dimensions. For each pair of dimensions (2i, 2i+1) at position m:
x′2i = x2i cos(m·θi) − x2i+1 sin(m·θi)
x′2i+1 = x2i sin(m·θi) + x2i+1 cos(m·θi)
where θi = 10000^(−2i/d)
The rotation frequency θi follows the same 10000^(−2i/d) schedule as the original sinusoidal PE — low dimensions rotate slowly (large wavelength, capturing coarse relative position), high dimensions rotate quickly.
RoPE's advantages:
- Relative position is baked in — attention scores naturally depend on m−n, not m and n individually.
- Better length generalization — models trained on shorter sequences can attend more gracefully to longer ones (with techniques like RoPE scaling / YaRN).
- No extra parameters — like sinusoidal PE, RoPE is determined by the formula.
- Widely adopted — LLaMA (all versions), Mistral, Gemma, Falcon, GPT-NeoX, and most recent models use RoPE.
Each token's query/key vector is rotated by an angle proportional to its position; the relative angle between any two positions stays constant.
Embedding Dimensions & Model Scale
The embedding dimension d_model is one of the primary architectural hyperparameters
of a transformer, and it determines the width of the network's information highway. Every intermediate
representation — queries, keys, values, feed-forward activations — is expressed in this space.
Conventional wisdom holds that d_model should be a multiple of 64 (for efficient
memory alignment on GPUs) and that d_model = n_heads × 64 is a common choice,
giving each attention head 64 dimensions for its query/key/value projections.
| Model | d_model | n_heads | n_layers | Vocab size | Embedding params |
|---|---|---|---|---|---|
| BERT-base | 768 | 12 | 12 | 30,522 | ~23M |
| GPT-2 base | 768 | 12 | 12 | 50,257 | ~39M |
| GPT-2 XL | 1,600 | 25 | 48 | 50,257 | ~80M |
| GPT-3 175B | 12,288 | 96 | 96 | 50,257 | ~617M |
| LLaMA 2 7B | 4,096 | 32 | 32 | 32,000 | ~131M |
| LLaMA 3 8B | 4,096 | 32 | 32 | 128,256 | ~525M |
| LLaMA 2 70B | 8,192 | 64 | 80 | 32,000 | ~262M |
A few observations from this table:
- LLaMA 3's jump from 32K to 128K vocabulary quadrupled the embedding table to ~525M params — a significant fraction of the model's total. Larger vocabularies improve tokenization efficiency (fewer tokens per document = lower inference cost) but increase memory at the vocabulary interface.
- GPT-3's embedding table alone (~617M params) is larger than many smaller models. With weight tying, the unembedding layer is free, but the vocabulary embedding is a real cost.
- Larger d_model gives each attention head more expressive capacity, but compute in the feed-forward layers grows roughly quadratically with d_model (each FFN block projects d_model → 4·d_model and back).
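The embedding-parameter column is just vocab_size × d_model; a quick check against a few rows of the table:

```python
# (vocab_size, d_model) pairs from the table above
configs = {
    "GPT-2 base": (50_257, 768),
    "GPT-3 175B": (50_257, 12_288),
    "LLaMA 2 7B": (32_000, 4_096),
    "LLaMA 3 8B": (128_256, 4_096),
}
for name, (vocab_size, d_model) in configs.items():
    params = vocab_size * d_model
    print(f"{name:12s}: {params / 1e6:6.1f}M embedding parameters")
# GPT-2 base  :   38.6M
# GPT-3 175B  :  617.6M
# LLaMA 2 7B  :  131.1M
# LLaMA 3 8B  :  525.3M
```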
The embedding dimension determines how much information can flow between layers. If d_model
is too small, the representations become an information bottleneck — the model can't retain
all the contextual nuance needed for complex reasoning. Scaling laws research (Kaplan et al., 2020;
Hoffmann et al., 2022) shows that increasing model width (d_model, n_heads) and depth (n_layers)
improves performance predictably — but optimal compute allocation matters. Chinchilla showed that
many earlier models were undertrained for their size.
Code Examples
Theory is best anchored by working code. Here are practical examples covering the full tokenization pipeline using tiktoken (OpenAI's tokenizer) and HuggingFace transformers.
tiktoken — GPT-4 tokenization
import tiktoken
# cl100k_base is used by GPT-4, GPT-3.5-turbo, and text-embedding-ada-002
enc = tiktoken.get_encoding("cl100k_base")
text = "Tokenization is surprisingly fascinating. Let's explore."
tokens = enc.encode(text)
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")
# Inspect each token
for tid in tokens:
token_bytes = enc.decode_single_token_bytes(tid)
print(f" {tid:6d} {token_bytes!r}")
# Decode back to string
assert enc.decode(tokens) == text
# ── Comparing tokenization across models ──────────────────
gpt2_enc = tiktoken.get_encoding("gpt2") # 50,257 tokens
gpt4_enc = tiktoken.get_encoding("cl100k_base") # 100,277 tokens
gpt4o_enc = tiktoken.get_encoding("o200k_base") # 200,019 tokens
code_sample = "def transformer_block(x, attn, ffn):\n return ffn(x + attn(x))"
for name, enc in [("gpt2", gpt2_enc), ("gpt4", gpt4_enc), ("gpt4o", gpt4o_enc)]:
n = len(enc.encode(code_sample))
print(f"{name:8s}: {n} tokens")
# gpt2 : 22 tokens
# gpt4 : 19 tokens (better code tokenization in cl100k)
# gpt4o : 16 tokens (even better in o200k)
HuggingFace tokenizers — comparing BERT, GPT-2, LLaMA
from transformers import AutoTokenizer
# GPT-2 (byte-level BPE, Ġ = space prefix)
gpt2 = AutoTokenizer.from_pretrained("gpt2")
print("GPT-2:", gpt2.tokenize("tokenization is fascinating"))
# ['token', 'ization', 'Ġis', 'Ġfascinating']
# BERT (WordPiece, ## = continuation)
bert = AutoTokenizer.from_pretrained("bert-base-uncased")
print("BERT:", bert.tokenize("tokenization is fascinating"))
# ['token', '##ization', 'is', 'fascinating']
# LLaMA 2 (SentencePiece BPE, ▁ = word start)
llama = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print("LLaMA:", llama.tokenize("tokenization is fascinating"))
# ['▁token', 'ization', '▁is', '▁fas', 'cin', 'ating']
# ── Chat templates ─────────────────────────────────────────
# Modern instruction-tuned models wrap messages in special tokens.
# NEVER skip apply_chat_template for chat models.
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is a transformer?"},
]
# Using a model that supports chat templates (e.g. Llama-3-8B-Instruct)
# chat_enc = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# formatted = chat_enc.apply_chat_template(messages, tokenize=False)
# print(formatted)
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
# You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
# What is a transformer?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Sinusoidal positional encoding in PyTorch
import torch
import math
def sinusoidal_positional_encoding(max_seq_len: int, d_model: int) -> torch.Tensor:
"""
Returns PE matrix of shape (max_seq_len, d_model).
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
"""
PE = torch.zeros(max_seq_len, d_model)
pos = torch.arange(max_seq_len, dtype=torch.float).unsqueeze(1) # (L, 1)
div = torch.exp(
torch.arange(0, d_model, 2, dtype=torch.float) *
(-math.log(10000.0) / d_model)
) # (d/2,)
PE[:, 0::2] = torch.sin(pos * div) # even dims
PE[:, 1::2] = torch.cos(pos * div) # odd dims
return PE # (max_seq_len, d_model)
# ── RoPE implementation ────────────────────────────────────
def apply_rope(x: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
"""
Apply Rotary Positional Embedding to queries or keys.
Args:
x: (batch, heads, seq_len, head_dim)
positions: (batch, seq_len) — position indices
Returns:
x with RoPE applied, same shape as input.
"""
batch, heads, seq_len, head_dim = x.shape
assert head_dim % 2 == 0
# Compute rotation frequencies (same as sinusoidal)
i = torch.arange(0, head_dim, 2, dtype=torch.float, device=x.device)
theta = 1.0 / (10000.0 ** (i / head_dim)) # (head_dim/2,)
# angles[b, s, i] = positions[b, s] * theta[i]
angles = positions.float().unsqueeze(-1) * theta # (batch, seq_len, head_dim/2)
sin = angles.sin().unsqueeze(1) # (batch, 1, seq_len, head_dim/2)
cos = angles.cos().unsqueeze(1) # (batch, 1, seq_len, head_dim/2)
# Split into even and odd dimensions
x_even = x[..., 0::2] # (batch, heads, seq_len, head_dim/2)
x_odd = x[..., 1::2]
# Apply rotation: [x_even, x_odd] → [x_even*cos - x_odd*sin, x_even*sin + x_odd*cos]
x_rot_even = x_even * cos - x_odd * sin
x_rot_odd = x_even * sin + x_odd * cos
# Interleave back
x_out = torch.stack([x_rot_even, x_rot_odd], dim=-1)
return x_out.flatten(-2) # (batch, heads, seq_len, head_dim)
# ── Full embedding layer (token + position) ────────────────
import torch.nn as nn

class TransformerEmbedding(nn.Module):
def __init__(self, vocab_size: int, d_model: int, max_seq_len: int = 2048,
dropout: float = 0.1, use_learned_pos: bool = True):
super().__init__()
self.token_emb = nn.Embedding(vocab_size, d_model)
self.use_learned_pos = use_learned_pos
if use_learned_pos:
# GPT-2 style: learned positional embedding
self.pos_emb = nn.Embedding(max_seq_len, d_model)
else:
# Original transformer: fixed sinusoidal
pe = sinusoidal_positional_encoding(max_seq_len, d_model)
self.register_buffer('pos_enc', pe) # not a parameter
self.dropout = nn.Dropout(dropout)
self.d_model = d_model
def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
# token_ids: (batch, seq_len)
seq_len = token_ids.shape[1]
tok = self.token_emb(token_ids) # (B, L, d_model)
if self.use_learned_pos:
positions = torch.arange(seq_len, device=token_ids.device)
pos = self.pos_emb(positions) # (L, d_model)
else:
pos = self.pos_enc[:seq_len] # (L, d_model)
# Scale token embeddings by sqrt(d_model) — from original paper
return self.dropout(tok * math.sqrt(self.d_model) + pos)
Exploring vocabulary structure
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
print(f"Vocabulary size: {enc.n_vocab:,}") # 100,277
# ── How are numbers tokenized? ─────────────────────────────
# Single-digit numbers are always 1 token.
# cl100k's pre-tokenizer splits longer numbers into chunks of at most
# 3 digits, which don't align with place-value (comma) grouping.
for n in [1, 12, 123, 1234, 12345, 123456, 1234567]:
toks = enc.encode(str(n))
pieces = [enc.decode([t]) for t in toks]
print(f"{n:>8,} → {pieces} ({len(toks)} token{'s' if len(toks)>1 else ''})")
# 1 → ['1'] (1 token)
# 1,234 → ['123', '4'] (2 tokens)
# 12,345 → ['123', '45'] (2 tokens) ← split ignores comma grouping!
# 1,234,567 → ['123', '456', '7'] (3 tokens)
# ── Tokenization pitfalls ──────────────────────────────────
# Counting characters in tokens is non-trivial for the model
# because each token is a different length
word = "strawberry"
toks = enc.encode(word)
for t in toks:
print(repr(enc.decode([t]))) # 'str', 'aw', 'berry'
# 3 tokens — the model can't easily "see" that there are 3 r's!
# ── Multilingual tokenization efficiency ──────────────────
texts = {
"English": "The quick brown fox",
"Spanish": "El rápido zorro marrón",
"Chinese": "快速的棕色狐狸",
"Arabic": "الثعلب البني السريع",
"Code": "def quick_sort(arr): return sorted(arr)",
}
for lang, text in texts.items():
n = len(enc.encode(text))
print(f"{lang:12s}: {n:3d} tokens (for: {text!r})")
# English : 5 tokens — 1 token/word (efficient)
# Spanish : 8 tokens — more tokens for diacritics
# Chinese : 10 tokens — ~3 bytes/char → more tokens
# Arabic : 11 tokens — similar pattern
# Code : 9 tokens — common syntax patterns get dedicated tokens
The multilingual efficiency gap is one reason newer models like LLaMA 3 expanded their vocabulary from 32K to 128K tokens — common Chinese, Japanese, Korean, and Arabic sequences can now get dedicated tokens rather than fragmenting into many byte-level pieces.
Now that we understand how text becomes vectors, the next question is: how do those vectors interact? In Article 02: Attention & Transformer Blocks, we'll dissect the self-attention mechanism — why it works, how queries/keys/values actually compute attention weights, and how the full transformer block assembles these primitives into a system capable of learning language.
References
Seminal papers and key works referenced in this article.
- Sennrich et al. "Neural Machine Translation of Rare Words with Subword Units." ACL, 2016. arXiv
- Kudo & Richardson. "SentencePiece: A simple and language independent subword tokenizer." EMNLP, 2018. arXiv
- Mikolov et al. "Efficient Estimation of Word Representations in Vector Space." ICLR Workshop, 2013. arXiv
- Vaswani et al. "Attention Is All You Need." NeurIPS, 2017. arXiv
- Su et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." 2021. arXiv
- Radford et al. "Language Models are Unsupervised Multitask Learners." OpenAI, 2019.