Introduction
Language models operate on numbers. Not words, not characters, not meaning — just sequences of integers fed through matrix multiplications. The translation layer between human text and this numerical reality is what we call tokenization and embedding, and it is simultaneously the simplest and most consequential part of the entire LLM stack.
Simple, because conceptually it's just a lookup table: each token maps to a position in a vocabulary, and each vocabulary position maps to a learned vector. Consequential, because every quirk of this translation — every artificial boundary, every merged subword, every dimension of every vector — echoes through every downstream capability and failure mode.
Why does GPT-4 sometimes struggle with simple arithmetic on numbers above 1,000? Tokenization. Why does BERT treat "tokenization" differently from "token" + "##ization"? Tokenization. Why do LLMs sometimes fail to count letters in words? Tokenization. Why does Claude handle code and multilingual text differently than earlier models? Tokenization.
The Tokenization Problem
A transformer model takes a fixed-length sequence of vectors as input. Before any learning can happen, we need to answer a deceptively hard question: what is the basic unit of text?
This question has no perfect answer. Language doesn't come pre-segmented. The gaps between words are a convention of written English, absent in many languages (Thai, Japanese, Chinese, and ancient texts all lack word boundaries). Even in English, is "don't" one unit or two? Is "New York" one or two? Is "C++" a single token or three?
Three broad approaches have emerged, each with distinct tradeoffs:
Character-level tokenization
The most granular approach: each character becomes a token. "Hello" becomes
['H','e','l','l','o']. The vocabulary is tiny — fewer than a hundred tokens cover printable
ASCII, 256 cover every possible byte value, and a few thousand cover Unicode's practical subset. This simplicity is appealing.
The problem: sequences become very long. A 1,000-word article might span 5,000–6,000 characters. Transformer attention complexity scales quadratically with sequence length, so character-level models are prohibitively expensive to train at scale. The model must also learn to assemble characters into meaningful units entirely from scratch — "machine" requires the model to figure out that m-a-c-h-i-n-e is a concept, without any prior signal.
Character-level models exist (CharRNN, some multilingual models), but no major production LLM uses pure character-level tokenization.
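The length blow-up is easy to quantify in two lines (the sentence here is arbitrary):

```python
text = "The transformer architecture processes text as tokens."
chars = list(text)    # one token per character
words = text.split()  # one token per word, for comparison
print(f"{len(words)} word tokens vs {len(chars)} character tokens "
      f"({len(chars) / len(words):.1f}x longer)")
```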
Word-level tokenization
Split on whitespace and punctuation. "The cat sat." becomes ['The', 'cat', 'sat', '.'].
This maps naturally to how humans think about language.
But it fails in several ways. English has over 170,000 distinct words in active use, and technical
writing, code, and proper nouns expand this dramatically. A vocabulary of 100,000 tokens requires a
100,000-dimensional embedding matrix — already enormous. Worse: any word not in the vocabulary
must be represented as a special <UNK> token, losing all meaning. "Tokenization"
and "tokenize" become two completely independent, unrelated tokens despite sharing obvious meaning.
Morphological richness kills word-level approaches. In Finnish or Turkish, the same root word can produce hundreds of grammatically distinct surface forms. A word-level tokenizer would need a vocabulary of millions just for one morphologically rich language — impossible for multilingual models.
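A word-level tokenizer and its <UNK> failure mode can be sketched in a few lines (the vocabulary here is deliberately tiny):

```python
def word_tokenize(text, vocab):
    """Whitespace split; anything outside the vocabulary collapses to <UNK>."""
    return [w if w in vocab else "<UNK>" for w in text.lower().split()]

vocab = {"the", "cat", "sat", "on", "mat"}
print(word_tokenize("The cat sat on the mat", vocab))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(word_tokenize("The cat sat on the tokenizer", vocab))
# ['the', 'cat', 'sat', 'on', 'the', '<UNK>']
```

Every out-of-vocabulary word — "tokenizer" above — loses all of its meaning, which is exactly the problem subword methods solve.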
Subword tokenization — the sweet spot
Subword tokenization decomposes words into frequently-occurring fragments. Common words like "the", "is", "a" get their own tokens. Rare or novel words split into subword pieces.
"Tokenization" might become ['token', 'ization']. "Unbelievable" might become
['un', 'believ', 'able']. A word never seen during training — say, "grokking" —
might become ['gro', 'kk', 'ing']. As a last resort, individual bytes handle any
possible input without ever needing an <UNK> token.
This gives us vocabularies of 32,000–100,000 tokens that can represent any text, with sequences roughly 3–5× shorter than character-level, and with meaningful subword units that help the model generalize across morphological variants.
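To make the idea concrete, here is a toy greedy longest-match-first segmenter (WordPiece-style inference, heavily simplified; the vocabulary is made up). Real tokenizers apply learned merge rules or likelihood scores, but the decompose-into-pieces behavior looks like this:

```python
def greedy_subword(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # character/byte fallback — never <UNK>
            i += 1
    return pieces

vocab = {"token", "ization", "un", "believ", "able", "ing"}
print(greedy_subword("tokenization", vocab))  # ['token', 'ization']
print(greedy_subword("unbelievable", vocab))  # ['un', 'believ', 'able']
```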
Byte Pair Encoding
In 1994, Philip Gage published a simple data compression trick: scan through a byte sequence, find the most frequently occurring pair of adjacent bytes, replace every occurrence with a new unused byte, and record the substitution. Repeat until you hit a size target. The file shrinks because common patterns get encoded with single symbols.
Two decades later, Rico Sennrich and colleagues adapted this algorithm for neural machine translation (Sennrich et al., 2016). The context changed from byte compression to vocabulary construction, but the insight was identical: common sequences deserve dedicated representations.
The Algorithm, Step by Step
BPE builds a vocabulary by iteratively merging the most frequent adjacent token pair in a training corpus. Starting from individual characters (or bytes), it grows the vocabulary by repeatedly combining common neighbors.
Here's the procedure formally:
1. Initialize the vocabulary with all unique characters in the corpus (plus a special end-of-word marker, commonly ▁ or </w>).
2. Represent each word in the corpus as a sequence of characters from this vocabulary.
3. Count all adjacent pair frequencies across the entire corpus (weighted by word frequency).
4. Merge the most frequent pair into a single new token, adding it to the vocabulary.
5. Update all corpus representations to use the new merged token.
6. Repeat steps 3–5 for N merge operations, where N is your target vocabulary size minus the initial character set.
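The steps above fit in a short function. A toy trainer, for illustration only (the corpus is the classic low/lower/newest/widest example from the BPE literature; ties are broken by taking the first maximal pair):

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer: learn merge rules from a whitespace-split corpus."""
    # Each word is a tuple of symbols ending in an end-of-word marker.
    vocab = Counter()
    for word in corpus.split():
        vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Step 3: count adjacent pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Step 4: pick the most frequent pair.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Step 5: rewrite every word using the new merged token.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = "low low low lower lower newest newest newest newest widest"
merges = train_bpe(corpus, 4)
print(merges)  # first merge is ('w', 'e') — the most frequent adjacent pair
```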
The result is a vocabulary where frequent words appear as single tokens, common morphemes and
subwords appear as merged tokens, and rare words decompose into smaller pieces. GPT-2 uses 50,000
merges; GPT-4's cl100k_base uses approximately 100,000.
Byte-level BPE
GPT-2 introduced an important refinement: instead of working with Unicode characters, it works with
raw UTF-8 bytes. Every possible byte value (0–255) is in the initial vocabulary. This eliminates the
need for an <UNK> token entirely — any possible text can be represented as a
sequence of bytes, which can then be merged upward through BPE.
Byte-level BPE also handles multilingual text gracefully. A Chinese character that's rare in training data doesn't get a dedicated token; it decomposes into 3 UTF-8 bytes which can be learned and merged independently.
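The byte decomposition is easy to inspect directly (the character chosen here is arbitrary):

```python
ch = "狐"  # a single CJK character ("fox")
raw = ch.encode("utf-8")
print(len(raw), list(raw))  # 3 [231, 139, 144]
# Byte-level BPE starts from these three byte tokens; frequent byte
# sequences can then be merged upward into larger tokens.
```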
GPT models often struggle with arithmetic on large numbers because of how BPE tokenizes them.
"12,345,678" might tokenize as ['12', ',', '345', ',', '678'] — splitting at arbitrary
digit boundaries. The model has no guarantee that adjacent digit-tokens represent adjacent powers of
ten. More reliable arithmetic requires either fine-tuning on arithmetic data or tool use to
offload calculations.
WordPiece & SentencePiece
BPE isn't the only subword algorithm. Two others deserve mention because they're used by major models you'll encounter in practice.
WordPiece (BERT, DistilBERT, ELECTRA)
WordPiece, developed at Google (Schuster & Nakajima, 2012; Devlin et al., 2018), is structurally
similar to BPE but with a different merge criterion. Instead of merging the most frequent
pair, it merges the pair that maximizes the language model likelihood of the training corpus.
Formally, it prefers merging A and B when the score
freq(AB) / (freq(A) × freq(B)) is highest — choosing pairs that are common together
but individually less common (i.e., collocations that deserve a merged representation).
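A minimal sketch of this scoring rule, with a made-up frequency table, shows how it can disagree with raw pair frequency:

```python
from collections import Counter

def wordpiece_best_pair(word_freqs):
    """Pick a merge by score = freq(AB) / (freq(A) * freq(B))."""
    unit_freq, pair_freq = Counter(), Counter()
    for symbols, freq in word_freqs.items():
        for s in symbols:
            unit_freq[s] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_freq[pair] += freq
    return max(pair_freq,
               key=lambda p: pair_freq[p] / (unit_freq[p[0]] * unit_freq[p[1]]))

# Made-up corpus: each word as a symbol tuple, with its count
word_freqs = {
    ("u", "n", "d", "o"): 6,
    ("u", "n", "i", "t"): 4,
    ("d", "o", "g"): 8,
}
print(wordpiece_best_pair(word_freqs))  # ('i', 't')
```

Plain BPE would merge ('d', 'o') here — the most frequent pair at 14 occurrences — but WordPiece prefers ('i', 't'), which always co-occur relative to how often 'i' and 't' appear individually.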
WordPiece uses a distinctive notation: continuation subwords are prefixed with ##.
So "tokenization" tokenizes as ['token', '##ization'] in BERT. The ##
signals that this piece attaches to a preceding token, not a word boundary.
SentencePiece (T5, LLaMA, Mistral, Gemma)
SentencePiece (Kudo & Richardson, 2018) takes a different architectural approach: it treats the
input as a raw stream of Unicode characters with no pre-tokenization step. Most tokenizers
first split on whitespace/punctuation, then apply BPE within words. SentencePiece operates on the
raw text, treating spaces as regular characters (representing them as ▁).
This makes SentencePiece truly language-agnostic — it works on Japanese, Thai, Arabic, and mixed
code/text without needing language-specific pre-processing rules. The ▁ prefix on
tokens indicates a word-initial position (e.g., ▁token, ization).
SentencePiece supports both BPE and a different algorithm called Unigram Language Model, which starts with a large vocabulary and prunes it down by removing tokens that decrease corpus likelihood least. LLaMA 2 uses SentencePiece BPE with 32,000 tokens; LLaMA 3 switched to tiktoken's BPE with 128,256 tokens to improve multilingual and code coverage.
| Tokenizer | Algorithm | Used by | Continuation marker | Pre-tokenization |
|---|---|---|---|---|
| tiktoken BPE | Byte-level BPE | GPT-2, GPT-3, GPT-4, LLaMA 3 | Ġ (space prefix) | Regex split |
| WordPiece | Likelihood-based BPE | BERT, DistilBERT, ELECTRA | ## (continuation) | Whitespace + punct |
| SentencePiece BPE | BPE on raw text | LLaMA 1/2, Mistral, T5 | ▁ (word start) | None (raw text) |
| SentencePiece Unigram | Unigram LM pruning | T5, mT5, XLNet | ▁ (word start) | None (raw text) |
Special Tokens
Beyond regular vocabulary, every tokenizer includes special tokens — reserved symbols that serve structural roles during training and inference. These are never produced by the BPE merge process; they're explicitly added to the vocabulary.
Common special tokens and their roles:
- <|endoftext|> — GPT's document boundary marker. Training data concatenates documents separated by this token, so the model learns to predict across natural end-of-document positions.
- [CLS], [SEP] — BERT's classification and separator tokens. [CLS] is prepended to every input; its final hidden state is used as a sequence-level representation. [SEP] separates two input sequences in tasks like sentence-pair classification.
- [PAD] — Padding token used to make all sequences in a batch the same length, required for efficient batch processing.
- [MASK] — BERT's masked language modeling token. During training, 15% of tokens are replaced with [MASK] and the model learns to predict the original.
- <|im_start|>, <|im_end|> — ChatML-format role markers used by OpenAI and others to structure conversations in instruction-tuned models.
- <s>, </s> — LLaMA/SentencePiece's beginning- and end-of-sequence tokens.
Chat models' special tokens are particularly important — they define the template that structures
system prompts, user messages, and assistant responses. This is why you can't just feed raw text
to an instruction-tuned model; the chat template (applied by apply_chat_template()
in HuggingFace) wraps your content in the right structural tokens before tokenization.
From Tokens to Vectors
Tokenization gives us a sequence of integer IDs — say, [9906, 11, 1917, 0] for
"Hello, world!". But matrix multiplications don't operate on integers. The model needs
continuous, dense vectors. The embedding layer is the bridge.
One-hot encoding: the naive baseline
The simplest representation: a vector with one dimension per vocabulary token, all zeros except a single 1 at the index of this token. For a vocabulary of 50,257 tokens, every token becomes a 50,257-dimensional sparse vector.
This is computationally wasteful (50,000+ dimensions almost entirely unused), but more fundamentally it encodes no relationships. The cosine similarity between any two distinct one-hot vectors is exactly zero. "cat" and "dog" are as orthogonal as "cat" and "photosynthesis". All semantic structure must be discovered by layers stacked on top, with no useful prior in the representation itself.
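A few lines make the orthogonality concrete:

```python
vocab = ["cat", "dog", "tree", "photosynthesis"]
# One-hot: a 1 at the word's index, 0 everywhere else.
onehot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
          for i, w in enumerate(vocab)}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

# Every distinct pair is exactly orthogonal — zero similarity structure.
print(cosine(onehot["cat"], onehot["dog"]))             # 0.0
print(cosine(onehot["cat"], onehot["photosynthesis"]))  # 0.0
```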
Dense embeddings: learned geometry
The solution is to learn a dense representation. Instead of mapping each token to a sparse identity vector, we map it to a dense vector in a much smaller space — typically 256 to 8,192 dimensions depending on model scale.
This mapping is stored in an embedding matrix E of shape
(vocab_size, d_model). For GPT-2: (50,257, 768), containing roughly 38
million learnable parameters. Looking up a token's embedding is simply indexing into this matrix:
e = E[token_id].
These embeddings aren't initialized with any semantic knowledge — they start as random noise. During training, gradient descent adjusts every embedding such that the model minimizes prediction loss. Tokens that appear in similar contexts end up with similar vectors, because the model learns to process them similarly. This is the distributional hypothesis in action: "You shall know a word by the company it keeps." (Firth, 1957)
Word2Vec: Where Embedding Geometry Began
Word2Vec (Mikolov et al., 2013) isn't used in modern transformer LLMs, but it remains the conceptual foundation for understanding why learned embeddings work. It demonstrated that a neural network trained on a simple prediction task would, as a side effect, learn vectors with remarkable geometric structure.
Word2Vec comes in two flavors with a key architectural difference:
CBOW — Continuous Bag of Words
Given the surrounding context words, predict the target word. For the sentence "The quick brown [fox] jumps over the lazy dog":
Input: ["The", "quick", "brown", "jumps", "over"] → Predict: "fox"
CBOW is faster to train and works better for frequent words. The "bag of words" part means context word order is ignored — only their averaged embeddings matter.
Skip-gram
The inverse: given a target word, predict each of the surrounding context words.
Input: "fox" → Predict: ["The", "quick", "brown", "jumps", "over"]
Skip-gram trains slower but produces better embeddings for rare words. By training to predict multiple contexts from a single center word, the model is forced to encode rich co-occurrence information into each vector.
Why the task creates meaningful geometry
Here's the insight: the model has no capacity to memorize each word's individual contexts — it must generalize. Two words that appear in nearly identical contexts ("cat" and "dog" both appear near "pet", "fur", "cute", "feeding", etc.) must end up with nearly identical embeddings, because they need to produce similar context predictions. The geometry of the embedding space directly reflects the distributional statistics of the corpus.
This produces Word2Vec's famous analogy property:
king − man + woman ≈ queen
Paris − France + Germany ≈ Berlin
walking − walk + swim ≈ swimming
The "king − man + woman" operation subtracts the "male" direction from "king" and adds the "female" direction, landing near "queen". This isn't explicitly programmed — it emerges from the distributional statistics of English text.
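The arithmetic itself is easy to demonstrate. The vectors below are hand-built in 3 dimensions purely for illustration — real Word2Vec embeddings are learned, hundreds of dimensions wide, and the analogy only holds approximately:

```python
import math

# Contrived axes for illustration: [royalty, maleness, fruitiness]
vecs = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# king − man + woman, then find the nearest word (excluding the query word)
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
best = max((w for w in vecs if w != "king"), key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```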
Simplified 2D projection of word embeddings: words cluster by semantic category, and the king − man + woman ≈ queen analogy appears as a consistent offset between clusters.
Transformer Token Embeddings
In a transformer, the embedding layer is a straightforward lookup table: an nn.Embedding
module in PyTorch that wraps a learnable matrix. The forward pass is just a gather operation —
index into the matrix by token ID to retrieve the corresponding row.
import torch
import torch.nn as nn
vocab_size = 50_257 # GPT-2 vocabulary size
d_model = 768 # GPT-2 base hidden dimension
# The embedding matrix: shape (50257, 768)
# ~38.6M learnable parameters
token_embedding = nn.Embedding(vocab_size, d_model)
# Lookup: shape (batch, seq_len) → (batch, seq_len, d_model)
token_ids = torch.tensor([[9906, 11, 1917, 0]]) # "Hello, world!"
x = token_embedding(token_ids) # [1, 4, 768]
# Equivalent to direct matrix indexing:
assert torch.allclose(x, token_embedding.weight[token_ids])
A critical implementation detail: most large language models use weight tying. The embedding matrix E (input) and the unembedding matrix U (the final linear layer that projects back to vocabulary logits) share the same weights.
This has two motivations. First, it halves the parameter count at the vocabulary interface — for GPT-3, this saves ~617M parameters. Second, there's a conceptual justification: if token A should predict token B, then A and B should be similar in embedding space. Using the same matrix for both encoding and decoding enforces this consistency.
With weight tying, the output logits are computed as logits = h · Eᵀ, where h is the final hidden state and E is the same matrix used for token lookup. The predicted probability of next token j is then softmax(logits)[j].
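A minimal PyTorch sketch of weight tying (the sizes are made up; HuggingFace models expose the same behavior via the tie_word_embeddings config flag):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
token_emb = nn.Embedding(vocab_size, d_model)

# The unembedding projection reuses the embedding matrix: one shared
# (vocab_size, d_model) parameter tensor, not a copy.
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = token_emb.weight

hidden = torch.randn(2, 8, d_model)  # final hidden states (batch, seq, d_model)
logits = lm_head(hidden)             # (batch, seq, vocab_size)
print(logits.shape)
assert lm_head.weight is token_emb.weight  # same tensor
```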
The Position Problem
Token embeddings encode what each token is, but not where it appears in the sequence. This is a fundamental problem for transformers.
Self-attention is permutation-equivariant: if you shuffle the input tokens, the attention outputs shuffle in exactly the same way. The model has no intrinsic sense of order. "The cat sat on the mat" and "mat the on sat cat The" would, without positional information, produce identical attention patterns and identical intermediate representations for each token.
The original transformer paper (Vaswani et al., 2017) solved this by adding a positional encoding vector to each token embedding before feeding into the first attention layer. Different positions get different positional vectors, and the model learns to use this positional information to understand sequence order.
Sinusoidal Positional Encoding
Vaswani et al.'s original solution uses a fixed (non-learned) sinusoidal formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where pos is the token's position in the sequence (0, 1, 2, …) and i
is the dimension index. Even dimensions use sine, odd dimensions use cosine.
The intuition: think of position as a number written in a mixed-frequency numeral system. Each
dimension pair (2i, 2i+1) operates at a different frequency — low dimensions change
slowly (long wavelength, capturing coarse position), high dimensions change quickly (short
wavelength, capturing fine position). Together, they form a unique "fingerprint" for every
position up to sequence lengths far beyond training data.
Key properties of sinusoidal PE:
- No learned parameters — entirely determined by the formula.
- Bounded — all values in [-1, 1], same scale as embeddings.
- Relative positions have a consistent linear relationship — there exists a linear transformation T such that PE(pos+k) = T · PE(pos) for any offset k.
- Extrapolates beyond training length — any position can be encoded, even positions never seen during training.
Each column is a position (0–127), each row is a dimension. Notice how low dimensions oscillate slowly (capturing coarse position) while high dimensions oscillate quickly (capturing fine position).
Rotary Positional Embeddings (RoPE)
The problem with additive positional encodings (both sinusoidal and learned) is subtle but consequential: after adding position to the token embedding, the two signals are entangled. The attention mechanism computes dot products between query and key vectors:
qm · kn, where qm = WQ(xm + pm) and kn = WK(xn + pn)
When position is added to the embedding, the query-key dot product at positions m and n mixes
absolute position signals in a way that makes relative position reasoning non-trivial. Ideally,
what we want is for the dot product qm · kn to depend only
on the relative position m - n, not on the absolute values of m and n.
RoPE (Su et al., 2021, "RoFormer") achieves exactly this by encoding position as a rotation applied to query and key vectors, rather than an addition to the embedding.
The key insight: if you rotate a query vector by angle θm and a key vector by angle θn, their dot product depends only on the difference θm − θn (relative position), because rotations preserve dot products up to the relative angle:
(R(θm) q) · (R(θn) k) = q · (R(θn − θm) k)
The rotation is applied in pairs of dimensions. For each pair of dimensions (2i, 2i+1) at position m:
x′2i = x2i cos(m·θi) − x2i+1 sin(m·θi)
x′2i+1 = x2i sin(m·θi) + x2i+1 cos(m·θi)
where θi = 10000^(−2i/d)
The rotation frequency θi follows the same 10000^(−2i/d) schedule as the original sinusoidal PE — low dimensions rotate slowly (large wavelength, capturing coarse relative position), high dimensions rotate quickly.
RoPE's advantages:
- Relative position is baked in — attention scores naturally depend on m−n, not m and n individually.
- Better length generalization — models trained on shorter sequences can attend more gracefully to longer ones (with techniques like RoPE scaling / YaRN).
- No extra parameters — like sinusoidal PE, RoPE is determined by the formula.
- Widely adopted — LLaMA (all versions), Mistral, Gemma, Falcon, GPT-NeoX, and most recent models use RoPE.
Each token's query/key vector is rotated by an angle proportional to its position; the relative angle between any two positions stays constant.
Embedding Dimensions & Model Scale
The embedding dimension d_model is one of the primary architectural hyperparameters
of a transformer, and it determines the width of the network's information highway. Every intermediate
representation — queries, keys, values, feed-forward activations — is expressed in this space.
Conventional wisdom holds that d_model should be a multiple of 64 (for efficient
memory alignment on GPUs) and that d_model = n_heads × 64 is a common choice,
giving each attention head 64 dimensions for its query/key/value projections.
| Model | d_model | n_heads | n_layers | Vocab size | Embedding params |
|---|---|---|---|---|---|
| BERT-base | 768 | 12 | 12 | 30,522 | ~23M |
| GPT-2 base | 768 | 12 | 12 | 50,257 | ~39M |
| GPT-2 XL | 1,600 | 25 | 48 | 50,257 | ~80M |
| GPT-3 175B | 12,288 | 96 | 96 | 50,257 | ~617M |
| LLaMA 2 7B | 4,096 | 32 | 32 | 32,000 | ~131M |
| LLaMA 3 8B | 4,096 | 32 | 32 | 128,256 | ~525M |
| LLaMA 2 70B | 8,192 | 64 | 80 | 32,000 | ~262M |
A few observations from this table:
- LLaMA 3's jump from 32K to 128K vocabulary quadrupled the embedding table to ~525M params — a significant fraction of the model's total. Larger vocabularies improve tokenization efficiency (fewer tokens per document = lower inference cost) but increase memory at the vocabulary interface.
- GPT-3's embedding table alone (~617M params) is larger than many smaller models. With weight tying, the unembedding layer is free, but the vocabulary embedding is a real cost.
- Larger d_model gives each attention head more expressive capacity, but compute in the feed-forward layers grows roughly quadratically with d_model (each FFN block projects d_model → 4·d_model and back).
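The embedding-parameter column is just vocab_size × d_model; a quick check against a few rows of the table:

```python
# (vocab_size, d_model) pairs from the table above
configs = {
    "GPT-2 base": (50_257, 768),
    "GPT-3 175B": (50_257, 12_288),
    "LLaMA 2 7B": (32_000, 4_096),
    "LLaMA 3 8B": (128_256, 4_096),
}
for name, (vocab_size, d_model) in configs.items():
    params = vocab_size * d_model
    print(f"{name:12s}: {params / 1e6:6.1f}M embedding parameters")
# GPT-2 base  :   38.6M
# GPT-3 175B  :  617.6M
# LLaMA 2 7B  :  131.1M
# LLaMA 3 8B  :  525.3M
```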
The embedding dimension determines how much information can flow between layers. If d_model
is too small, the representations become an information bottleneck — the model can't retain
all the contextual nuance needed for complex reasoning. Scaling laws research (Kaplan et al., 2020;
Hoffmann et al., 2022) shows that increasing model width (d_model, n_heads) and depth (n_layers)
improves performance predictably — but optimal compute allocation matters. Chinchilla showed that
many earlier models were undertrained for their size.
Code Examples
Theory is best anchored by working code. Here are practical examples covering the full tokenization pipeline using tiktoken (OpenAI's tokenizer) and HuggingFace transformers.
tiktoken — GPT-4 tokenization
import tiktoken
# cl100k_base is used by GPT-4, GPT-3.5-turbo, and text-embedding-ada-002
enc = tiktoken.get_encoding("cl100k_base")
text = "Tokenization is surprisingly fascinating. Let's explore."
tokens = enc.encode(text)
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")
# Inspect each token
for tid in tokens:
token_bytes = enc.decode_single_token_bytes(tid)
print(f" {tid:6d} {token_bytes!r}")
# Decode back to string
assert enc.decode(tokens) == text
# ── Comparing tokenization across models ──────────────────
gpt2_enc = tiktoken.get_encoding("gpt2") # 50,257 tokens
gpt4_enc = tiktoken.get_encoding("cl100k_base") # 100,277 tokens
gpt4o_enc = tiktoken.get_encoding("o200k_base") # 200,019 tokens
code_sample = "def transformer_block(x, attn, ffn):\n return ffn(x + attn(x))"
for name, enc in [("gpt2", gpt2_enc), ("gpt4", gpt4_enc), ("gpt4o", gpt4o_enc)]:
n = len(enc.encode(code_sample))
print(f"{name:8s}: {n} tokens")
# gpt2 : 22 tokens
# gpt4 : 19 tokens (better code tokenization in cl100k)
# gpt4o : 16 tokens (even better in o200k)
HuggingFace tokenizers — comparing BERT, GPT-2, LLaMA
from transformers import AutoTokenizer
# GPT-2 (byte-level BPE, Ġ = space prefix)
gpt2 = AutoTokenizer.from_pretrained("gpt2")
print("GPT-2:", gpt2.tokenize("tokenization is fascinating"))
# ['token', 'ization', 'Ġis', 'Ġfascinating']
# BERT (WordPiece, ## = continuation)
bert = AutoTokenizer.from_pretrained("bert-base-uncased")
print("BERT:", bert.tokenize("tokenization is fascinating"))
# ['token', '##ization', 'is', 'fascinating']
# LLaMA 2 (SentencePiece BPE, ▁ = word start)
llama = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print("LLaMA:", llama.tokenize("tokenization is fascinating"))
# ['▁token', 'ization', '▁is', '▁fas', 'cin', 'ating']
# ── Chat templates ─────────────────────────────────────────
# Modern instruction-tuned models wrap messages in special tokens.
# NEVER skip apply_chat_template for chat models.
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is a transformer?"},
]
# Using a model that supports chat templates (e.g. Llama-3-8B-Instruct)
# chat_enc = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# formatted = chat_enc.apply_chat_template(messages, tokenize=False)
# print(formatted)
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
# You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
# What is a transformer?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Sinusoidal positional encoding in PyTorch
import torch
import math
def sinusoidal_positional_encoding(max_seq_len: int, d_model: int) -> torch.Tensor:
"""
Returns PE matrix of shape (max_seq_len, d_model).
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
"""
PE = torch.zeros(max_seq_len, d_model)
pos = torch.arange(max_seq_len, dtype=torch.float).unsqueeze(1) # (L, 1)
div = torch.exp(
torch.arange(0, d_model, 2, dtype=torch.float) *
(-math.log(10000.0) / d_model)
) # (d/2,)
PE[:, 0::2] = torch.sin(pos * div) # even dims
PE[:, 1::2] = torch.cos(pos * div) # odd dims
return PE # (max_seq_len, d_model)
# ── RoPE implementation ────────────────────────────────────
def apply_rope(x: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
"""
Apply Rotary Positional Embedding to queries or keys.
Args:
x: (batch, heads, seq_len, head_dim)
positions: (batch, seq_len) — position indices
Returns:
x with RoPE applied, same shape as input.
"""
batch, heads, seq_len, head_dim = x.shape
assert head_dim % 2 == 0
# Compute rotation frequencies (same as sinusoidal)
i = torch.arange(0, head_dim, 2, dtype=torch.float, device=x.device)
theta = 1.0 / (10000.0 ** (i / head_dim)) # (head_dim/2,)
# angles[b, s, i] = positions[b, s] * theta[i]
angles = positions.float().unsqueeze(-1) * theta # (batch, seq_len, head_dim/2)
sin = angles.sin().unsqueeze(1) # (batch, 1, seq_len, head_dim/2)
cos = angles.cos().unsqueeze(1) # (batch, 1, seq_len, head_dim/2)
# Split into even and odd dimensions
x_even = x[..., 0::2] # (batch, heads, seq_len, head_dim/2)
x_odd = x[..., 1::2]
# Apply rotation: [x_even, x_odd] → [x_even*cos - x_odd*sin, x_even*sin + x_odd*cos]
x_rot_even = x_even * cos - x_odd * sin
x_rot_odd = x_even * sin + x_odd * cos
# Interleave back
x_out = torch.stack([x_rot_even, x_rot_odd], dim=-1)
return x_out.flatten(-2) # (batch, heads, seq_len, head_dim)
# ── Full embedding layer (token + position) ────────────────
import torch.nn as nn

class TransformerEmbedding(nn.Module):
def __init__(self, vocab_size: int, d_model: int, max_seq_len: int = 2048,
dropout: float = 0.1, use_learned_pos: bool = True):
super().__init__()
self.token_emb = nn.Embedding(vocab_size, d_model)
self.use_learned_pos = use_learned_pos
if use_learned_pos:
# GPT-2 style: learned positional embedding
self.pos_emb = nn.Embedding(max_seq_len, d_model)
else:
# Original transformer: fixed sinusoidal
pe = sinusoidal_positional_encoding(max_seq_len, d_model)
self.register_buffer('pos_enc', pe) # not a parameter
self.dropout = nn.Dropout(dropout)
self.d_model = d_model
def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
# token_ids: (batch, seq_len)
seq_len = token_ids.shape[1]
tok = self.token_emb(token_ids) # (B, L, d_model)
if self.use_learned_pos:
positions = torch.arange(seq_len, device=token_ids.device)
pos = self.pos_emb(positions) # (L, d_model)
else:
pos = self.pos_enc[:seq_len] # (L, d_model)
# Scale token embeddings by sqrt(d_model) — from original paper
return self.dropout(tok * math.sqrt(self.d_model) + pos)
Exploring vocabulary structure
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
print(f"Vocabulary size: {enc.n_vocab:,}") # 100,277
# ── How are numbers tokenized? ─────────────────────────────
# Single-digit numbers are always 1 token.
# cl100k's pre-tokenizer splits longer numbers into chunks of at most
# 3 digits, which don't align with place-value (comma) grouping.
for n in [1, 12, 123, 1234, 12345, 123456, 1234567]:
toks = enc.encode(str(n))
pieces = [enc.decode([t]) for t in toks]
print(f"{n:>8,} → {pieces} ({len(toks)} token{'s' if len(toks)>1 else ''})")
# 1 → ['1'] (1 token)
# 1,234 → ['123', '4'] (2 tokens)
# 12,345 → ['123', '45'] (2 tokens) ← split ignores comma grouping!
# 1,234,567 → ['123', '456', '7'] (3 tokens)
# ── Tokenization pitfalls ──────────────────────────────────
# Counting characters in tokens is non-trivial for the model
# because each token is a different length
word = "strawberry"
toks = enc.encode(word)
for t in toks:
print(repr(enc.decode([t]))) # 'str', 'aw', 'berry'
# 3 tokens — the model can't easily "see" that there are 3 r's!
# ── Multilingual tokenization efficiency ──────────────────
texts = {
"English": "The quick brown fox",
"Spanish": "El rápido zorro marrón",
"Chinese": "快速的棕色狐狸",
"Arabic": "الثعلب البني السريع",
"Code": "def quick_sort(arr): return sorted(arr)",
}
for lang, text in texts.items():
n = len(enc.encode(text))
print(f"{lang:12s}: {n:3d} tokens (for: {text!r})")
# English : 5 tokens — 1 token/word (efficient)
# Spanish : 8 tokens — more tokens for diacritics
# Chinese : 10 tokens — ~3 bytes/char → more tokens
# Arabic : 11 tokens — similar pattern
# Code : 9 tokens — common syntax patterns get dedicated tokens
The multilingual efficiency gap is one reason newer models like LLaMA 3 expanded their vocabulary from 32K to 128K tokens — common Chinese, Japanese, Korean, and Arabic sequences can now get dedicated tokens rather than fragmenting into many byte-level pieces.
Now that we understand how text becomes vectors, the next question is: how do those vectors interact? In Article 02: Attention & Transformer Blocks, we'll dissect the self-attention mechanism — why it works, how queries/keys/values actually compute attention weights, and how the full transformer block assembles these primitives into a system capable of learning language.
References
Seminal papers and key works referenced in this article.
- Sennrich et al. "Neural Machine Translation of Rare Words with Subword Units." ACL, 2016. arXiv
- Kudo & Richardson. "SentencePiece: A simple and language independent subword tokenizer." EMNLP, 2018. arXiv
- Mikolov et al. "Efficient Estimation of Word Representations in Vector Space." ICLR Workshop, 2013. arXiv
- Vaswani et al. "Attention Is All You Need." NeurIPS, 2017. arXiv
- Su et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." 2021. arXiv
- Radford et al. "Language Models are Unsupervised Multitask Learners." OpenAI, 2019.