BERT (Devlin 2019)

Chapter 0: Why Bidirectional?

Imagine you're reading the sentence: "The man went to the bank to deposit his check." What does "bank" mean? A financial institution, obviously. But now consider: "The man went to the bank to catch some fish." Same word, completely different meaning. How did you know? Because you read the entire sentence — including the words after "bank."

Before BERT, the dominant language model paradigm was left-to-right. Models like GPT read text strictly from left to right — when processing "bank," they could only see "The man went to the" — the five words before it. The critical disambiguating context ("deposit his check" vs "catch some fish") comes after the word, so a left-to-right model is flying blind.

This is a fundamental limitation. In natural language, meaning flows in both directions. The subject constrains the verb, but the verb also constrains the subject. The object constrains the preposition, but the preposition also constrains the object. A truly capable language understanding system needs to see context from both sides simultaneously.

The core insight of BERT: Language understanding is fundamentally different from language generation. Generation must be left-to-right (you can't generate a word that depends on future words you haven't generated yet). But understanding has no such constraint — when you're classifying sentiment or answering questions, you have the entire input available. Why not use all of it?

Previous attempts at bidirectionality existed. ELMo (Peters et al., 2018) trained two separate LSTMs — one left-to-right and one right-to-left — and concatenated their hidden states. But this is "shallow" bidirectionality: the two directions never interact during encoding. The left-to-right LSTM doesn't know what the right-to-left LSTM is thinking, and vice versa.

What we really want is deep bidirectionality — a model where, at every layer, every token can attend to every other token in both directions. The Transformer encoder architecture provides exactly this: self-attention has no inherent directionality. Token 5 can attend to token 3 (left context) and token 7 (right context) with equal ease. The Transformer is naturally bidirectional — the bottleneck was never the architecture, but the training objective.
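The difference in visible context can be sketched in a few lines of plain Python. The helper `visible_context` is a hypothetical name introduced here for illustration; it only lists which positions an attention mask would expose for one token, and does not compute attention itself.

```python
def visible_context(tokens, idx, causal):
    """Return the tokens that position `idx` may attend to, excluding itself.

    causal=True  mimics a left-to-right (decoder-style) attention mask.
    causal=False mimics an unmasked (encoder-style) attention pattern.
    """
    if causal:
        return tokens[:idx]
    return tokens[:idx] + tokens[idx + 1:]


sentence = "The man went to the bank to deposit his check".split()
bank = sentence.index("bank")

left_only = visible_context(sentence, bank, causal=True)
both_sides = visible_context(sentence, bank, causal=False)

print(left_only)   # context a left-to-right model has for "bank"
print(both_sides)  # context a bidirectional encoder has for "bank"
```

Only the second list contains "deposit his check", the words that disambiguate "bank".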

Unidirectional vs Bidirectional Context

Click on a word to see what context a left-to-right model (top) vs a bidirectional model (bottom) can use. Notice how the bidirectional model sees the full sentence, resolving ambiguity that the unidirectional model cannot.


The problem is: how do you train a deep bidirectional model? You can't use standard language modeling (predict the next token) because in a bidirectional model, the "answer" would leak through the attention mechanism. If every token can attend to every other token, and you're trying to predict token 5, the model can simply look at position 5 and read the answer. The training signal would be trivial and the model would learn nothing.

This is the catch-22 that blocked bidirectional pre-training for years. GPT chose left-to-right and sacrificed bidirectionality. ELMo used shallow bidirectionality as a compromise. BERT found a different solution entirely: masking.

Instead of predicting the next token, BERT randomly masks some tokens and asks the model to predict them from context. Since the masked token is replaced with a special [MASK] placeholder, the model can't cheat — it must genuinely use the surrounding context (from both directions) to reconstruct the missing word. This is the Masked Language Model (MLM) objective, and it's the key innovation that makes deep bidirectional pre-training possible.

Approach	Direction	Context for "bank"	Architecture
GPT	Left-to-right	"The man went to the"	Transformer decoder
ELMo	Shallow bidir	Two LSTMs, never interact	Stacked biLSTM
BERT	Deep bidir	Entire sentence, all layers	Transformer encoder

Why "BERT"? Bidirectional Encoder Representations from Transformers. The name captures the three key ideas: (1) bidirectional — sees both left and right context, (2) encoder — uses the Transformer encoder (not decoder), (3) representations — produces general-purpose features that can be fine-tuned for any task.

BERT's impact was immediate and devastating to the status quo. It set new state-of-the-art results on 11 NLP benchmarks simultaneously upon release, improving the best previous results by large margins (e.g., 7.7 points on GLUE, 5.1 F1 on SQuAD 2.0). It showed that a single pre-trained model, fine-tuned with minimal task-specific modifications, could outperform complex, task-specific architectures that had been engineered over years. This was the "ImageNet moment" for NLP.

Why can't you train a bidirectional Transformer with a standard language modeling (predict-next-token) objective?

Because in a bidirectional model, every token can attend to every other token: if you're predicting token 5, the model can look at position 5 and read the answer directly, making the training signal trivial
Because bidirectional models are too slow to train on next-token prediction
Because Transformers can only process text in one direction at a time

Chapter 1: Masked Language Model (MLM)

BERT's primary pre-training objective is the Masked Language Model (MLM). The idea is deceptively simple: randomly mask some percentage of the input tokens, then train the model to predict the original tokens from the corrupted input.

Specifically, for each training sequence, BERT randomly selects 15% of the token positions for prediction. But there's a subtle twist in how these selected tokens are treated:

80% of selected tokens

Replace with [MASK] token. "The cat sat" → "The [MASK] sat"

↓

10% of selected tokens

Replace with a random token. "The cat sat" → "The pizza sat"

↓

10% of selected tokens

Keep unchanged. "The cat sat" → "The cat sat"

Why this 80/10/10 split? It addresses a critical problem: the mismatch between pre-training and fine-tuning.

During fine-tuning, the model never sees the [MASK] token — real inputs don't have masks. If the model learned to rely on the presence of [MASK] as a signal ("oh, there's a mask, I should predict something"), that skill wouldn't transfer to fine-tuning. The 10% random replacement forces the model to stay uncertain about which tokens are corrupted, so it must maintain a good representation of every token. The 10% unchanged tokens ensure the model learns that even "correct-looking" tokens may need prediction.
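The 80/10/10 rule can be sketched as follows. This is a simplified illustration, not the original preprocessing code: `mask_tokens` is a hypothetical helper, each position is selected independently with probability 0.15 (rather than selecting exactly 15% of positions), the toy vocabulary is made up, and special tokens such as [CLS] and [SEP] are not handled.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply BERT-style corruption; return (corrupted_tokens, labels).

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would only be computed where labels[i] is not None.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                          # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "pizza", "dog", "ran"]
tokens = "the cat sat on the mat".split() * 50   # 300 tokens, so the rate is visible
corrupted, labels = mask_tokens(tokens, vocab, rng)
selected = [i for i, lab in enumerate(labels) if lab is not None]
print(len(selected), "of", len(tokens), "positions selected")
```

Note that a selected position does not always look corrupted: in the "unchanged" case the input token is the original one, yet the model is still asked to predict it.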

The 15% masking rate is a balance. Mask too few tokens (say 5%) and each training step provides very little signal, because you're only learning from 5% of the sequence. Mask too many (say 50%) and less context is left to make meaningful predictions, so the task becomes much harder. 15% is the rate Devlin et al. settled on as a workable compromise between learning signal and task difficulty, not a proven optimum. Later work explored alternatives: SpanBERT masks contiguous spans, and XLNet replaces masking with a permutation-based objective.

The MLM loss is simply the cross-entropy loss between the model's predicted distribution and the true token, summed only over the masked positions:

L_MLM = -\sum_{i \in \text{masked}} \log p(x_i \mid x_{\setminus \text{masked}})

Where x_i is the true token at position i, x_{\setminus masked} is the corrupted input (the sequence with the selected tokens replaced), and p(x_i | x_{\setminus masked}) is the model's predicted probability for the correct token at position i.
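To make the formula concrete, here is a minimal pure-Python sketch that evaluates it on toy logits. The function names `log_softmax` and `mlm_loss` and the 4-token vocabulary are assumptions made for illustration; a real implementation would use a tensor library.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def mlm_loss(logits_per_position, labels):
    """Sum of -log p(true token) over positions whose label is not None."""
    total = 0.0
    for logits, label in zip(logits_per_position, labels):
        if label is None:
            continue                      # unselected position: no loss term
        total -= log_softmax(logits)[label]
    return total


# Toy vocabulary of 4 token ids; only position 1 was selected for prediction.
logits = [
    [0.0, 0.0, 0.0, 0.0],   # position 0 (ignored)
    [2.0, 0.5, 0.1, -1.0],  # position 1, true token id is 0
    [0.0, 0.0, 0.0, 0.0],   # position 2 (ignored)
]
labels = [None, 0, None]
print(round(mlm_loss(logits, labels), 3))

# A model that predicts uniformly over BERT's 30,522-token vocabulary
# pays ln(30522) per selected position.
uniform = mlm_loss([[0.0] * 30522], [0])
print(round(uniform, 2))
```

The uniform case gives a useful reference point: a loss near ln(30522) means the model is doing no better than guessing.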

Let's walk through a concrete example. Take the sentence "The cat sat on the mat" with 6 tokens. With 15% masking, we'd mask about 1 token (0.15 × 6 ≈ 1). Say we select "cat" (position 2):

python
# MLM: one training step (sketch; random logits stand in for a real model)
import torch
import torch.nn.functional as F

VOCAB_SIZE = 30522

# Original tokens: [CLS] The cat sat on the mat [SEP]
input_ids = [101, 1996, 4937, 2938, 2006, 1996, 13523, 102]

# Masking "cat" (position 2): with 80% prob, replace with [MASK]=103
masked_input = [101, 1996, 103, 2938, 2006, 1996, 13523, 102]

# labels: -100 means "don't compute loss here"; only position 2 has a target
labels = torch.tensor([[-100, -100, 4937, -100, -100, -100, -100, -100]])

# Forward pass: a real run would call model(torch.tensor([masked_input])),
# which returns logits of shape [1, 8, 30522]. Random values stand in here.
logits = torch.randn(1, len(masked_input), VOCAB_SIZE)

# Cross-entropy over all positions; ignore_index skips the -100 labels,
# so only position 2 (true id 4937 for "cat") contributes.
loss = F.cross_entropy(
    logits.view(-1, VOCAB_SIZE),   # [8, 30522]
    labels.view(-1),               # [8]
    ignore_index=-100,
)
# An untrained model that predicts uniformly scores ln(30522) ≈ 10.3;
# the loss falls as training progresses.

Notice that the model receives the full bidirectional context ("The [MASK] sat on the mat") and must predict "cat" from that context. It sees "sat" and "on the mat" to the right, and "The" to the left. Both directions contribute to the prediction. This is exactly the deep bidirectional conditioning that GPT-style models can't do.

Masked Language Model Simulator

Type a sentence and click "Mask & Predict" to see which tokens get masked (15% rate) and what a model might predict. The bar chart shows the model's confidence distribution over candidate words. Try different sentences to see how context from both sides helps prediction.


Connection to denoising autoencoders

MLM is closely related to the denoising autoencoder. The input is corrupted (masking), passed through the model, and the model must recover the original. Devlin et al. cite this line of work (Vincent et al., 2008) directly. The key differences are that BERT operates on discrete tokens (not continuous pixels), uses cross-entropy loss (not reconstruction error), and predicts only the masked positions instead of reconstructing the entire input.

The denoising perspective also explains why MLM works so well as a pre-training objective: by learning to reconstruct corrupted text, the model must build rich internal representations that capture syntax, semantics, and world knowledge. A model that can correctly predict "[MASK] sat on the mat" → "cat" must know that cats sit on mats, that "sat" is past tense, that articles precede nouns — all learned implicitly from the reconstruction task.

Computational cost of 15% masking

One downside of MLM compared to standard language modeling: only 15% of tokens contribute to the training loss per step. In a left-to-right language model like GPT, every token contributes — the model predicts the next token at every position. This means MLM needs roughly 6-7x more training steps to see the same number of prediction tasks. BERT compensates with a larger batch size (256 sequences) and longer training (1M steps), but this inefficiency is a real cost.
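The "6-7x" figure follows from simple arithmetic. The sketch below assumes full-length 512-token sequences and, as a simplification, counts one prediction per position for the left-to-right model.

```python
batch_size = 256        # sequences per step (from the paper)
seq_len = 512           # tokens per sequence at full length
mask_rate = 0.15        # fraction of positions selected for prediction

tokens_per_step = batch_size * seq_len
mlm_predictions = tokens_per_step * mask_rate   # positions with a loss term
ltr_predictions = tokens_per_step               # one prediction per position

ratio = ltr_predictions / mlm_predictions
print(f"MLM predictions per step: {mlm_predictions:.0f}")
print(f"LTR predictions per step: {ltr_predictions}")
print(f"ratio: {ratio:.2f}")
```

The ratio is 1 / 0.15, independent of batch size and sequence length.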

Why does BERT use the 80/10/10 masking strategy instead of always replacing selected tokens with [MASK]?

To reduce the mismatch between pre-training (which has [MASK] tokens) and fine-tuning (which does not): the 10% random and 10% unchanged tokens force the model to maintain good representations for all tokens, not just masked ones
To make training faster by reducing the number of masked positions
To prevent the model from memorizing the training data

Chapter 2: Next Sentence Prediction (NSP)

Many important NLP tasks — question answering, natural language inference, paraphrase detection — require understanding the relationship between two sentences, not just individual sentence meaning. MLM trains the model to understand tokens in context, but it doesn't explicitly teach the model about inter-sentence relationships.

BERT's second pre-training objective addresses this: Next Sentence Prediction (NSP). The idea is straightforward. Given two sentences A and B, the model must predict whether B actually follows A in the original text (label: IsNext) or whether B is a random sentence from the corpus (label: NotNext).

50% of training pairs

IsNext: B actually follows A in the corpus.
"The cat sat on the mat." → "It purred softly."

↓

50% of training pairs

NotNext: B is a random sentence from the corpus.
"The cat sat on the mat." → "The stock market fell today."

The input format packs both sentences into a single sequence with special tokens:

[CLS] Sentence A tokens [SEP] Sentence B tokens [SEP]

[CLS] is a special classification token prepended to every input. Its final hidden state is used as the "aggregate sequence representation" for classification tasks. [SEP] is a separator token that marks the boundary between sentences.
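A minimal sketch of this packing, together with the segment (token type) ids that mark which sentence each position belongs to. `pack_pair` is a hypothetical helper, and the inputs are assumed to be already tokenized.

```python
def pack_pair(tokens_a, tokens_b):
    """Build '[CLS] A [SEP] B [SEP]' plus matching segment (token type) ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_pair(
    ["the", "cat", "sat"],
    ["it", "purred"],
)
for tok, seg in zip(tokens, segment_ids):
    print(f"{tok:>6}  segment {seg}")
```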

The NSP prediction uses the final hidden state of the [CLS] token, fed through a linear layer with softmax:

p(IsNext | h_[CLS]) = softmax(W · h_[CLS] + b)

Where h_[CLS] is the d-dimensional hidden state of the [CLS] token at the final layer, and W is a 2 × d weight matrix. The loss is binary cross-entropy.

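python
# NSP training example (sketch; random tensors stand in for the encoder)
import torch
import torch.nn as nn
import torch.nn.functional as F

# Positive pair (IsNext), label 1:
pos = ("The cat sat on the mat.", "It purred softly in the sunlight.", 1)
# Negative pair (NotNext), label 0:
neg = ("The cat sat on the mat.", "Quantum entanglement enables teleportation.", 0)
labels = torch.tensor([pos[2], neg[2]])  # [batch=2]

# Input to BERT: [CLS] sent_a [SEP] sent_b [SEP]
# Token type IDs:  0    0...0   0    1...1   1
# The type IDs tell BERT which tokens belong to sentence A vs B

# Forward pass. A real run would take the encoder's final hidden states:
#   hidden = model(input_ids, token_type_ids=token_type_ids)[0]  # [batch, T, 768]
hidden = torch.randn(2, 16, 768)    # stand-in for the encoder output
cls_hidden = hidden[:, 0, :]        # [batch, 768], the [CLS] position
nsp_head = nn.Linear(768, 2)        # W and b from the formula above
nsp_logits = nsp_head(cls_hidden)   # [batch, 2]
nsp_loss = F.cross_entropy(nsp_logits, labels)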

Next Sentence Prediction Visualizer

See how BERT processes a sentence pair for NSP. The [CLS] token aggregates information from both sentences through self-attention to make the IsNext/NotNext prediction. Click "Shuffle B" to swap sentence B for a random one and see the prediction change.


The NSP controversy

NSP was perhaps the most debated design choice in BERT. Later papers showed it may actually hurt performance:

Paper	Finding
RoBERTa (Liu et al., 2019)	Removing NSP matches or slightly improves downstream results when the model is trained on contiguous full-sentence inputs.
ALBERT (Lan et al., 2020)	Replaced NSP with Sentence Order Prediction (SOP) — predict whether A-B or B-A is the correct order. SOP is harder and more useful.
SpanBERT (Joshi et al., 2020)	Removing NSP and using single-sentence inputs with span masking outperforms BERT on most tasks.

The consensus that emerged: NSP is too easy. When the negative pairs are random sentences from different documents, they're so obviously unrelated that the model can solve NSP by detecting topic similarity alone, without learning genuine discourse coherence. SOP (same document, swapped order) forces the model to learn actual sentence ordering, which is a harder and more useful skill.
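The contrast between the two kinds of negative example can be sketched as below. Both helpers are hypothetical names written for this illustration, and the two tiny "documents" are made up; this is not code from either paper.

```python
import random


def nsp_negative(docs, doc_idx, sent_idx, rng):
    """NSP-style negative: sentence A paired with a sentence from another document."""
    other = rng.choice([i for i in range(len(docs)) if i != doc_idx])
    return docs[doc_idx][sent_idx], rng.choice(docs[other])


def sop_negative(docs, doc_idx, sent_idx):
    """SOP-style negative: two consecutive sentences from one document, swapped."""
    return docs[doc_idx][sent_idx + 1], docs[doc_idx][sent_idx]


docs = [
    ["The cat sat on the mat.", "It purred softly."],
    ["The stock market fell today.", "Analysts blamed interest rates."],
]
rng = random.Random(0)
print(nsp_negative(docs, 0, 0, rng))  # topics differ, so the pair is easy to reject
print(sop_negative(docs, 0, 0))       # same topic, only the order is wrong
```

In the SOP case, topic overlap gives no hint, so the model has to attend to ordering cues such as the pronoun "It" appearing before its referent.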

Despite its flaws, NSP was historically important. It introduced the idea that pre-training objectives could explicitly teach inter-sentence reasoning, not just token-level prediction. The [CLS] token paradigm — using a special token's representation for sequence-level tasks — became standard across the field, even in models that dropped NSP itself.

Total pre-training loss

BERT's total pre-training loss is simply the sum of the MLM and NSP losses:

L_total = L_MLM + L_NSP

Both losses are computed on every training example. The MLM loss provides the rich token-level understanding signal, while NSP provides a weaker sentence-level signal. In practice, MLM does most of the heavy lifting — which is why removing NSP doesn't hurt much.

Why did later papers (RoBERTa, SpanBERT) find that NSP doesn't help and may actually hurt performance?

Because NSP requires too much compute, slowing down training
Because bidirectional models can't learn sentence relationships
Because NSP is too easy: random negative sentences are so obviously unrelated that the model learns topic detection instead of genuine discourse coherence, and the two-sentence format may hurt single-sentence task representations

Chapter 3: Pre-training Data

BERT's representations are only as good as the data they're learned from. Devlin et al. used two unlabeled text corpora totaling approximately 3.3 billion words:

Corpus	Size	Description
BooksCorpus	~800M words	11,038 unpublished books from smashwords.com. Long-form, coherent text across many genres. Critical for learning long-range dependencies and narrative structure.
English Wikipedia	~2,500M words	Text content only (no tables, lists, or headers). Covers factual knowledge across all domains. Provides encyclopedic breadth.

By 2018 standards, 3.3 billion words was substantial but not extreme. For context, GPT-2 (released months later) trained on WebText, roughly 40 GB of text from about 8 million web pages, and GPT-3 trained on 300 billion tokens. BERT showed that even with relatively modest data, the right training objective (MLM + bidirectionality) could produce remarkable results.

Why BooksCorpus matters: Wikipedia provides breadth — facts about every topic. But Wikipedia articles are short, encyclopedic, and formulaic. BooksCorpus provides depth — long, narrative text where pronouns refer to characters introduced pages ago, where plot threads connect across chapters, where writing styles vary from literary fiction to romance to sci-fi. This diversity forces BERT to learn genuine language understanding, not just "how to complete a Wikipedia sentence."

Input format and preprocessing

Each training input is a pair of "sentences" (actually text spans) sampled from the corpus. The preprocessing pipeline works as follows:

Step 1: Sample document

Randomly select a document from the corpus

↓

Step 2: Sample sentence A

Pick a random sentence from the document

↓

Step 3: Sample sentence B

50%: actual next sentence. 50%: random sentence from another document

↓

Step 4: Tokenize & mask

WordPiece tokenization, truncate to 512 tokens total, apply 15% MLM masking
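Steps 1 to 3 and the truncation part of step 4 can be sketched as follows. This is an illustration under stated assumptions, not the original preprocessing script: `make_pair` and `truncate_pair` are hypothetical helpers, whitespace splitting stands in for WordPiece, every document is assumed to have at least two sentences, and trimming from the end of the longer side is one simple truncation policy among several.

```python
import random


def make_pair(docs, rng):
    """Return (sentence_a, sentence_b, is_next) following steps 1 to 3."""
    doc_idx = rng.randrange(len(docs))                  # step 1: sample a document
    doc = docs[doc_idx]
    a_idx = rng.randrange(len(doc) - 1)                 # step 2: sentence A (not the last one)
    if rng.random() < 0.5:                              # step 3: 50% true next sentence
        return doc[a_idx], doc[a_idx + 1], True
    other = rng.choice([i for i in range(len(docs)) if i != doc_idx])
    return doc[a_idx], rng.choice(docs[other]), False   # 50% sentence from another document


def truncate_pair(tokens_a, tokens_b, max_tokens=512):
    """Step 4 (truncation only): trim the longer side until the pair fits.

    Three slots are reserved for [CLS] and the two [SEP] tokens.
    """
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    while len(tokens_a) + len(tokens_b) > max_tokens - 3:
        longer = tokens_a if len(tokens_a) >= len(tokens_b) else tokens_b
        longer.pop()
    return tokens_a, tokens_b


docs = [
    ["The cat sat on the mat.", "It purred softly.", "Then it fell asleep."],
    ["The stock market fell today.", "Analysts blamed interest rates."],
]
rng = random.Random(0)
sent_a, sent_b, is_next = make_pair(docs, rng)
tokens_a, tokens_b = truncate_pair(sent_a.split(), sent_b.split(), max_tokens=10)
print(is_next, tokens_a, tokens_b)
```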

The maximum sequence length is 512 tokens (including [CLS] and [SEP] tokens). This was a memory constraint on 2018-era accelerators: longer sequences require quadratically more memory for the attention matrix (512² = 262,144 entries per head per layer). Modern models have pushed this to 2K, 4K, 8K, and beyond using techniques like FlashAttention and sparse attention.

Training efficiency trick

BERT uses a clever training efficiency trick: for the first 90% of training steps, the maximum sequence length is reduced to 128 tokens. Only the final 10% of training uses the full 512-token length. This works because:

1. Short sequences are much cheaper: 128² / 512² = 1/16 the attention cost.

2. Most of language understanding can be learned from local context (nearby words).

3. The final 10% at 512 tokens teaches long-range dependencies without paying full cost for the entire training run.

By a rough estimate, this schedule cut total training time by about 40% with negligible loss in downstream performance; the paper does not report the exact saving.
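The quadratic part of the saving can be worked out directly. The sketch below is a back-of-the-envelope calculation, not a measurement: it counts only the T x T attention-score term per sequence and ignores the projections and feed-forward layers, which scale linearly with length, so the saving for the whole model is smaller than this number suggests.

```python
def relative_attention_cost(short_len=128, full_len=512, short_fraction=0.9):
    """Average cost of the T x T attention-score term per sequence,
    relative to running every step at full length."""
    short_cost = (short_len / full_len) ** 2      # 128^2 / 512^2 = 1/16
    return short_fraction * short_cost + (1 - short_fraction) * 1.0


cost = relative_attention_cost()
print(f"attention-score cost vs. all-512 training: {cost:.3f}")
```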

Pre-training Data Pipeline

Watch how a raw document is processed into BERT training examples. Each step shows the transformation. Click "Process" to animate the pipeline for a new document.


Training hyperparameters

Parameter	Value	Rationale
Batch size	256 sequences	Large batch for stable gradients
Steps	1,000,000	~40 epochs over the 3.3B word corpus
Optimizer	Adam (β₁=0.9, β₂=0.999)	Standard adaptive optimizer
Learning rate	1e-4 (with linear warmup)	10,000 warmup steps, then linear decay
Dropout	0.1 on all layers	Regularization against overfitting
Weight decay	0.01 (L2)	Additional regularization
Hardware	4 Cloud TPUs (16 TPU chips) for BERT-Base; 16 Cloud TPUs (64 TPU chips) for BERT-Large	Each pre-training run took about 4 days
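The learning-rate row describes a warmup-then-decay shape, which can be sketched as a function of the step. `bert_lr` is a hypothetical helper; decaying all the way to zero at the final step is an assumption made here, and the released training code may compute the decay slightly differently.

```python
def bert_lr(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then linear decay (assumed to end at zero)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(step, bert_lr(step))
```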

Why did BERT train with 128-token sequences for 90% of training, switching to 512 tokens only for the final 10%?

Because the model needs to learn short-range patterns before long-range ones
Because attention cost is quadratic in sequence length (128² is 16x cheaper than 512²), so training on short sequences first saves substantial compute while most language understanding transfers from local context
Because the BooksCorpus only has short sentences

Chapter 4: Architecture

BERT uses the Transformer encoder — specifically, just the encoder half of the original Transformer (Vaswani et al., 2017). No decoder, no cross-attention, no autoregressive masking. Pure bidirectional self-attention, stacked into deep layers.

Two model sizes were released:

Parameter	BERT-Base	BERT-Large
Layers (L)	12	24
Hidden size (H)	768	1024
Attention heads (A)	12	16
Head dimension (H/A)	64	64
FFN intermediate	3072 (4×H)	4096 (4×H)
Total parameters	110M	340M
Vocab size	30,522	30,522
Max sequence length	512	512
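The parameter totals in the table can be reproduced, approximately, from the other rows. The sketch below counts weights and biases for the encoder only. Including a pooler (a dense layer applied to the [CLS] state) is an assumption based on common implementations; the pre-training heads are left out.

```python
def bert_param_count(layers, hidden, intermediate, vocab=30_522,
                     max_positions=512, segments=2, pooler=True):
    """Count weights and biases in a BERT-style encoder (no pre-training heads)."""
    layer_norm = 2 * hidden                                    # scale + shift
    embeddings = (vocab + max_positions + segments) * hidden + layer_norm

    attention = 4 * (hidden * hidden + hidden)                 # Q, K, V, output projections
    ffn = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + layer_norm + ffn + layer_norm

    pooler_params = hidden * hidden + hidden if pooler else 0  # dense layer on [CLS]
    return embeddings + layers * per_layer + pooler_params


base = bert_param_count(layers=12, hidden=768, intermediate=3072)
large = bert_param_count(layers=24, hidden=1024, intermediate=4096)
print(f"BERT-Base:  {base:,}")
print(f"BERT-Large: {large:,}")
```

Under these assumptions the counts come to roughly 109.5M and 335.1M, consistent with the rounded 110M and 340M quoted above. The number of attention heads does not appear in the formula, because splitting the hidden size across heads does not change the size of the projection matrices.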

BERT-Base was deliberately sized to match GPT-1 (also 12 layers, 768 hidden, roughly 110M parameters) to enable a like-for-like comparison: same capacity, different training objective (bidirectional MLM vs unidirectional LM). BERT-Base outperformed GPT-1 on every GLUE task, which points to the bidirectional objective, not model size, as the key ingredient. The comparison is not perfectly controlled, since GPT-1 was pre-trained on BooksCorpus alone; the ablations in Chapter 7 isolate the objective more cleanly.

Each Transformer encoder layer

Every layer in BERT follows the same structure:

Multi-Head Self-Attention

Each token attends to all other tokens (both directions). 12 heads, each with d_k=64. Output: [T, 768]

↓ + residual + LayerNorm

Feed-Forward Network

Two linear layers with GELU activation: 768 → 3072 → 768. Applied independently to each token.

↓ + residual + LayerNorm

The residual connections and layer normalization follow the original Transformer design. Each sub-layer (attention and FFN) has a residual connection around it, followed by layer normalization:

output = LayerNorm(x + SubLayer(x))

This is the "post-norm" pattern from the original Transformer. (Later models like GPT-2 switched to "pre-norm": output = x + SubLayer(LayerNorm(x)), which is more stable for deep networks.)

BERT's input representation

BERT's input is constructed from three embedding types, summed together:

e_input = e_token + e_segment + e_position

Embedding	Purpose	Details
Token embedding	Maps each WordPiece token to a vector	Learned, 30,522 × 768
Segment embedding	Distinguishes sentence A from sentence B	Learned, 2 × 768 (only two segments)
Position embedding	Encodes position in the sequence	Learned, 512 × 768 (not sinusoidal)

Unlike the original Transformer, which used fixed sinusoidal position encodings, BERT uses learned position embeddings. Each of the 512 possible positions has its own learned 768-dimensional vector. This means BERT cannot handle sequences longer than 512 tokens: there is no position embedding for position 513. Later position-encoding schemes such as RoPE and ALiBi address this with relative or extrapolatable encodings.

Why segment embeddings matter: When BERT processes "[CLS] The cat sat [SEP] The cat purred [SEP]", the token "cat" has the same token embedding in sentence A and in sentence B. But the two occurrences get different segment embeddings (e_A vs e_B), so the model can tell which sentence each token belongs to. This is essential for tasks like NLI where the two sentences play different roles (premise vs hypothesis).
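A toy version of the embedding sum, in plain Python, shows the effect. The table sizes, the hidden size of 4, and the random initialisation are all stand-ins chosen for readability; real implementations also apply layer normalization and dropout after the sum, which this sketch omits.

```python
import random

rng = random.Random(0)
H = 4  # toy hidden size (BERT-Base uses 768)


def table(rows):
    """A randomly initialised embedding table with `rows` vectors of size H."""
    return [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(rows)]


token_emb = table(10)      # toy vocabulary of 10 token ids
segment_emb = table(2)     # sentence A / sentence B
position_emb = table(8)    # toy maximum length of 8


def embed(token_ids, segment_ids):
    """e_input = e_token + e_segment + e_position, computed per position."""
    return [
        [t + s + p for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[pos])]
        for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))
    ]


# Token id 3 appears in both segments, at positions 1 and 4.
token_ids = [0, 3, 5, 1, 3, 1]
segment_ids = [0, 0, 0, 0, 1, 1]
vectors = embed(token_ids, segment_ids)
print(vectors[1])
print(vectors[4])  # same token id, different segment and position
```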

BERT Architecture Explorer

Click on different parts of the BERT architecture to see data flow and tensor shapes at each stage. The three embedding types (token, segment, position) are summed and fed through L=12 Transformer encoder layers.

GELU activation

BERT was one of the first major models to use the GELU (Gaussian Error Linear Unit) activation function instead of ReLU in the FFN layers:

GELU(x) = x · Φ(x) ≈ 0.5x(1 + tanh[√(2/π)(x + 0.044715x³)])

Where Φ(x) is the standard Gaussian CDF. GELU is smooth (no sharp kink at 0 like ReLU) and stochastically gates the input based on its magnitude. Larger positive values pass through almost unchanged; negative values are suppressed but not completely zeroed (unlike ReLU). This smoother behavior is believed to help optimization in deep Transformer networks.
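Both forms of the formula are easy to evaluate with the standard library, which makes the comparison with ReLU concrete. The function names are chosen here for illustration.

```python
import math


def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """The tanh approximation quoted above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.4f}  "
          f"tanh={gelu_tanh(x):+.4f}  relu={max(x, 0.0):+.4f}")
```

At x = -1 the exact GELU is small but negative, where ReLU would output exactly zero.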

python
import torch
import torch.nn as nn

class BertLayer(nn.Module):
    def __init__(self, H=768, A=12, intermediate=3072):
        super().__init__()
        # Multi-head self-attention
        self.attn = nn.MultiheadAttention(H, A, batch_first=True)
        self.ln1 = nn.LayerNorm(H)
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(H, intermediate),   # 768 → 3072
            nn.GELU(),                     # smooth activation
            nn.Linear(intermediate, H),   # 3072 → 768
        )
        self.ln2 = nn.LayerNorm(H)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        # x: [batch, seq_len, 768]
        attn_out, _ = self.attn(x, x, x)   # self-attention
        x = self.ln1(x + self.dropout(attn_out))  # residual + norm
        ffn_out = self.ffn(x)
        x = self.ln2(x + self.dropout(ffn_out))  # residual + norm
        return x  # [batch, seq_len, 768]

BERT-Base and GPT-1 have nearly identical architectures (12 layers, 768 hidden, 110M parameters). What is the key difference that makes BERT outperform GPT-1?

BERT uses a larger vocabulary
BERT uses more training data
BERT uses bidirectional self-attention with MLM pre-training (seeing both left and right context), while GPT-1 uses unidirectional (left-to-right only) attention with next-token prediction

Chapter 5: WordPiece Tokenization

How do you represent text as numbers for a neural network? The naive approach — one token per word — fails for two reasons. First, the vocabulary would be enormous (English has ~170,000 words in common use, plus names, technical terms, and neologisms). Second, any word not in the vocabulary (an "out-of-vocabulary" or OOV word) can't be processed at all.

BERT uses WordPiece tokenization (Schuster & Nakajima, 2012), a subword method that strikes a balance between character-level and word-level tokenization. The key idea: common words stay as single tokens, but rare words are split into smaller subword pieces. The splits in the table below are illustrative; the actual pieces for a given word depend on the learned vocabulary.

Input Word	WordPiece Tokens	Why
"the"	["the"]	Common word → single token
"cat"	["cat"]	Common word → single token
"playing"	["play", "##ing"]	Splits into stem + suffix
"unbelievable"	["un", "##bel", "##iev", "##able"]	Rare word → subword pieces
"transformers"	["transform", "##ers"]	Stem is common enough
"xyzzy123"	["x", "##y", "##z", "##zy", "##12", "##3"]	Unknown word → character fallback

The "##" prefix indicates a continuation piece — a subword that's not at the start of a word. "play" is a word-initial piece; "##ing" is a continuation. This lets the model distinguish between "playing" (["play", "##ing"]) and "play" + "ing" as separate words.
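At tokenization time, a word is split by greedy longest-match-first lookup against the vocabulary. The sketch below shows the idea with a made-up toy vocabulary; `wordpiece` is a simplified illustration that skips details such as lowercasing, punctuation splitting, and a maximum word length.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1                      # try a shorter candidate
        if match is None:
            return [unk]                  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, chosen only to reproduce two rows of the table above.
vocab = {"the", "cat", "play", "##ing", "transform", "##ers", "##s"}
for w in ("the", "playing", "transformers", "zebra"):
    print(w, "->", wordpiece(w, vocab))
```

With this toy vocabulary "zebra" maps to [UNK] because not even its first character is available; a real vocabulary that contains the individual characters would fall back to character-level pieces instead.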

Why subword tokenization works so well: it all but eliminates OOV words. Even a novel word like "COVID-19" can be split into known pieces, for example ["CO", "##VI", "##D", "-", "19"] (the exact split depends on the vocabulary). The model may not have seen "COVID-19" during training, but it has seen each subword piece in other contexts, so it can compose a reasonable representation. Only characters missing from the vocabulary fall back to [UNK].

How WordPiece vocabulary is built

The WordPiece vocabulary is constructed through an iterative greedy algorithm similar to Byte Pair Encoding (BPE):

Start

Initialize vocabulary with all individual characters (a, b, c, ..., z, 0-9, etc.) plus special tokens ([CLS], [SEP], [MASK], [PAD], [UNK])

↓

Score pairs

For every adjacent pair of tokens in the corpus, compute: score = freq(pair) / (freq(first) × freq(second)). This favors pairs where the combination is more common than expected.

↓

Merge best pair

Add the highest-scoring pair as a new token. "t" + "h" → "th"

↻ Repeat until vocabulary reaches target size (30,522 for BERT)

The key difference between WordPiece and BPE is the scoring function. BPE simply merges the most frequent pair. WordPiece merges the pair with the highest mutual information — the pair where combining them provides the most information beyond what each piece provides alone. This tends to produce linguistically more meaningful subwords.
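One scoring step on a toy corpus shows how the two criteria can disagree. The corpus, the word counts, and the helper `pair_scores` are all invented for illustration, and the score is the simplified formula given above, not a verified description of Google's original training code.

```python
from collections import Counter


def pair_scores(words):
    """Return (scores, pair_freq) for adjacent symbol pairs.

    scores[pair] = freq(pair) / (freq(first) * freq(second)).
    `words` maps a tuple of symbols to how often that word occurs.
    """
    symbol_freq, pair_freq = Counter(), Counter()
    for symbols, count in words.items():
        for s in symbols:
            symbol_freq[s] += count
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += count
    scores = {
        pair: freq / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, freq in pair_freq.items()
    }
    return scores, pair_freq


# Toy corpus: words already split into characters, with ## on continuations.
words = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}
scores, pair_freq = pair_scores(words)
best_by_score = max(scores, key=scores.get)        # WordPiece-style choice
best_by_count = max(pair_freq, key=pair_freq.get)  # BPE-style choice
print("WordPiece-style merge:", best_by_score)
print("BPE-style merge:      ", best_by_count)
```

Here the most frequent pair is ("##u", "##g"), but "##u" is common everywhere, so the normalized score prefers ("##g", "##s"), whose parts occur together more often than their individual frequencies would suggest.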

WordPiece Tokenization Simulator

Type a word and see how WordPiece breaks it into subword tokens. Common words remain whole; rare words are split. The "##" prefix marks continuation pieces.


BERT's vocabulary of 30,522 tokens covers English text efficiently. The average English word is split into 1.1-1.5 WordPiece tokens, so the effective sequence length (in words) is roughly 340-465 words for a 512-token input. This is plenty for most NLP tasks — GLUE benchmarks have average input lengths of 20-60 words.

Special tokens

Token	ID	Purpose
[CLS]	101	Prepended to every input. Its final hidden state is the sequence representation for classification.
[SEP]	102	Separates sentence A from sentence B. Also appended to the end.
[MASK]	103	Replaces tokens during MLM pre-training.
[PAD]	0	Padding for sequences shorter than max_length.
[UNK]	100	Fallback for characters not in the vocabulary (rare with WordPiece).

What advantage does WordPiece tokenization have over word-level tokenization?

WordPiece all but eliminates out-of-vocabulary words: almost any word, even a novel one, can be broken into known subword pieces. It also keeps the vocabulary manageable (30K vs 170K+) while preserving common words as single tokens.
WordPiece is faster because it uses fewer tokens per sentence
WordPiece produces more accurate embeddings because each subword is more meaningful

Chapter 6: Fine-Tuning

This is where BERT's design pays off spectacularly. After pre-training on 3.3 billion words of unlabeled text, BERT can be fine-tuned for virtually any NLP task with minimal architectural changes — often just a single output layer on top of the pre-trained model.

The fine-tuning recipe is always the same:

Step 1: Initialize

Start with all pre-trained weights (12/24 layers + embeddings)

↓

Step 2: Add task head

Add a small task-specific output layer (usually just one linear layer)

↓

Step 3: Fine-tune all

Train the entire model end-to-end on labeled task data. Small learning rate (2-5e-5), few epochs (2-4).

The key insight: all parameters are fine-tuned, not just the task head. The pre-trained weights shift slightly to adapt to the task, but they retain the vast majority of their pre-trained knowledge. This is why fine-tuning works with so little data — the model doesn't need to learn English from scratch; it just needs to learn the task-specific mapping.

Task-specific architectures

BERT handles four major task types with minimal modifications:

1. Sentence classification (SST-2, CoLA)

Use the [CLS] token's final hidden state as the sentence representation. Add a linear layer: h_[CLS] → logits.

p(class | sentence) = softmax(W · h_[CLS] + b)

2. Sentence pair classification (MNLI, QQP, RTE)

Input: [CLS] sentence A [SEP] sentence B [SEP]. Same as above — use h_[CLS] for classification.

3. Question answering (SQuAD)

Input: [CLS] question [SEP] passage [SEP]. For each passage token, predict two things: "is this the start of the answer?" and "is this the end of the answer?"

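p_start(i) = softmax_i(w_start · h_i)

p_end(i) = softmax_i(w_end · h_i)

Each softmax is taken over all passage positions i. A candidate span (i, j) is scored as w_start · h_i + w_end · h_j, and the answer is the highest-scoring span with i ≤ j.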
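The span search itself is a small double loop over start and end scores. In the sketch below, `best_span` is a hypothetical helper, the scores are invented, and the `max_answer_len` cap is an assumption commonly used in practice; it is not part of the formulas above.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximising start_scores[i] + end_scores[j], i <= j."""
    best = None
    for i, s in enumerate(start_scores):
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            score = s + end_scores[j]
            if best is None or score > best[2]:
                best = (i, j, score)
    return best


passage = ["bert", "was", "released", "by", "google", "in", "2018"]
start_scores = [0.1, -1.0, 0.0, 0.2, 2.0, 0.1, 3.0]
end_scores = [0.0, -0.5, 0.1, 0.0, 3.0, 0.3, 1.0]

i, j, score = best_span(start_scores, end_scores)
print(passage[i:j + 1], score)
```

The example is built so that the constraint matters: taken separately, the best start is position 6 and the best end is position 4, which is not a valid span. The constrained search picks the single-token span at position 4 instead.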

4. Named entity recognition (CoNLL-2003)

For each token, predict its entity tag (Person, Organization, Location, or None). Use each token's final hidden state with a linear classifier.

p(tag | token_i) = softmax(W · h_i + b)

python
# Fine-tuning BERT for sentence classification
from transformers import BertModel
import torch.nn as nn

class BertClassifier(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.classifier = nn.Linear(768, num_classes)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        cls_output = outputs.last_hidden_state[:, 0, :]  # [batch, 768]
        cls_output = self.dropout(cls_output)
        logits = self.classifier(cls_output)  # [batch, num_classes]
        return logits

# Fine-tuning hyperparameters (from the paper)
# Learning rate: 2e-5, 3e-5, or 5e-5
# Batch size: 16 or 32
# Epochs: 2, 3, or 4
# That's it — BERT pre-training did the hard work

Fine-Tuning Task Visualizer

Switch between task types to see how BERT's output is adapted for each. Notice that the core BERT model is identical — only the output head changes.

Fine-tuning is remarkably cheap. BERT-Base takes 4 days to pre-train on 16 TPU chips, but Devlin et al. report that every fine-tuning result in the paper can be reproduced in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the same pre-trained model. The pre-training cost is amortized: you pay once, and each additional task is comparatively cheap. Building on ULMFiT and GPT-1, this was one of the clearest early demonstrations that the "pre-train once, fine-tune everywhere" paradigm works across a wide range of NLP tasks.

Fine-tuning results

Benchmark	Previous SOTA	BERT-Large	Improvement
GLUE	72.8	80.5	+7.7 pts
SQuAD 1.1 (F1)	91.7	93.2	+1.5 pts
SQuAD 2.0 (F1)	78.0	83.1	+5.1 pts
MNLI (acc)	82.1	86.7	+4.6 pts
SST-2 (acc)	93.2	94.9	+1.7 pts

For question answering on SQuAD, how does BERT extract the answer span from a passage?

It generates the answer word by word using a decoder
For each token in the passage, two linear classifiers predict the probability of being the answer start and answer end. The answer is the span between the highest-scoring start and end positions (with start ≤ end).
It uses the [CLS] token to predict the answer as a classification over possible spans

Chapter 7: Ablation Studies

One of BERT's greatest contributions was its thorough ablation study. Instead of just presenting results, Devlin et al. systematically removed components to understand what actually matters. These ablations are a masterclass in scientific rigor for ML papers.

Effect of pre-training tasks

What happens when you remove each pre-training objective?

Model	MNLI	QNLI	MRPC	SST-2	SQuAD
BERT (MLM + NSP)	84.4	88.4	86.7	92.7	88.5
No NSP	83.9	84.9	86.5	92.6	87.9
LTR & No NSP (left-to-right, no MLM)	82.1	84.3	77.5	92.1	77.8
LTR & No NSP + BiLSTM	82.1	84.1	75.7	91.6	84.9

The critical comparison is No NSP vs LTR. Both models drop NSP, so the only difference between them is the objective: masked language modeling versus standard left-to-right language modeling. Switching to LTR lowers every score, and the drop is largest on SQuAD, which requires token-level predictions (87.9 → 77.8), and on MRPC (86.5 → 77.5). This is the smoking gun: bidirectionality is the key ingredient.

Adding a randomly initialized BiLSTM on top of the LTR model helps on SQuAD (77.8 → 84.9) but hurts on MRPC (77.5 → 75.7) and doesn't help on MNLI. Shallow bidirectionality (BiLSTM) is a poor substitute for deep bidirectionality (MLM + Transformer encoder).

The strongest ablation finding: Replacing MLM with LTR is far more damaging than removing NSP. MLM → LTR costs 10.1 points on SQuAD (87.9 → 77.8), while removing NSP costs 0.5 points on MNLI (84.4 → 83.9) and 3.5 points on QNLI (88.4 → 84.9). The bidirectional pre-training objective is the soul of BERT; NSP helps on some sentence-pair tasks but matters far less.
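To isolate the effect of the objective, it helps to subtract the LTR row from the No NSP row, since in the paper the LTR model was also trained without NSP. The short script below does that arithmetic using the two rows from the table above; the variable names are only for illustration.

```python
tasks = ["MNLI", "QNLI", "MRPC", "SST-2", "SQuAD"]
no_nsp = [83.9, 84.9, 86.5, 92.6, 87.9]  # MLM objective, no NSP
ltr = [82.1, 84.3, 77.5, 92.1, 77.8]     # left-to-right objective, no NSP

# Points lost on each task when MLM is swapped for LTR.
drops = {t: round(a - b, 1) for t, a, b in zip(tasks, no_nsp, ltr)}

for task, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
    print(f"{task:6s} -{drop:.1f}")
```

The two largest drops are SQuAD (10.1 points) and MRPC (9.0 points); SST-2 barely moves (0.5 points).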

Effect of model size

Does bigger always mean better? Devlin et al. trained models at multiple sizes:

Model	L	H	A	Params	MNLI	MRPC	SST-2
3-layer	3	768	12	~45M	77.9	79.8	88.4
6-layer	6	768	3	~67M	80.6	82.2	90.7
BERT-Base	12	768	12	110M	84.4	86.7	92.9
BERT-Large	24	1024	16	340M	86.6	87.8	93.7

Clear scaling: more layers and wider hidden dimensions help across all tasks. The jump from Base to Large is consistent (MNLI: 84.4 → 86.6, MRPC: 86.7 → 87.8), and it holds even on MRPC, which has only 3,600 labeled training examples. This was early evidence of the trend later formalized in the scaling laws paper (Kaplan et al., 2020): bigger models systematically perform better.

Effect of number of training steps

BERT was trained for 1M steps. What if you train for less?

Devlin et al. compared MNLI accuracy at intermediate checkpoints and found two things. First, the long schedule matters: BERT-Base gains almost 1.0% MNLI accuracy when trained for 1M steps rather than 500k. Second, MLM converges only slightly slower than LTR, and it starts to outperform LTR almost immediately. The slower convergence makes sense: MLM gets a training signal from only 15% of tokens per step (the masked ones), while LTR gets a signal from 100% of tokens (predict next at every position), so MLM sees about 6.7x fewer prediction targets per step (1 / 0.15 ≈ 6.7). The quality of the training signal matters more than the quantity.
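The 15% versus 100% gap is easy to check with a quick simulation. This sketch assumes each position is selected for masking independently with probability 0.15, which approximates but does not exactly reproduce BERT's data pipeline; the function name `mlm_target_fraction` is hypothetical.

```python
import random

def mlm_target_fraction(num_sequences=1000, seq_len=512, mask_prob=0.15, seed=0):
    """Fraction of positions that contribute to the MLM loss."""
    rng = random.Random(seed)
    total = num_sequences * seq_len
    # Assumption: each position is selected independently with prob. mask_prob.
    selected = sum(1 for _ in range(total) if rng.random() < mask_prob)
    return selected / total

frac = mlm_target_fraction()
print(f"MLM targets per token: {frac:.3f}")
print("LTR targets per token: 1.000")  # every token is a prediction target
print(f"LTR / MLM target ratio: {1 / frac:.1f}x")
```

The ratio comes out close to 1 / 0.15, which is where the "6-7x fewer targets per step" figure comes from.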

Feature extraction vs fine-tuning

Must you fine-tune all of BERT, or can you freeze it and just use the representations? Devlin et al. tested BERT-Base as a fixed feature extractor for NER, feeding the frozen activations into a two-layer BiLSTM before the classification layer (Dev F1 on CoNLL-2003):

Strategy	CoNLL NER F1
Fine-tune all layers	96.4
Concat last 4 hidden layers (frozen)	96.1
Weighted sum of last 4 hidden layers (frozen)	95.9
Use second-to-last layer (frozen)	95.6
Use last layer only (frozen)	94.9

Surprisingly, feature extraction comes close to fine-tuning (96.1 vs 96.4). This means BERT's pre-trained representations are already excellent for NER — fine-tuning provides only marginal improvement. For tasks where fine-tuning is expensive or impossible (e.g., you want to use BERT features in a pipeline with other non-differentiable components), feature extraction is a viable alternative.
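The two best frozen strategies differ only in how the last four layers are combined. The sketch below shows the shapes involved for a single token, using toy 3-dimensional vectors in place of 768-dimensional ones. It uses a plain (unweighted) sum for simplicity, whereas the paper's variant is a weighted sum; both function names are hypothetical.

```python
def concat_last_four(hidden_states):
    """Concatenate one token's vectors from the last four layers (4 * H dims)."""
    return [x for layer in hidden_states[-4:] for x in layer]

def sum_last_four(hidden_states):
    """Element-wise sum of one token's vectors from the last four layers (H dims)."""
    return [sum(vals) for vals in zip(*hidden_states[-4:])]

# Toy stand-in: 13 per-layer vectors (embedding output + 12 layers), H = 3.
hidden_states = [[float(layer)] * 3 for layer in range(13)]

concat = concat_last_four(hidden_states)
summed = sum_last_four(hidden_states)
print(len(concat), concat)
print(len(summed), summed)
```

Concatenation keeps each layer's information separate at the cost of a 4x wider input to the downstream model, while summing keeps the original width.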

What the ablations tell us about the field's future: (1) Bidirectionality is worth fighting for: the GPT-style left-to-right approach pays a real cost on understanding tasks. (2) Scale helps monotonically: accuracy kept improving at every size tested, with no sign of a plateau. (3) The pre-training objective matters far more than the specific downstream task architecture. These insights directly shaped RoBERTa, ALBERT, SpanBERT, and the entire BERT family that followed.

According to BERT's ablation studies, which single change causes the largest drop in performance?

Replacing MLM with left-to-right language modeling (LTR): this causes a massive drop, especially on token-level tasks like SQuAD (88.5 → 77.8), proving that bidirectional pre-training is BERT's core contribution
Removing Next Sentence Prediction (NSP)
Reducing the model size from Large to Base

Chapter 8: BERT Explorer

Now let's bring everything together and follow BERT's full pipeline: raw text → tokenization → embedding → 12 Transformer layers → MLM prediction. At each stage, the thing to watch is how information flows through the network and how bidirectional context shapes each token's representation.

What each layer learns

Research into BERT's internal representations (Tenney et al. 2019, "BERT Rediscovers the Classical NLP Pipeline") revealed a remarkable finding: BERT's layers form an implicit processing pipeline that mirrors the traditional NLP stack. The layer ranges below are a rough summary for a 12-layer model; in practice the information for each task is spread across several layers and the boundaries overlap:

Layer Range	What It Captures	NLP Analogue
Layers 0-2	Surface features: word identity, position, basic syntax	POS tagging
Layers 3-5	Syntactic structure: dependency relations, phrase boundaries	Parsing
Layers 6-8	Semantic roles: who did what to whom	SRL
Layers 9-11	Task-specific features: coreference, relations	Coreference resolution, relation extraction

This is striking because nobody taught BERT about POS tags, parse trees, or semantic roles. These representations emerge from self-supervised pre-training alone: the model discovered that building these intermediate representations helps it predict masked words. It's a form of self-supervised feature learning where the features that emerge happen to align with what linguists identified manually over decades.

Attention head specialization

Individual attention heads in BERT learn specialized roles (Clark et al. 2019). Some notable patterns:

Head Type	Behavior	Example
Positional heads	Attend to adjacent positions	Head 2-0: mostly attends to the next token
Separator heads	Attend to [SEP] tokens	Head 6-3: focuses on [SEP] as a "no-op"
Syntactic heads	Track dependency relations	Head 8-10: direct object → verb attention
Coreference heads	Link pronouns to antecedents	Head 5-4: "he" → "John" attention

The [SEP] token as a "null attention" target: Many attention heads learn to attend strongly to [SEP] tokens when they have nothing meaningful to attend to. This is because [SEP] is present in every input, so attending to it is a safe "default" that doesn't corrupt the representation with irrelevant information. It functions as a learned no-op attention pattern.
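One way to spot such a "no-op" head is to measure how much of its attention mass lands on [SEP]. The sketch below does this for a single head, using a made-up 4 x 4 attention matrix. The function name `sep_attention_share` is hypothetical; with a real model, the matrix could come from the attention weights that Hugging Face models return when called with `output_attentions=True`.

```python
def sep_attention_share(tokens, attention):
    """Average attention weight that query tokens place on [SEP] positions.

    attention[i][j] is the weight query token i puts on key token j;
    each row is expected to sum to 1.
    """
    sep_positions = [i for i, tok in enumerate(tokens) if tok == "[SEP]"]
    shares = [sum(row[j] for j in sep_positions) for row in attention]
    return sum(shares) / len(shares)

tokens = ["[CLS]", "the", "cat", "[SEP]"]
# Made-up attention weights for one head.
attention = [
    [0.10, 0.10, 0.10, 0.70],
    [0.20, 0.10, 0.10, 0.60],
    [0.10, 0.20, 0.10, 0.60],
    [0.25, 0.25, 0.25, 0.25],
]
share = sep_attention_share(tokens, attention)
print(f"{share:.4f}")
```

A head whose share stays high across many different inputs is a candidate separator head.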

Contextualization across layers

BERT's representations become increasingly contextualized as you go deeper. In the first layer, the representation of "bank" is nearly identical regardless of context. By layer 12, the representation of "bank" in "river bank" is very different from "bank" in "bank account." This progressive contextualization is what makes BERT's representations so powerful — they capture not just what a word is but what it means in this specific context.

python
# Probing BERT's layers: measuring contextual similarity
from transformers import BertModel, BertTokenizer
import torch

model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

sent1 = "I went to the bank to deposit money"
sent2 = "I sat on the bank of the river"

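# Get all layer outputs (inference only, so no gradients are needed)
model.eval()
tok1 = tokenizer(sent1, return_tensors='pt')
tok2 = tokenizer(sent2, return_tensors='pt')
with torch.no_grad():
    out1 = model(**tok1).hidden_states  # tuple of 13 tensors (emb + 12 layers)
    out2 = model(**tok2).hidden_states

# Locate "bank" in each sentence instead of hard-coding its position
idx1 = tokenizer.convert_ids_to_tokens(tok1['input_ids'][0].tolist()).index('bank')
idx2 = tokenizer.convert_ids_to_tokens(tok2['input_ids'][0].tolist()).index('bank')

for layer in [0, 3, 6, 9, 12]:
    v1 = out1[layer][0, idx1]  # "bank" in sentence 1
    v2 = out2[layer][0, idx2]  # "bank" in sentence 2
    sim = torch.nn.functional.cosine_similarity(v1, v2, dim=0).item()
    print(f"Layer {layer}: cosine similarity = {sim:.3f}")
# Layer 0 is the embedding output. "bank" sits at position 5 in both
# sentences, so its token + position + segment embeddings are identical
# and the similarity is 1.000 (no context has been mixed in yet).
# In deeper layers the two vectors typically diverge as each absorbs its
# own context; the exact values depend on the checkpoint.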

What does research show about what BERT's different layers learn?

Lower layers (0-2) capture surface/syntactic features, middle layers (3-8) capture syntactic structure and semantic roles, and upper layers (9-11) capture task-specific features, effectively recreating the classical NLP pipeline without any explicit supervision
All layers learn the same features, just with increasing confidence
Lower layers learn semantics and upper layers learn syntax

Chapter 9: Connections

BERT didn't emerge in isolation — it built on years of work in transfer learning and language modeling, and it spawned an entire family of successors that addressed its limitations.

What BERT built on

Predecessor	Contribution to BERT
Word2Vec (2013)	Showed that pre-training word representations on unlabeled text transfers to downstream tasks
GloVe (2014)	Global matrix factorization for word vectors — but still static (one vector per word)
Transformer (2017)	The encoder architecture BERT uses directly — multi-head self-attention without recurrence
ELMo (2018)	Popularized deep contextualized word representations, but with shallow bidirectionality (two separate LSTMs)
GPT-1 (2018)	Transformer-based pre-training + fine-tuning — but unidirectional (left-to-right only)
ULMFiT (2018)	Demonstrated that fine-tuning pre-trained LMs works well — BERT scaled this idea up

The BERT family

Successor	Key Improvement	Year
RoBERTa	More data (160GB), no NSP, dynamic masking, larger batches, longer training	2019
ALBERT	Parameter sharing across layers, factorized embedding, SOP replaces NSP	2020
SpanBERT	Masks contiguous spans instead of random tokens, no NSP	2020
DistilBERT	6-layer distilled version, 97% of BERT's performance at 60% of the size	2019
DeBERTa	Disentangled attention (separate content and position), enhanced mask decoder	2021
XLNet	Permutation language modeling — bidirectional without masking	2019
ELECTRA	Replaced token detection instead of masked prediction — trains on all tokens	2020

BERT vs GPT: two philosophies

BERT and GPT represent two fundamentally different approaches to language AI, and the field eventually chose GPT's path:

Dimension	BERT (encoder)	GPT (decoder)
Direction	Bidirectional	Left-to-right
Pre-training	MLM (predict masked tokens)	LM (predict next token)
Adaptation	Fine-tune for each task	In-context learning / prompting
Generation	Cannot generate text naturally	Excellent at generation
Scaling	Rarely scaled past ~1B parameters	Scales to 100B+ with consistent gains
Paradigm	One model per task	One model for all tasks

Why GPT won: BERT excels at understanding but is poorly suited to generating text, because it is trained to fill in masked tokens using context on both sides rather than to extend a sequence one token at a time. As the field shifted from classification tasks (where BERT shines) to generative tasks (chatbots, code generation, reasoning), GPT's unidirectional approach proved more versatile. You can do understanding tasks with GPT (via prompting), but you can't get fluent open-ended generation out of BERT without awkward workarounds. The GPT-3 → ChatGPT → GPT-4 trajectory showed that unidirectional models at sufficient scale can match or exceed BERT on understanding tasks while also excelling at generation. Scale compensated for the loss of bidirectionality.
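The "Direction" row in the table above comes down to the attention mask. The sketch below builds both masks for the "bank" sentence from Chapter 0 and prints what the token "bank" is allowed to see under each. The function names are hypothetical, and real implementations build these masks as tensors rather than nested lists.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

tokens = "the man went to the bank to deposit his check".split()
i = tokens.index("bank")

def visible(mask_row):
    """Tokens that the query position is allowed to attend to."""
    return [tok for tok, allowed in zip(tokens, mask_row) if allowed]

encoder_view = visible(bidirectional_mask(len(tokens))[i])
decoder_view = visible(causal_mask(len(tokens))[i])
print("encoder sees:", encoder_view)
print("decoder sees:", decoder_view)
```

Under the causal mask, "deposit his check" is invisible when "bank" is encoded. That same restriction is what makes it possible to generate text one token at a time.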

BERT's lasting legacy

Even though GPT-style models dominate today, BERT's contributions remain foundational:

BERT proved three things that shaped the entire field:
1. Pre-train once, fine-tune everywhere — a single pre-trained model can be adapted to many tasks with minimal effort.
2. Scale + right objective = emergent capabilities — BERT spontaneously learned syntax, semantics, and world knowledge from raw text.
3. Bidirectional context matters for understanding — even if the field chose unidirectional models for their generality, BERT proved that bidirectionality provides genuinely better representations for comprehension tasks.

BERT-based models remain the default choice for production NLP tasks that don't require generation: search ranking (Google used BERT in its search engine starting 2019), sentiment analysis, entity extraction, text classification, and semantic similarity. When you need fast, accurate understanding of fixed text — not open-ended generation — BERT is still hard to beat.

"BERT is deeply bidirectional, GPT is unidirectional. There are advantages to both, but the big advantage of BERT is that it can be used for understanding tasks more effectively."
— Jacob Devlin

Why did the field ultimately favor GPT-style (decoder-only, unidirectional) models over BERT-style (encoder-only, bidirectional) models for general-purpose AI?

Because BERT is slower to train
Because BERT has more parameters
Because GPT-style models can both understand and generate text, while BERT can only understand; at sufficient scale, GPT models match BERT on understanding tasks while also excelling at generation, making them more versatile
