Pre-training of Deep Bidirectional Transformers for Language Understanding — one model, pre-trained on unlabeled text, then fine-tuned for any NLP task.
Imagine you're reading the sentence: "The man went to the bank to deposit his check." What does "bank" mean? A financial institution, obviously. But now consider: "The man went to the bank to catch some fish." Same word, completely different meaning. How did you know? Because you read the entire sentence — including the words after "bank."
Before BERT, the dominant language model paradigm was left-to-right. Models like GPT read text strictly from left to right — when processing "bank," they could only see "The man went to the" — the five words before it. The critical disambiguating context ("deposit his check" vs "catch some fish") comes after the word, so a left-to-right model is flying blind.
This is a fundamental limitation. In natural language, meaning flows in both directions. The subject constrains the verb, but the verb also constrains the subject. The object constrains the preposition, but the preposition also constrains the object. A truly capable language understanding system needs to see context from both sides simultaneously.
Previous attempts at bidirectionality existed. ELMo (Peters et al., 2018) trained two separate LSTMs — one left-to-right and one right-to-left — and concatenated their hidden states. But this is "shallow" bidirectionality: the two directions never interact during encoding. The left-to-right LSTM doesn't know what the right-to-left LSTM is thinking, and vice versa.
What we really want is deep bidirectionality — a model where, at every layer, every token can attend to every other token in both directions. The Transformer encoder architecture provides exactly this: self-attention has no inherent directionality. Token 5 can attend to token 3 (left context) and token 7 (right context) with equal ease. The Transformer is naturally bidirectional — the bottleneck was never the architecture, but the training objective.
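The two visibility patterns can be sketched as attention masks. This is a toy illustration rather than BERT's or GPT's actual code; the helper name `attention_mask` is made up for the example, and real implementations also handle padding.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix of 0/1 flags.

    mask[i][j] == 1 means position i may attend to position j.
    """
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

tokens = "the man went to the bank to deposit his check".split()
i = tokens.index("bank")  # position 5 (0-indexed)

causal = attention_mask(len(tokens), causal=True)   # decoder-style, left-to-right
full = attention_mask(len(tokens), causal=False)    # encoder-style, bidirectional

# Words visible from the position of "bank" under each mask
left_to_right_view = [t for t, m in zip(tokens, causal[i]) if m]
bidirectional_view = [t for t, m in zip(tokens, full[i]) if m]
```

Under the causal mask, the view from "bank" stops at "bank" itself, so "deposit his check" never enters its representation. Under the full mask, the view is the whole sentence.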
The problem is: how do you train a deep bidirectional model? You can't use standard language modeling (predict the next token) because in a bidirectional model, the "answer" would leak through the attention mechanism. If every token can attend to every other token, and you're trying to predict token 5, the model can simply look at position 5 and read the answer. The training signal would be trivial and the model would learn nothing.
This is the catch-22 that blocked bidirectional pre-training for years. GPT chose left-to-right and sacrificed bidirectionality. ELMo used shallow bidirectionality as a compromise. BERT found a different solution entirely: masking.
Instead of predicting the next token, BERT randomly masks some tokens and asks the model to predict them from context. Since the masked token is replaced with a special [MASK] placeholder, the model can't cheat — it must genuinely use the surrounding context (from both directions) to reconstruct the missing word. This is the Masked Language Model (MLM) objective, and it's the key innovation that makes deep bidirectional pre-training possible.
| Approach | Direction | Context for "bank" | Architecture |
|---|---|---|---|
| GPT | Left-to-right | "The man went to the" | Transformer decoder |
| ELMo | Shallow bidir | Left and right context, encoded by two LSTMs that never interact | Stacked biLSTM |
| BERT | Deep bidir | Entire sentence, all layers | Transformer encoder |
BERT's impact was immediate and devastating to the status quo. It set new state-of-the-art results on 11 NLP benchmarks simultaneously upon release, improving the best previous results by large margins (e.g., 7.7 points on GLUE and 5.1 F1 on SQuAD 2.0). It showed that a single pre-trained model, fine-tuned with minimal task-specific modifications, could outperform complex, task-specific architectures that had been engineered over years. This was the "ImageNet moment" for NLP.
BERT's primary pre-training objective is the Masked Language Model (MLM). The idea is deceptively simple: randomly mask some percentage of the input tokens, then train the model to predict the original tokens from the corrupted input.
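Specifically, for each training sequence, BERT randomly selects 15% of the token positions for prediction. But there's a subtle twist in how these selected tokens are treated:
- 80% of the time, the token is replaced with [MASK].
- 10% of the time, the token is replaced with a random token from the vocabulary.
- 10% of the time, the token is left unchanged.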
Why this 80/10/10 split? It addresses a critical problem: the mismatch between pre-training and fine-tuning.
During fine-tuning, the model never sees the [MASK] token — real inputs don't have masks. If the model learned to rely on the presence of [MASK] as a signal ("oh, there's a mask, I should predict something"), that skill wouldn't transfer to fine-tuning. The 10% random replacement forces the model to stay uncertain about which tokens are corrupted, so it must maintain a good representation of every token. The 10% unchanged tokens ensure the model learns that even "correct-looking" tokens may need prediction.
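The selection and 80/10/10 corruption can be sketched in a few lines. This is a simplified illustration, not the reference implementation: the function name `mask_tokens` is hypothetical, each position is selected independently with probability 0.15 (the original code instead picks a fixed number of positions per sequence), and the random replacement is drawn uniformly from all 30,522 IDs.

```python
import random

MASK_ID = 103        # [MASK] in the bert-base-uncased vocabulary
VOCAB_SIZE = 30522
IGNORE = -100        # label value meaning "no loss at this position"

def mask_tokens(input_ids, special_ids=(0, 101, 102), mask_prob=0.15, rng=random):
    """Apply BERT-style 80/10/10 corruption; return (corrupted_ids, labels)."""
    corrupted = list(input_ids)
    labels = [IGNORE] * len(input_ids)
    for pos, tok in enumerate(input_ids):
        if tok in special_ids or rng.random() >= mask_prob:
            continue                      # not selected for prediction
        labels[pos] = tok                 # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = MASK_ID      # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[pos] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

# [CLS] the cat sat on the mat [SEP]
ids = [101, 1996, 4937, 2938, 2006, 1996, 13523, 102]
corrupted, labels = mask_tokens(ids, rng=random.Random(0))
```

Note that the labels mark every selected position, including the ones left unchanged. The model cannot tell from the input alone which tokens it will be asked to predict.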
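The MLM loss is simply the cross-entropy loss between the model's predicted distribution and the true token, summed only over the masked positions:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log p\left(x_i \mid x_{\backslash \text{masked}}\right)$$

Where $\mathcal{M}$ is the set of positions selected for prediction, $x_i$ is the true token at position $i$, $x_{\backslash \text{masked}}$ is the input with masked tokens replaced, and $p(x_i \mid x_{\backslash \text{masked}})$ is the model's predicted probability for the correct token at position $i$.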
Let's walk through a concrete example. Take the sentence "The cat sat on the mat" with 6 tokens. With 15% masking, we'd mask about 1 token (0.15 × 6 ≈ 1). Say we select "cat" (position 2):
```python
# MLM: one training step
import torch
import torch.nn.functional as F
from transformers import BertConfig, BertForMaskedLM

# Randomly initialized BERT-Base with an MLM head (stands in for a model
# at the very start of pre-training; no checkpoint download needed)
model = BertForMaskedLM(BertConfig())

# Original tokens: [CLS] the cat sat on the mat [SEP]
input_ids = [101, 1996, 4937, 2938, 2006, 1996, 13523, 102]

# Mask "cat" (position 2): with 80% probability it becomes [MASK] = 103
masked_input = [101, 1996, 103, 2938, 2006, 1996, 13523, 102]

# Labels: -100 means "don't compute loss here"
labels = [-100, -100, 4937, -100, -100, -100, -100, -100]

# Forward pass: logits have shape [1, 8, vocab_size=30522]
logits = model(torch.tensor([masked_input])).logits

# Cross-entropy over the vocabulary, only where labels != -100 (position 2)
loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),  # [8, 30522]
    torch.tensor(labels),              # [8]
    ignore_index=-100,
)
# An untrained model starts near ln(30522) ≈ 10.3; the loss falls as
# pre-training progresses.
```
Notice that the model receives the full bidirectional context ("The [MASK] sat on the mat") and must predict "cat" from that context. It sees "sat" and "on the mat" to the right, and "The" to the left. Both directions contribute to the prediction. This is exactly the deep bidirectional conditioning that GPT-style models can't do.
MLM is, mathematically, a denoising autoencoder. The input is corrupted (masking), passed through the model, and the model must reconstruct the original. This connection to the autoencoder literature (Vincent et al., 2008) was not an accident: Devlin et al. cite it directly. The key differences are that BERT operates on discrete tokens (not continuous pixels), uses a cross-entropy loss, and predicts only the masked positions rather than reconstructing the entire input.
The denoising perspective also explains why MLM works so well as a pre-training objective: by learning to reconstruct corrupted text, the model must build rich internal representations that capture syntax, semantics, and world knowledge. A model that can correctly predict "[MASK] sat on the mat" → "cat" must know that cats sit on mats, that "sat" is past tense, that articles precede nouns — all learned implicitly from the reconstruction task.
One downside of MLM compared to standard language modeling: only 15% of tokens contribute to the training loss per step. In a left-to-right language model like GPT, every token contributes — the model predicts the next token at every position. This means MLM needs roughly 6-7x more training steps to see the same number of prediction tasks. BERT compensates with a larger batch size (256 sequences) and longer training (1M steps), but this inefficiency is a real cost.
Many important NLP tasks — question answering, natural language inference, paraphrase detection — require understanding the relationship between two sentences, not just individual sentence meaning. MLM trains the model to understand tokens in context, but it doesn't explicitly teach the model about inter-sentence relationships.
BERT's second pre-training objective addresses this: Next Sentence Prediction (NSP). The idea is straightforward. Given two sentences A and B, the model must predict whether B actually follows A in the original text (label: IsNext) or whether B is a random sentence from the corpus (label: NotNext).
The input format packs both sentences into a single sequence with special tokens:

[CLS] sentence A [SEP] sentence B [SEP]
[CLS] is a special classification token prepended to every input. Its final hidden state is used as the "aggregate sequence representation" for classification tasks. [SEP] is a separator token that marks the boundary between sentences.
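The NSP prediction uses the final hidden state of the [CLS] token, fed through a linear layer with softmax:

$$p = \text{softmax}\left(W\, h_{[\text{CLS}]}\right) \in \mathbb{R}^2$$

Where $h_{[\text{CLS}]}$ is the d-dimensional hidden state of the [CLS] token at the final layer, $W$ is a 2 × d weight matrix, and the two entries of $p$ are the predicted probabilities of IsNext and NotNext. The loss is the cross-entropy over these two classes.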
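```python
# NSP training example
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
nsp_head = nn.Linear(768, 2)  # untrained head, for illustration only

sent_a = "The cat sat on the mat."
pairs = [
    (sent_a, "It purred softly in the sunlight."),            # positive pair
    (sent_a, "Quantum entanglement enables teleportation."),  # negative pair
]
labels = torch.tensor([1, 0])  # 1 = IsNext, 0 = NotNext

# Input to BERT:  [CLS] sent_a [SEP] sent_b [SEP]
# Token type IDs:   0    0...0   0    1...1   1
# The type IDs tell BERT which tokens belong to sentence A vs. B
enc = tokenizer([a for a, _ in pairs], [b for _, b in pairs],
                padding=True, return_tensors='pt')

# Forward pass
cls_hidden = model(**enc).last_hidden_state[:, 0, :]  # [batch, 768]
nsp_logits = nsp_head(cls_hidden)                     # [batch, 2]
nsp_loss = F.cross_entropy(nsp_logits, labels)
```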
NSP was perhaps the most debated design choice in BERT. Later papers showed it may actually hurt performance:
| Paper | Finding |
|---|---|
| RoBERTa (Liu et al., 2019) | Removing NSP improves results on most benchmarks. NSP's benefit came from the sentence-pair formatting, not the objective itself. |
| ALBERT (Lan et al., 2020) | Replaced NSP with Sentence Order Prediction (SOP) — predict whether A-B or B-A is the correct order. SOP is harder and more useful. |
| SpanBERT (Joshi et al., 2020) | Removing NSP and using single-sentence inputs with span masking outperforms BERT on most tasks. |
The consensus that emerged: NSP is too easy. When the negative pairs are random sentences from different documents, they're so obviously unrelated that the model can solve NSP by detecting topic similarity alone, without learning genuine discourse coherence. SOP (same document, swapped order) forces the model to learn actual sentence ordering, which is a harder and more useful skill.
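The difference between the two objectives comes down to how the negative pair is built. The sketch below is illustrative only: the function names `nsp_example` and `sop_example` and the toy documents are made up, and real pipelines work on tokenized spans rather than raw sentence strings.

```python
import random

def nsp_example(doc, other_doc, i, rng):
    """Return (sentence_a, sentence_b, label); label 1 = IsNext, 0 = NotNext."""
    a = doc[i]
    if rng.random() < 0.5:
        return a, doc[i + 1], 1            # true continuation
    return a, rng.choice(other_doc), 0     # random sentence from another document

def sop_example(doc, i, rng):
    """Return (sentence_a, sentence_b, label); label 1 = original order, 0 = swapped."""
    a, b = doc[i], doc[i + 1]
    if rng.random() < 0.5:
        return a, b, 1
    return b, a, 0                         # same two sentences, order swapped

doc = [
    "The cat sat on the mat.",
    "It purred softly in the sunlight.",
    "Then it fell asleep.",
]
other = ["Quantum entanglement links the states of two particles."]
rng = random.Random(0)
```

An NSP negative changes the topic, so topic overlap is enough to spot it. A SOP negative keeps both sentences and only reverses them, so the model has to pay attention to order.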
BERT's total pre-training loss is simply the sum of the MLM and NSP losses:

$$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}$$
Both losses are computed on every training example. The MLM loss provides the rich token-level understanding signal, while NSP provides a weaker sentence-level signal. In practice, MLM does most of the heavy lifting — which is why removing NSP doesn't hurt much.
BERT's representations are only as good as the data they're learned from. Devlin et al. used two unlabeled text corpora totaling approximately 3.3 billion words:
| Corpus | Size | Description |
|---|---|---|
| BooksCorpus | ~800M words | 11,038 unpublished books from smashwords.com. Long-form, coherent text across many genres. Critical for learning long-range dependencies and narrative structure. |
| English Wikipedia | ~2,500M words | Text content only (no tables, lists, or headers). Covers factual knowledge across all domains. Provides encyclopedic breadth. |
By 2018 standards, 3.3 billion words was substantial but not extreme. For context, GPT-2 (released months later) trained on WebText, roughly 40GB of text from about 8 million web pages, and GPT-3 trained on 300 billion tokens. BERT showed that even with relatively modest data, the right training objective (MLM + bidirectionality) could produce remarkable results.
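Each training input is a pair of "sentences" (actually arbitrary spans of contiguous text) sampled from the corpus. The preprocessing pipeline works as follows:
1. Tokenize each document into WordPiece tokens.
2. Sample span A from a document. Half of the time, span B is the text that actually follows A (IsNext); otherwise B is a random span from another document (NotNext).
3. Pack the pair as [CLS] A [SEP] B [SEP], trimming so the combined length is at most 512 tokens.
4. Select 15% of the token positions for MLM and apply the 80/10/10 replacement rule.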
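One step of that pipeline, packing two spans into a single input with segment IDs, can be sketched as follows. This is a minimal illustration under stated assumptions: the name `pack_pair` is hypothetical, and it trims tokens from the end of the longer span, whereas the reference implementation trims randomly from the front or back.

```python
def pack_pair(tokens_a, tokens_b, max_len=512):
    """Pack two token lists as [CLS] A [SEP] B [SEP] with segment IDs."""
    a, b = list(tokens_a), list(tokens_b)
    # Reserve three slots for [CLS], [SEP], [SEP]; trim the longer span first
    while len(a) + len(b) > max_len - 3:
        longer = a if len(a) >= len(b) else b
        longer.pop()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    # Segment 0 covers [CLS], A, and the first [SEP]; segment 1 covers B and the last [SEP]
    segment_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return tokens, segment_ids

tokens, segment_ids = pack_pair(["the", "cat"], ["it", "purred"])
```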
The maximum sequence length is 512 tokens (including [CLS] and [SEP] tokens). This was a GPU memory constraint in 2018 — longer sequences require quadratically more memory for the attention matrix (512² = 262K entries per head per layer). Modern models have pushed this to 2K, 4K, 8K, and beyond using techniques like FlashAttention and sparse attention.
BERT uses a clever training efficiency trick: for the first 90% of training steps, the maximum sequence length is reduced to 128 tokens. Only the final 10% of training uses the full 512-token length. This works because:
1. Short sequences are much cheaper: 128² / 512² = 1/16 the attention cost.
2. Most of language understanding can be learned from local context (nearby words).
3. The final 10% at 512 tokens teaches long-range dependencies without paying full cost for the entire training run.
Because 90% of the steps run at the cheaper sequence length, this trick cuts total pre-training cost substantially while still exposing the model to full-length inputs before training ends.
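A rough sense of the saving comes from counting attention-matrix entries. The calculation below is back-of-the-envelope arithmetic, not a measurement: it covers only the quadratic attention term for a single sequence and ignores the feed-forward layers, whose cost shrinks by 4x rather than 16x.

```python
def attention_entries(seq_len):
    """Number of entries in one attention matrix (per head, per layer)."""
    return seq_len * seq_len

short, full = attention_entries(128), attention_entries(512)
ratio = short / full                      # 1/16 = 0.0625

# Average per-sequence attention cost of the 90/10 schedule,
# relative to training at 512 tokens throughout
relative_cost = 0.9 * ratio + 0.1 * 1.0   # 0.15625
```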
| Parameter | Value | Rationale |
|---|---|---|
| Batch size | 256 sequences | Large batch for stable gradients |
| Steps | 1,000,000 | ~40 epochs over the 3.3B word corpus |
| Optimizer | Adam (β₁=0.9, β₂=0.999) | Standard adaptive optimizer |
| Learning rate | 1e-4 (with linear warmup) | 10,000 warmup steps, then linear decay |
| Dropout | 0.1 on all layers | Regularization against overfitting |
| Weight decay | 0.01 (L2) | Additional regularization |
| Hardware | BERT-Base: 4 Cloud TPUs (16 TPU chips); BERT-Large: 16 Cloud TPUs (64 TPU chips) | Each model took about 4 days to pre-train |
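The learning-rate row can be read as a simple piecewise-linear schedule. The sketch below is one common way to write it, using the values from the table; the function name `bert_lr` is made up, and the reference implementation differs in a small detail (it computes the decay from step 0 rather than from the end of warmup).

```python
def bert_lr(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

# Learning rate at a few points in training
schedule = {s: bert_lr(s) for s in (0, 5_000, 10_000, 505_000, 1_000_000)}
```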
BERT uses the Transformer encoder — specifically, just the encoder half of the original Transformer (Vaswani et al., 2017). No decoder, no cross-attention, no autoregressive masking. Pure bidirectional self-attention, stacked into deep layers.
Two model sizes were released:
| Parameter | BERT-Base | BERT-Large |
|---|---|---|
| Layers (L) | 12 | 24 |
| Hidden size (H) | 768 | 1024 |
| Attention heads (A) | 12 | 16 |
| Head dimension (H/A) | 64 | 64 |
| FFN intermediate | 3072 (4×H) | 4096 (4×H) |
| Total parameters | 110M | 340M |
| Vocab size | 30,522 | 30,522 |
| Max sequence length | 512 | 512 |
BERT-Base was deliberately sized to match GPT-1 (also 12 layers, 768 hidden units, and 12 attention heads, for a comparable parameter count) to enable a fair comparison: same capacity, different training objective (bidirectional MLM vs unidirectional LM). BERT-Base outperformed GPT-1 on every GLUE task, pointing to the bidirectional objective, not model size, as the key ingredient.
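The parameter totals in the table can be approximately reproduced from L and H alone. The count below is a sketch that assumes the standard layout (three embedding tables, post-norm encoder layers with biases, a 4×H feed-forward width, and a pooling layer on [CLS]); it leaves out the pre-training output heads, so it lands slightly under the rounded 110M and 340M figures.

```python
def bert_param_count(L, H, vocab=30522, max_pos=512, segments=2):
    """Count parameters of a BERT encoder with an FFN width of 4*H."""
    layer_norm = 2 * H                                   # scale + bias
    embeddings = (vocab + max_pos + segments) * H + layer_norm
    attention = 4 * (H * H + H)                          # Q, K, V, output projections
    ffn = (H * 4 * H + 4 * H) + (4 * H * H + H)          # two linear layers with biases
    per_layer = attention + ffn + 2 * layer_norm
    pooler = H * H + H                                   # dense layer on [CLS]
    return embeddings + L * per_layer + pooler

base = bert_param_count(L=12, H=768)     # 109,482,240
large = bert_param_count(L=24, H=1024)   # 335,141,888
```

Most of the parameters sit in the encoder layers (about 85M of BERT-Base), with the token embedding table contributing roughly another 23M.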
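Every layer in BERT follows the same structure:
1. Multi-head self-attention over all tokens
2. Residual connection + layer normalization
3. Position-wise feed-forward network (FFN) with GELU activation
4. Residual connection + layer normalization

The residual connections and layer normalization follow the original Transformer design. Each sub-layer (attention and FFN) has a residual connection around it, followed by layer normalization:

$$\text{output} = \text{LayerNorm}(x + \text{SubLayer}(x))$$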
This is the "post-norm" pattern from the original Transformer. (Later models like GPT-2 switched to "pre-norm": output = x + SubLayer(LayerNorm(x)), which is more stable for deep networks.)
BERT's input is constructed from three embedding types, summed together:
| Embedding | Purpose | Details |
|---|---|---|
| Token embedding | Maps each WordPiece token to a vector | Learned, 30,522 × 768 |
| Segment embedding | Distinguishes sentence A from sentence B | Learned, 2 × 768 (only two segments) |
| Position embedding | Encodes position in the sequence | Learned, 512 × 768 (not sinusoidal) |
Unlike the original Transformer, which used fixed sinusoidal position encodings, BERT uses learned position embeddings. Each of the 512 possible positions has its own learned 768-dimensional vector. This means BERT cannot handle sequences longer than 512 tokens, because there is no learned embedding for any position beyond the 512th. Later position-encoding schemes such as RoPE and ALiBi address this with relative or extrapolatable encodings.
BERT was one of the first major models to use the GELU (Gaussian Error Linear Unit) activation function instead of ReLU in the FFN layers:

$$\text{GELU}(x) = x \cdot \Phi(x)$$
Where Φ(x) is the standard Gaussian CDF. GELU is smooth (no sharp kink at 0 like ReLU) and stochastically gates the input based on its magnitude. Larger positive values pass through almost unchanged; negative values are suppressed but not completely zeroed (unlike ReLU). This smoother behavior is believed to help optimization in deep Transformer networks.
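A small numeric comparison makes the difference from ReLU concrete. The sketch below computes the exact form via the error function, plus the widely used tanh approximation; the function names are illustrative.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard Gaussian CDF at x."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def relu(x):
    return max(0.0, x)

# Compare the activations on a few inputs
table = [(x, relu(x), gelu(x), gelu_tanh(x)) for x in (-3.0, -1.0, 0.0, 1.0, 3.0)]
```

At x = -1, ReLU outputs exactly 0 while GELU outputs about -0.159, so a small gradient still flows for mildly negative inputs.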
```python
import torch
import torch.nn as nn

class BertLayer(nn.Module):
    def __init__(self, H=768, A=12, intermediate=3072):
        super().__init__()
        # Multi-head self-attention
        self.attn = nn.MultiheadAttention(H, A, batch_first=True)
        self.ln1 = nn.LayerNorm(H)
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(H, intermediate),  # 768 → 3072
            nn.GELU(),                   # smooth activation
            nn.Linear(intermediate, H),  # 3072 → 768
        )
        self.ln2 = nn.LayerNorm(H)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        # x: [batch, seq_len, 768]
        attn_out, _ = self.attn(x, x, x)          # self-attention
        x = self.ln1(x + self.dropout(attn_out))  # residual + norm
        ffn_out = self.ffn(x)
        x = self.ln2(x + self.dropout(ffn_out))   # residual + norm
        return x                                  # [batch, seq_len, 768]
```
How do you represent text as numbers for a neural network? The naive approach — one token per word — fails for two reasons. First, the vocabulary would be enormous (English has ~170,000 words in common use, plus names, technical terms, and neologisms). Second, any word not in the vocabulary (an "out-of-vocabulary" or OOV word) can't be processed at all.
BERT uses WordPiece tokenization (Schuster & Nakajima, 2012), a subword method that strikes a balance between character-level and word-level tokenization. The key idea: common words stay as single tokens, but rare words are split into smaller subword pieces.
| Input Word | WordPiece Tokens | Why |
|---|---|---|
| "the" | ["the"] | Common word → single token |
| "cat" | ["cat"] | Common word → single token |
| "playing" | ["play", "##ing"] | Splits into stem + suffix |
| "unbelievable" | ["un", "##bel", "##iev", "##able"] | Rare word → subword pieces |
| "transformers" | ["transform", "##ers"] | Stem is common enough |
| "xyzzy123" | ["x", "##y", "##z", "##zy", "##12", "##3"] | Unknown word → character fallback |
The "##" prefix indicates a continuation piece — a subword that's not at the start of a word. "play" is a word-initial piece; "##ing" is a continuation. This lets the model distinguish between "playing" (["play", "##ing"]) and "play" + "ing" as separate words.
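At inference time, splitting a word into pieces is a greedy longest-match-first search over the vocabulary. The sketch below shows that search on a toy vocabulary; `wordpiece_tokenize` and `toy_vocab` are illustrative names, and the real tokenizer also lowercases, splits on punctuation, and caps word length before this step.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1                            # try a shorter substring
        if match is None:
            return [unk]                        # no piece fits: whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"the", "cat", "play", "##ing", "transform", "##ers"}
playing = wordpiece_tokenize("playing", toy_vocab)   # ['play', '##ing']
```

Because "ing" only exists in the vocabulary as "##ing", it can match in the middle of a word but never at the start of one.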
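The WordPiece vocabulary is constructed through an iterative greedy algorithm similar to Byte Pair Encoding (BPE):
1. Start with a vocabulary of individual characters.
2. Score every pair of adjacent units in the training corpus.
3. Merge the highest-scoring pair into a new vocabulary entry.
4. Repeat until the vocabulary reaches the target size (30,522 for BERT).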
The key difference between WordPiece and BPE is the scoring function. BPE simply merges the most frequent pair. WordPiece merges the pair with the highest mutual information — the pair where combining them provides the most information beyond what each piece provides alone. This tends to produce linguistically more meaningful subwords.
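The two scoring rules can be compared on a toy corpus. This sketch assumes the commonly described WordPiece criterion, pair frequency divided by the product of the two unit frequencies, which behaves like a mutual-information score; the original training code was never released, so treat the exact formula as an assumption. The name `pair_scores` and the corpus counts are made up.

```python
from collections import Counter

def pair_scores(corpus_words):
    """Score adjacent symbol pairs under BPE and a WordPiece-style criterion.

    corpus_words maps a tuple of symbols (one word) to its corpus frequency.
    """
    unit_freq, pair_freq = Counter(), Counter()
    for symbols, count in corpus_words.items():
        for s in symbols:
            unit_freq[s] += count
        for left, right in zip(symbols, symbols[1:]):
            pair_freq[(left, right)] += count
    bpe = dict(pair_freq)                            # raw pair frequency
    wordpiece = {
        (l, r): f / (unit_freq[l] * unit_freq[r])    # frequency normalized by the parts
        for (l, r), f in pair_freq.items()
    }
    return bpe, wordpiece

corpus = {("t", "##h"): 50, ("t", "##o"): 30, ("q", "##u"): 5}
bpe, wp = pair_scores(corpus)
best_bpe = max(bpe, key=bpe.get)   # ('t', '##h'): the most frequent pair
best_wp = max(wp, key=wp.get)      # ('q', '##u'): rare, but the parts always co-occur
```

BPE picks "t" + "##h" because it is the most common pair. The WordPiece-style score prefers "q" + "##u" because neither unit ever appears without the other in this corpus.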
BERT's vocabulary of 30,522 tokens covers English text efficiently. The average English word is split into 1.1-1.5 WordPiece tokens, so the effective sequence length (in words) is roughly 340-465 words for a 512-token input. This is plenty for most NLP tasks — GLUE benchmarks have average input lengths of 20-60 words.
| Token | ID | Purpose |
|---|---|---|
| [CLS] | 101 | Prepended to every input. Its final hidden state is the sequence representation for classification. |
| [SEP] | 102 | Separates sentence A from sentence B. Also appended to the end. |
| [MASK] | 103 | Replaces tokens during MLM pre-training. |
| [PAD] | 0 | Padding for sequences shorter than max_length. |
| [UNK] | 100 | Fallback for characters not in the vocabulary (rare with WordPiece). |
This is where BERT's design pays off spectacularly. After pre-training on 3.3 billion words of unlabeled text, BERT can be fine-tuned for virtually any NLP task with minimal architectural changes — often just a single output layer on top of the pre-trained model.
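The fine-tuning recipe is always the same:
1. Initialize the model with the pre-trained BERT weights.
2. Add a small task-specific output layer on top.
3. Train all parameters end-to-end on the labeled task data for a few epochs.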
The key insight: all parameters are fine-tuned, not just the task head. The pre-trained weights shift slightly to adapt to the task, but they retain the vast majority of their pre-trained knowledge. This is why fine-tuning works with so little data — the model doesn't need to learn English from scratch; it just needs to learn the task-specific mapping.
BERT handles four major task types with minimal modifications:
Single-sentence classification (e.g., sentiment analysis): use the [CLS] token's final hidden state as the sentence representation. Add a linear layer: h[CLS] → logits.
Sentence-pair classification (e.g., natural language inference): the input is [CLS] sentence A [SEP] sentence B [SEP]. Same as above: use h[CLS] for classification.
Question answering (span extraction): the input is [CLS] question [SEP] passage [SEP]. For each passage token, predict two things: "is this the start of the answer?" and "is this the end of the answer?"
The answer span is the highest-scoring (start, end) pair where start ≤ end.
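Picking that span is a small search over (start, end) pairs. The sketch below assumes per-token start and end scores are already available as plain lists. The name `best_span` and the `max_answer_len` cap are illustrative additions, not something this article specifies.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximizing start_scores[s] + end_scores[e] with s <= e."""
    best = None
    for s, s_score in enumerate(start_scores):
        last = min(len(end_scores), s + max_answer_len)
        for e in range(s, last):              # enforce start <= end
            score = s_score + end_scores[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

start = [0.1, 2.0, 0.3, 0.2]
end = [3.0, 0.1, 1.5, 0.4]
span = best_span(start, end)   # (1, 2, 3.5)
```

Taking the best start and best end independently would give start=1 and end=0 here, which is not a valid span. The joint search with the ordering constraint returns tokens 1 through 2 instead.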
Token classification (e.g., named entity recognition): for each token, predict its entity tag (Person, Organization, Location, or None). Use each token's final hidden state with a linear classifier.
```python
# Fine-tuning BERT for sentence classification
from transformers import BertModel
import torch.nn as nn

class BertClassifier(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.classifier = nn.Linear(768, num_classes)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        cls_output = outputs.last_hidden_state[:, 0, :]  # [batch, 768]
        cls_output = self.dropout(cls_output)
        logits = self.classifier(cls_output)             # [batch, num_classes]
        return logits

# Fine-tuning hyperparameters (from the paper)
# Learning rate: 2e-5, 3e-5, or 5e-5
# Batch size: 16 or 32
# Epochs: 2, 3, or 4
# That's it: BERT pre-training did the hard work
```
| Benchmark | Previous SOTA | BERT-Large | Improvement |
|---|---|---|---|
| GLUE | 72.8 | 80.5 | +7.7 pts |
| SQuAD 1.1 (F1) | 91.7 | 93.2 | +1.5 pts |
| SQuAD 2.0 (F1) | 78.0 | 83.1 | +5.1 pts |
| MNLI (acc) | 82.1 | 86.7 | +4.6 pts |
| SST-2 (acc) | 93.2 | 94.9 | +1.7 pts |
One of BERT's greatest contributions was its thorough ablation study. Instead of just presenting results, Devlin et al. systematically removed components to understand what actually matters. These ablations are a masterclass in scientific rigor for ML papers.
What happens when you remove each pre-training objective?
| Model | MNLI | QNLI | MRPC | SST-2 | SQuAD |
|---|---|---|---|---|---|
| BERT (MLM + NSP) | 84.4 | 88.4 | 86.7 | 92.7 | 88.5 |
| No NSP | 83.9 | 84.9 | 86.5 | 92.6 | 87.9 |
| LTR & No NSP (left-to-right instead of MLM) | 82.1 | 84.3 | 77.5 | 92.1 | 77.8 |
| LTR & No NSP + BiLSTM | 82.1 | 84.1 | 75.7 | 91.6 | 84.9 |
The critical comparison is between the "No NSP" and "LTR & No NSP" rows, which differ only in the pre-training objective. Replacing MLM with standard left-to-right language modeling drops performance on every task, with the largest drops on MRPC (86.5 → 77.5) and SQuAD (87.9 → 77.8), where token-level predictions cannot use right-side context. This is the smoking gun: bidirectionality is the key ingredient.
Adding a BiLSTM on top of the LTR model helps on SQuAD (77.8 → 84.9) but hurts on MRPC (77.5 → 75.7) and doesn't help on MNLI. Shallow bidirectionality (BiLSTM) is a poor substitute for deep bidirectionality (MLM + Transformer encoder).
Does bigger always mean better? Devlin et al. trained models at multiple sizes:
| Model | L | H | A | Params | MNLI | MRPC | SQuAD |
|---|---|---|---|---|---|---|---|
| 3-layer | 3 | 768 | 12 | ~45M | 77.9 | 79.8 | 72.7 |
| 6-layer | 6 | 768 | 12 | ~67M | 81.9 | 84.8 | 79.8 |
| BERT-Base | 12 | 768 | 12 | 110M | 84.4 | 86.7 | 88.5 |
| BERT-Large | 24 | 1024 | 16 | 340M | 86.6 | 87.8 | 90.9 |
Clear scaling: more layers and wider hidden dimensions help across all tasks. The jump from Base to Large is substantial (MNLI: 84.4 → 86.6, SQuAD: 88.5 → 90.9). This was early evidence of what became the scaling laws paper (Kaplan et al., 2020): bigger models systematically perform better.
BERT was trained for 1M steps. What if you train for less?
Devlin et al. compared checkpoints during pre-training and found that the long schedule pays off: BERT-Base gains almost 1.0% MNLI accuracy when trained for 1M steps instead of 500k. They also found that MLM converges only slightly slower than LTR, even though MLM gets a training signal from just 15% of tokens per step (the masked ones) while LTR gets one from 100% of tokens (predict next at every position). In absolute accuracy, MLM begins to outperform LTR almost immediately, so the quality of the training signal matters more than the quantity.
Must you fine-tune all of BERT, or can you freeze it and just use the representations? Devlin et al. tested using BERT as a fixed feature extractor for NER:
| Strategy | CoNLL NER F1 |
|---|---|
| Fine-tune all layers | 96.4 |
| Concat last 4 hidden layers (frozen) | 96.1 |
| Sum last 4 hidden layers (frozen) | 95.9 |
| Use last layer only (frozen) | 94.9 |
| Use second-to-last layer (frozen) | 95.6 |
Surprisingly, feature extraction comes close to fine-tuning (96.1 vs 96.4). This means BERT's pre-trained representations are already excellent for NER — fine-tuning provides only marginal improvement. For tasks where fine-tuning is expensive or impossible (e.g., you want to use BERT features in a pipeline with other non-differentiable components), feature extraction is a viable alternative.
Research into BERT's internal representations (Tenney et al. 2019, "BERT Rediscovers the Classical NLP Pipeline") revealed a remarkable finding: BERT's layers form an implicit processing pipeline that mirrors the traditional NLP stack:
| Layer Range | What It Captures | NLP Analogue |
|---|---|---|
| Layers 0-2 | Surface features: word identity, position, basic syntax | POS tagging |
| Layers 3-5 | Syntactic structure: dependency relations, phrase boundaries | Parsing |
| Layers 6-8 | Semantic roles: who did what to whom | SRL |
| Layers 9-11 | Task-specific features: coreference, relations | NER, relation extraction |
This is striking because nobody taught BERT about POS tags, parse trees, or semantic roles. These representations emerge purely from the MLM objective — the model discovered that building these intermediate representations helps it predict masked words. It's a form of unsupervised feature learning where the features that emerge happen to align with what linguists identified manually over decades.
Individual attention heads in BERT learn specialized roles (Clark et al. 2019). Some notable patterns:
| Head Type | Behavior | Example |
|---|---|---|
| Positional heads | Attend to adjacent positions | Head 2-0: always attends to next token |
| Separator heads | Attend to [SEP] tokens | Head 6-3: focuses on [SEP] as a "no-op" |
| Syntactic heads | Track dependency relations | Head 8-10: direct object → verb attention |
| Coreference heads | Link pronouns to antecedents | Head 5-4: "he" → "John" attention |
BERT's representations become increasingly contextualized as you go deeper. In the first layer, the representation of "bank" is nearly identical regardless of context. By layer 12, the representation of "bank" in "river bank" is very different from "bank" in "bank account." This progressive contextualization is what makes BERT's representations so powerful — they capture not just what a word is but what it means in this specific context.
```python
# Probing BERT's layers: measuring contextual similarity
from transformers import BertModel, BertTokenizer
import torch

model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

sent1 = "I went to the bank to deposit money"
sent2 = "I sat on the bank of the river"

# Get all layer outputs
tok1 = tokenizer(sent1, return_tensors='pt')
tok2 = tokenizer(sent2, return_tensors='pt')
with torch.no_grad():
    out1 = model(**tok1).hidden_states  # tuple of 13 tensors (emb + 12 layers)
    out2 = model(**tok2).hidden_states

# "bank" is token index 5 in both ([CLS] is index 0)
for layer in [0, 3, 6, 9, 12]:
    v1 = out1[layer][0, 5]  # "bank" in sentence 1
    v2 = out2[layer][0, 5]  # "bank" in sentence 2
    sim = torch.nn.functional.cosine_similarity(v1, v2, dim=0)
    print(f"Layer {layer}: cosine similarity = {sim.item():.3f}")

# Layer 0 is the embedding output: same token at the same position, so the
# similarity is 1.000. Deeper layers mix in context, so the two "bank"
# vectors are expected to drift apart; exact values depend on the checkpoint.
```
BERT didn't emerge in isolation — it built on years of work in transfer learning and language modeling, and it spawned an entire family of successors that addressed its limitations.
| Predecessor | Contribution to BERT |
|---|---|
| Word2Vec (2013) | Showed that pre-training word representations on unlabeled text transfers to downstream tasks |
| GloVe (2014) | Global matrix factorization for word vectors — but still static (one vector per word) |
| Transformer (2017) | The encoder architecture BERT uses directly — multi-head self-attention without recurrence |
| ELMo (2018) | First contextualized word representations — but shallow bidirectionality (two separate LSTMs) |
| GPT-1 (2018) | Transformer-based pre-training + fine-tuning — but unidirectional (left-to-right only) |
| ULMFiT (2018) | Demonstrated that fine-tuning pre-trained LMs works well — BERT scaled this idea up |
| Successor | Key Improvement | Year |
|---|---|---|
| RoBERTa | More data (160GB), no NSP, dynamic masking, larger batches, longer training | 2019 |
| ALBERT | Parameter sharing across layers, factorized embedding, SOP replaces NSP | 2020 |
| SpanBERT | Masks contiguous spans instead of random tokens, no NSP | 2020 |
| DistilBERT | 6-layer distilled version, 97% of BERT's performance at 60% the size | 2019 |
| DeBERTa | Disentangled attention (separate content and position), enhanced mask decoder | 2021 |
| XLNet | Permutation language modeling — bidirectional without masking | 2019 |
| ELECTRA | Replaced token detection instead of masked prediction — trains on all tokens | 2020 |
BERT and GPT represent two fundamentally different approaches to language AI, and the field eventually chose GPT's path:
| Dimension | BERT (encoder) | GPT (decoder) |
|---|---|---|
| Direction | Bidirectional | Left-to-right |
| Pre-training | MLM (predict masked tokens) | LM (predict next token) |
| Adaptation | Fine-tune for each task | In-context learning / prompting |
| Generation | Cannot generate text naturally | Excellent at generation |
| Scaling | Saturated around 340M-1B | Scales to 100B+ with consistent gains |
| Paradigm | One model per task | One model for all tasks |
Even though GPT-style models dominate today, BERT's contributions remain foundational.
BERT-based models remain the default choice for production NLP tasks that don't require generation: search ranking (Google used BERT in its search engine starting 2019), sentiment analysis, entity extraction, text classification, and semantic similarity. When you need fast, accurate understanding of fixed text — not open-ended generation — BERT is still hard to beat.
"BERT is deeply bidirectional, GPT is unidirectional. There are advantages to both, but the big advantage of BERT is that it can be used for understanding tasks more effectively."
— Jacob Devlin