CS224N Lecture 7 — Pre-training at Scale

Chapter 0: Why Pre-train?

Training from scratch for every task is like rebuilding a car engine every time you drive somewhere new. Want to classify movie reviews? Train a Transformer on 10,000 labeled reviews from random initialization. Want to detect spam? Start from scratch again, another 10,000 labels. Want to answer questions? Throw away everything you learned and start over. Each model sees only its tiny labeled dataset and knows nothing about language in general.

This is wildly inefficient. A child who learns to read doesn't re-learn the alphabet for every new book. They build a general understanding of language — grammar, vocabulary, common sense — and then apply it to any new text. Can we give neural networks the same head start?

The answer is pre-training: train one massive model on billions of words of text, learning the statistical structure of language itself. This model doesn't solve any particular task — it learns representations that are useful for ALL tasks. Then, for each specific task, you take this pre-trained model and fine-tune it on a small labeled dataset. The model already knows what words mean, how grammar works, and what the world looks like through text. It just needs a nudge to solve your particular problem.

The economics are dramatic. Pre-training costs millions of dollars and takes months on thousands of GPUs. But you only do it once. Fine-tuning costs a few hundred dollars and takes hours on a single GPU. One expensive foundation supports hundreds of cheap adaptations.

The Pre-train + Fine-tune Paradigm

Before pre-training, every NLP task trained from random initialization. The model started knowing nothing — every weight was a random number. All knowledge had to come from the task-specific labeled data. If you only had 5,000 labeled examples (common in NLP), the model could barely learn the basics before overfitting.

Pre-training flips this. The model starts with knowledge of language, learned from unlabeled text (which is essentially free — the internet produces trillions of words). Fine-tuning just steers this knowledge toward a specific task. Even with 100 labeled examples, a pre-trained model can perform well because it already understands language. The labeled data just teaches it the task format.

The simulation below shows this difference. On the left, N tasks each train from random initialization: slow convergence, with each model learning in isolation. On the right, one model is pre-trained on massive data, then quickly fine-tuned into N specialized models.

[Interactive figure: Random Init vs. Pre-train + Fine-tune. Left: each task trains from scratch. Right: one pre-trained model forks into fine-tuned heads.]

ImageNet did this for vision in 2012. Pre-train a CNN on 1.2 million labeled images, then fine-tune for any vision task. NLP waited until 2018 because language is harder to self-supervise — images have natural spatial structure, but what's the "label" for a sentence? The breakthrough was realizing that predicting missing or next words IS the supervision signal. No labels needed.

Why 2018? What Changed?

Three things converged in 2018 to make pre-training work for NLP:

Factor	Before 2018	After 2018
Architecture	LSTMs (sequential, slow)	Transformers (parallel, scalable)
Compute	Single GPU, days	TPU pods / GPU clusters, weeks
Self-supervision	Word2Vec (static, shallow)	MLM / autoregressive (contextual, deep)

The Transformer (2017) made it possible to train on massive data efficiently. TPU pods at Google made it affordable (barely). And the key insight — that predicting masked or next words provides a rich supervisory signal — made it work. ELMo (Feb 2018), GPT (Jun 2018), and BERT (Oct 2018) all arrived within months of each other. The race was on.

What this lesson covers: How pre-training works from static embeddings to BERT to GPT to modern LLMs. Masking strategies, autoregressive generation, scaling laws, data pipelines, and the full training loop. By the end, you'll understand how foundation models are built from raw text to working system.

Why is pre-training more efficient than training from scratch for each task?

Pre-trained models have fewer parameters
The expensive training on massive data happens once; fine-tuning is cheap
Pre-training doesn't require GPUs

Chapter 1: Static to Contextual

"Bank" means different things in "river bank" and "bank account" — but Word2Vec gives them the same vector. One fixed point in embedding space, regardless of context. This is called the polysemy problem: many words have multiple meanings, and a single static vector can't capture them all.

Word2Vec and GloVe were revolutionary in 2013-2014. They proved that you could learn useful word representations from unlabeled text. But they had a fundamental limitation: each word gets exactly ONE vector, computed once and frozen. The vector for "bank" is the average of all its uses — part financial institution, part riverbed, part pool-shot cushion. It's a compromise that doesn't fully represent any of them.

The Problem with Static Embeddings

Consider these sentences:

Sentence A

"I deposited money at the bank."

↓

Sentence B

"We sat on the bank of the river."

↓

Word2Vec

Both "bank" tokens → identical vector [0.3, -0.1, 0.7, ...]

A model using Word2Vec embeddings sees identical inputs for these two completely different meanings. It must rely entirely on the downstream task model (an LSTM, a classifier) to figure out which meaning is intended. That's asking a lot of a small task-specific model.

Contextual Embeddings: The Solution

What if the embedding itself depended on the surrounding words? Instead of one fixed vector per word type, we compute a different vector for each word token (each occurrence in context). "Bank" next to "money" and "deposited" should produce a finance-flavored vector. "Bank" next to "river" and "sat" should produce a geography-flavored vector.

This is what contextual embeddings do. A deep model (LSTM in ELMo, Transformer in BERT/GPT) reads the entire sentence and produces a representation for each token that depends on every other token. The same word gets different vectors in different contexts.

h_i = f(x₁, x₂, ..., x_n; i)

The representation h_i for position i depends on ALL input tokens x₁ through x_n, not just x_i alone. This is a function of the entire sequence, evaluated at position i.

ELMo: The First Contextual Embeddings (Feb 2018)

ELMo (Embeddings from Language Models) was the first widely successful contextual embedding method. It trained a bidirectional LSTM language model on 1 billion words, then used the internal hidden states as features. The key insight: different layers of the LSTM capture different types of information.

Layer	What it captures	Useful for
Layer 0 (char CNN)	Morphology, spelling	POS tagging, NER
Layer 1 (LSTM)	Syntax, grammar	Parsing, chunking
Layer 2 (LSTM)	Semantics, meaning	Sentiment, QA

ELMo's approach was simple: run the pre-trained LSTM, extract hidden states from all layers, and let the downstream task learn a weighted combination. Different tasks weight different layers — syntax-heavy tasks prefer Layer 1, semantics-heavy tasks prefer Layer 2. This was the first proof that pre-trained representations transfer across tasks.
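The weighted combination of layers can be sketched in a few lines. This is a minimal illustration, assuming softmax-normalized per-layer weights and a single scale factor gamma learned by the downstream task; the hidden states and scores below are made-up numbers, not real ELMo outputs.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layer_states, layer_scores, gamma=1.0):
    """Weighted sum of per-layer vectors for one token.

    layer_states: list of L vectors (one per layer), all the same length.
    layer_scores: L unnormalized scores learned by the task (hypothetical here).
    gamma: scale factor learned by the task.
    """
    weights = softmax(layer_scores)
    dim = len(layer_states[0])
    mixed = [0.0] * dim
    for w, h in zip(weights, layer_states):
        for d in range(dim):
            mixed[d] += w * h[d]
    return [gamma * v for v in mixed]

# Made-up hidden states for one token from three layers (char CNN, LSTM 1, LSTM 2).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores: every layer contributes one third.
print(scalar_mix(states, [0.0, 0.0, 0.0]))
# A syntax-heavy task might learn a larger score for layer 1.
print(scalar_mix(states, [0.0, 3.0, 0.0]))
```

Only the scores and gamma are trained per task in this setup; the layer states themselves come from the frozen pre-trained model.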

The simulation below shows the polysemy problem and its solution. For a selected word, the static embedding appears as a fixed point and the contextual embeddings as different points that move with sentence context.

[Interactive figure: Static vs. Contextual Embeddings. The orange dot is the static (Word2Vec) embedding. Teal dots are contextual embeddings, different for each sentence.]

This is THE problem that motivated ELMo, BERT, and GPT. Static embeddings collapse all meanings of a word into one vector. Contextual embeddings give each occurrence its own vector shaped by surrounding words. This single insight — that representations should be context-dependent — triggered the pre-training revolution.

From Feature Extraction to Fine-tuning

ELMo used the pre-trained model as a feature extractor — freeze the LSTM weights, extract hidden states, feed them to a task-specific model. This works but limits how much the representations can adapt to the task.

BERT and GPT went further: fine-tune the entire model. Update all the pre-trained weights on the task-specific data. This lets the representations themselves adapt to the task, not just the task head on top. Fine-tuning consistently outperforms feature extraction because the model can reshape its internal representations to focus on what the task needs.

What is the key limitation of static embeddings like Word2Vec?

They require labeled data to train
They are too slow to compute
Each word gets one vector regardless of context, so polysemy is lost

Chapter 2: BERT — Masked Language Modeling

Read this sentence: "The cat sat on the ___." You use both sides — "The cat sat on the" and the period at the end — to guess the blank. You consider left AND right context simultaneously. That's exactly how BERT learns.

BERT (Bidirectional Encoder Representations from Transformers) was published by Google in October 2018 and set new state-of-the-art results on eleven NLP tasks. Its key innovation: Masked Language Modeling (MLM). Take a sentence, randomly mask 15% of the tokens, and train the model to predict them from the surrounding context, both left and right.

How Masked Language Modeling Works

Given an input sentence, BERT's training procedure is:

Step 1: Select

Randomly choose 15% of tokens for prediction

↓

Step 2: Corrupt

80% → [MASK], 10% → random token, 10% → keep original

↓

Step 3: Encode

Run full bidirectional Transformer over corrupted sequence

↓

Step 4: Predict

At masked positions, predict original token via cross-entropy loss

The 80/10/10 split is not arbitrary. It solves a subtle problem.

Why 80/10/10? The Train-Test Mismatch

If BERT always replaced selected tokens with [MASK], it would learn to only pay attention when it sees [MASK]. But during fine-tuning, there are no [MASK] tokens — the model sees real text. This train-test mismatch would degrade performance.

The solution: 80% of the time, replace with [MASK] (the model must use context). 10% of the time, replace with a random word (the model must detect errors). 10% of the time, keep the original word (the model must learn that correct words are also possible answers). This way, the model doesn't rely on the special [MASK] token and learns robust representations even for unmasked positions.
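The select-then-corrupt procedure can be sketched as follows. This is a simplified illustration, not BERT's actual data pipeline: it selects each token independently with probability 0.15 (an assumption made here for brevity), and the function name, vocabulary, and sentence are hypothetical.

```python
import random

MASK = "[MASK]"

def corrupt_for_mlm(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, targets) for masked language modeling.

    targets[i] is the original token at selected positions and None elsewhere,
    so the loss is computed only where targets[i] is not None.
    """
    corrupted = list(tokens)
    targets = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # not selected: left untouched, no loss at this position
        targets[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, targets

rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
sentence = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, targets = corrupt_for_mlm(sentence, vocab, rng)
print(corrupted)
print(targets)
```

Note that the random replacement can occasionally draw the original token again; this sketch does not guard against that.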

L_MLM = − ∑_{i ∈ masked} log P(x_i | x_\i; θ)

The loss sums over only the masked positions. x_\i means "all tokens except position i" — bidirectional context. This is the key difference from GPT: BERT sees both left AND right context when predicting each masked token.

BERT's Second Objective: Next Sentence Prediction

BERT also trained on Next Sentence Prediction (NSP): given two sentences A and B, predict whether B actually follows A in the original text or is a random sentence. The [CLS] token's output is used for this binary classification. Later work (RoBERTa) found that dropping NSP matches or slightly improves downstream performance, and removed it. The key pre-training signal is MLM alone.

BERT Architecture Details

Variant	Layers	Hidden	Heads	Params
BERT-Base	12	768	12	110M
BERT-Large	24	1024	16	340M

BERT uses only the encoder half of the Transformer. No causal mask, no decoder. Every token attends to every other token in both directions. This makes BERT excellent at understanding tasks (classification, NER, QA) but incapable of generation — it can't produce text left to right because it was never trained that way.

The simulation below walks through MLM step by step. A sentence is shown, tokens are masked with the 80/10/10 strategy, bidirectional attention flows in both directions, and the model predicts the original tokens.

[Interactive figure: Masked Language Modeling Step-Through. Starting from the original sentence, each step shows tokens being masked, attention flowing bidirectionally, and predictions appearing.]

80/10/10 reduces train-test mismatch. If BERT always used [MASK], it would only ever be asked to predict at [MASK] positions, yet the real text seen during fine-tuning contains no [MASK] tokens at all. The 10% random + 10% unchanged cases force the model to build good representations for ALL positions, not just masked ones.

Fine-tuning BERT

Fine-tuning BERT is beautifully simple. Take the pre-trained model, add a small task-specific head on top, and train end-to-end on labeled data:

Task	Input	Output from	Head
Classification	[CLS] sentence [SEP]	[CLS] vector	Linear → softmax
NER	[CLS] sentence [SEP]	All token vectors	Linear per token
QA (SQuAD)	[CLS] question [SEP] passage [SEP]	Each passage token	Predict start + end span
Sentence pair	[CLS] sent_A [SEP] sent_B [SEP]	[CLS] vector	Linear → softmax

The same pre-trained BERT model serves all these tasks. Only the tiny head changes. Fine-tuning takes 1-3 epochs on the task-specific data, typically 30 minutes on a single GPU. Pre-training BERT-Base took 4 days on 16 TPU chips (BERT-Large: 4 days on 64 chips). The asymmetry is staggering: 99.9% of the compute is shared across all tasks.

Why does BERT mask tokens using 80% [MASK], 10% random, 10% unchanged — instead of 100% [MASK]?

To prevent train-test mismatch: during fine-tuning there are no [MASK] tokens
To save computation by masking fewer tokens
To make training faster

Chapter 3: GPT — Autoregressive

Your phone suggests the next word as you type. "I'm going to the..." and it suggests "store," "gym," "doctor." That's an autoregressive language model — and it's exactly how GPT learns everything it knows.

GPT (Generative Pre-trained Transformer) was published by OpenAI in June 2018, four months before BERT. Its approach is simpler: predict the next token, left to right, one at a time. No masking, no corruption, no special tokens. Just read everything so far and predict what comes next.

Autoregressive Language Modeling

An autoregressive model factors the probability of a sequence as a product of conditional probabilities:

P(x₁, ..., x_n) = ∏_{t=1}^{n} P(x_t | x₁, ..., x_{t−1})

Each token is predicted from all PREVIOUS tokens. This is the chain rule of probability — mathematically exact, not an approximation. The training objective is to maximize the likelihood of the training data under this factorization:

L_AR = − ∑_{t=1}^{n} log P(x_t | x_{<t}; θ)

Compare this to BERT's objective: BERT predicts 15% of randomly masked tokens from bidirectional context. GPT predicts 100% of tokens from left-only context. GPT sees less context per prediction but trains on every single token — no waste.
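To make the factorization concrete, here is a small sketch that scores a sequence one token at a time. The probability table is hypothetical, and for brevity the toy model looks only at the last token of the prefix; a real GPT conditions on the whole prefix, but the summation is the same.

```python
import math

# Hypothetical toy model: P(next | prefix) depends only on the last token.
TOY_PROBS = {
    "<s>": {"the": 0.8, "cat": 0.1, "sat": 0.1},
    "the": {"the": 0.1, "cat": 0.7, "sat": 0.2},
    "cat": {"the": 0.1, "cat": 0.1, "sat": 0.8},
    "sat": {"the": 0.6, "cat": 0.2, "sat": 0.2},
}

def next_token_probs(prefix):
    """Return the toy distribution over the next token given a prefix."""
    return TOY_PROBS[prefix[-1]]

def sequence_log_prob(tokens):
    """log P(x_1..x_n) = sum over t of log P(x_t | x_<t)."""
    prefix = ["<s>"]
    total = 0.0
    for tok in tokens:
        total += math.log(next_token_probs(prefix)[tok])
        prefix.append(tok)
    return total

tokens = ["the", "cat", "sat"]
log_p = sequence_log_prob(tokens)
print(math.exp(log_p))  # joint probability, the product 0.8 * 0.7 * 0.8
print(-log_p)           # the autoregressive loss for this one sequence
```

Every token in the sequence contributes one term to the sum, which is the "100% of tokens" point made above.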

The Causal Mask

To enforce left-to-right prediction, GPT uses a causal attention mask. This is a triangular matrix that prevents each position from attending to future positions. Token 5 can attend to tokens 1-5 but not tokens 6, 7, 8, etc. The mask is applied by setting future attention scores to −∞ before softmax, which drives their weights to zero.

Attention(Q, K, V) = softmax(QK^T / √d_k + M) · V

Where M is the causal mask: M_ij = 0 if j ≤ i, and M_ij = −∞ if j > i. This single matrix is what separates GPT from BERT architecturally.
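A minimal sketch of the mask in action, assuming the raw attention scores (the QK^T / √d_k term) have already been computed; the all-zero scores below are placeholders chosen so the effect of the mask is easy to read.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """M[i][j] = 0 if j <= i, else -inf."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def softmax(row):
    """Stable softmax; exp(-inf) evaluates to 0.0, so masked entries vanish."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores):
    """Apply softmax(scores + M) row by row; scores is an n x n list."""
    n = len(scores)
    mask = causal_mask(n)
    return [softmax([scores[i][j] + mask[i][j] for j in range(n)])
            for i in range(n)]

scores = [[0.0] * 4 for _ in range(4)]  # placeholder scores
for row in masked_attention_weights(scores):
    print([round(w, 2) for w in row])
```

With uniform scores, row i spreads its weight evenly over positions 0 through i and gives exactly zero weight to every later position.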

Generation: The Payoff

Because GPT was trained to predict the next token, it can generate text by repeated sampling. Start with a prompt, predict the next token, append it, predict the next token after that, append again, and so on. Each new token is sampled from the probability distribution over the vocabulary.

Temperature controls the randomness of generation. Temperature T scales the logits before softmax:

P(x_t = w) = exp(z_w / T) / ∑_v exp(z_v / T)

T = 1.0 gives the model's learned distribution. T < 1.0 makes the distribution sharper (more deterministic, picks the most likely token). T > 1.0 makes it flatter (more random, more creative). T → 0 is greedy decoding (always pick the highest-probability token).
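The effect of temperature is easy to see numerically. The sketch below uses made-up logits for three candidate tokens; treating T = 0 as an argmax is a convention assumed here, since the formula itself is undefined at T = 0.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """P(w) = exp(z_w / T) / sum_v exp(z_v / T), for T > 0."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature, rng):
    """Pick a token index; T == 0 is handled as greedy decoding (argmax)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])

rng = random.Random(0)
print([sample_next(logits, 1.0, rng) for _ in range(10)])
```

Lower temperatures push more probability onto the top-scoring token; higher temperatures move the distribution toward uniform.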

The simulation below shows autoregressive generation with a causal mask. It steps through generation token by token, with each new token attending only to previous tokens, and a temperature control shows how randomness affects generation.

[Interactive figure: Autoregressive Generation with Causal Mask. The causal mask (triangle) shows which positions each token can attend to. Temperature control, default 1.0.]

GPT sees only left context — seems like a handicap. But it means GPT can generate text. BERT cannot. BERT sees both directions, which is great for understanding — but you can't generate text when your model expects to see the future. Generation requires predicting one token at a time from left to right, which is exactly what the causal mask enforces.

GPT Timeline

Model	Year	Params	Data	Key Contribution
GPT-1	2018	117M	BookCorpus (7K books)	Pre-train + fine-tune works for NLP
GPT-2	2019	1.5B	WebText (40GB)	Zero-shot transfer, no fine-tuning needed
GPT-3	2020	175B	300B tokens	In-context learning, few-shot prompting
GPT-4	2023	~1.8T (est.)	~13T tokens (est.)	Multimodal, near-human reasoning

The architecture barely changed across this entire timeline. GPT-1 through GPT-3 share essentially the same design, and GPT-4 (whose architecture has not been published) is believed to follow it too, just dramatically larger. The same autoregressive objective, the same causal mask, the same next-token prediction. Scale was the secret ingredient.

What does the causal attention mask do in GPT?

It masks out padding tokens at the end of the sequence
It prevents each token from attending to future tokens, enforcing left-to-right prediction
It randomly masks 15% of tokens for prediction

Chapter 4: BERT vs GPT

It's 2019. You need an NLP model. BERT or GPT? The answer depends entirely on what you DO with it. These two models represent fundamentally different design philosophies, and understanding the tradeoff is essential to understanding modern AI.

Encoder vs. Decoder

BERT is an encoder. It reads the entire input at once, attends bidirectionally, and produces rich representations. It's designed to UNDERSTAND text — classify it, extract entities from it, answer questions about it. But it cannot generate new text because it was never trained to predict tokens left to right.

GPT is a decoder. It reads left to right with a causal mask and produces tokens one at a time. It's designed to GENERATE text — complete prompts, write stories, translate languages. It can also understand text, but less efficiently than BERT because it only sees left context.

Property	BERT (Encoder)	GPT (Decoder)
Attention	Bidirectional (sees all tokens)	Causal (sees only left)
Training objective	Predict masked tokens (15%)	Predict next token (100%)
Token utilization	15% (only masked positions)	100% (every token is a target)
Generation	Cannot generate	Natural generation
Understanding	Strong (both sides)	Weaker (left-only context)
Fine-tuning	Add task head	Add task head OR prompt
Dominant era	2019-2020	2020-present

Task Match

Different tasks naturally suit different architectures:

Classification (sentiment, spam, topic): BERT excels. It reads the full text bidirectionally, and the [CLS] token aggregates a holistic understanding. GPT can classify by generating the label word, but it's less efficient.

Named Entity Recognition (NER): BERT excels. Each token needs context from both sides to determine if "Washington" is a person, city, or state. GPT sees only left context, missing crucial right-side information.

Text generation (stories, code, dialogue): GPT only. BERT fundamentally cannot generate because it was trained with bidirectional attention. You can't produce tokens left to right when your model expects to see the future.

Question answering: Both work, differently. BERT reads the question and passage together with bidirectional attention and predicts answer span boundaries. GPT reads the question and passage, then generates the answer as text. BERT was initially better on extractive QA (SQuAD), but GPT-3+ became better on open-ended QA through scale.

The simulation below shows both architectures side by side, highlighting which model architecture is the natural fit for each task.

[Interactive figure: BERT vs. GPT: Architecture & Task Match. Attention patterns and data flow are shown for each architecture, per task.]

Modern LLMs converged on decoder-only. The debate is settled: GPT-4, Claude, Llama, Gemini are all decoder-only. Generation subsumes understanding — you can classify by generating the label ("positive" or "negative"), extract entities by generating them, and answer questions by generating the answer. A model that can generate can do anything. An encoder that can only understand is limited to structured prediction tasks.

The Encoder-Decoder Middle Ground

There's a third option: encoder-decoder models like T5 and BART. These have a bidirectional encoder (like BERT) that reads the input and a causal decoder (like GPT) that generates the output. The decoder cross-attends to the encoder's representations.

T5 (2020) unified ALL NLP tasks into a text-to-text format. Classification? Input: "classify: I love this movie." Output: "positive." Translation? Input: "translate English to French: Hello." Output: "Bonjour." Every task becomes sequence-to-sequence generation.

Encoder-decoder models were dominant in 2020-2021, but decoder-only models won the scaling race. The architectural simplicity of decoder-only — no encoder, no cross-attention, just one stack of causal Transformer blocks — made it easier to scale to hundreds of billions of parameters. Simplicity wins at scale.

Why did modern LLMs converge on decoder-only (GPT-style) rather than encoder-only (BERT-style)?

BERT is too slow for inference
Generation is the most general capability: a model that generates can also classify, extract, and answer
BERT requires more parameters

Chapter 5: Pre-training Data

Llama 3 trained on 15 trillion tokens. Where do you find 15 trillion tokens of quality text? You can't just point at the internet and press "download." Raw web crawl data is filthy — spam, porn, duplicates, boilerplate, HTML artifacts, malware. Turning raw crawl data into a high-quality training corpus is an engineering discipline in itself, and arguably as important as the model architecture.

The Data Pipeline

Every modern LLM follows roughly the same data pipeline. Each stage filters out more noise, and the numbers are striking:

1. Raw Crawl

Common Crawl: ~250B pages, ~100TB of text. Massive but noisy.

↓ filter

2. Language Filter

Keep target languages (typically English-heavy). fastText classifier. Removes ~40% of pages.

↓ filter

3. Quality Filter

Perplexity filter (remove gibberish), heuristic rules (line length, symbol ratio, word repetition). Removes ~60% of remaining.

↓ filter

4. Deduplication

MinHash / exact n-gram dedup. Removes ~30-50% of surviving text.

↓ filter

5. PII Removal

Remove emails, phone numbers, SSNs, addresses. Regex + classifier.

↓ mix

6. Domain Mixing

Upsample high-quality domains (Wikipedia, books, code). Downsample web junk.

The funnel is brutal. Starting from 100TB of raw text, you might end up with 5-15TB of cleaned, deduplicated, mixed training data. Over 80% is thrown away.
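The funnel arithmetic can be checked directly from the removal rates quoted in the pipeline above. This sketch covers only the three stages that state a removal fraction (language filter, quality filter, deduplication); PII removal and domain mixing change the final size further, so the result only roughly lines up with the 5-15TB figure.

```python
# Removal fractions quoted in the pipeline above; dedup is given as a range.
RAW_TB = 100.0
STAGES = [
    # (stage name, minimum fraction removed, maximum fraction removed)
    ("language filter", 0.40, 0.40),
    ("quality filter", 0.60, 0.60),
    ("deduplication", 0.30, 0.50),
]

def surviving_range(raw_tb, stages):
    """Return (low, high) surviving terabytes after applying each stage."""
    low = high = raw_tb
    for _name, remove_min, remove_max in stages:
        high *= 1.0 - remove_min  # least aggressive removal
        low *= 1.0 - remove_max   # most aggressive removal
    return low, high

low, high = surviving_range(RAW_TB, STAGES)
print(f"{low:.1f} to {high:.1f} TB survive the three filtering stages")
print(f"{100 - high:.0f}% to {100 - low:.0f}% of the raw text is discarded")
```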

Why Deduplication Matters

This deserves special attention because it's counterintuitive. Doesn't more data always help? No. Duplicate data causes two problems:

Memorization. If the model sees the same Wikipedia paragraph 100 times, it memorizes it verbatim instead of learning generalizable patterns. This inflates performance on benchmarks that overlap with training data (contamination) and wastes model capacity on rote memorization.

Distribution skew. Duplicates over-represent certain topics, writing styles, and domains. If 30% of your training data is near-duplicate boilerplate (cookie banners, privacy policies, navigation menus), the model allocates 30% of its capacity to modeling boilerplate. That's 30% less capacity for actual language understanding.

FineWeb (Hugging Face, 2024) found that 30%+ of Common Crawl is near-duplicate text. Their FineWeb-Edu dataset, aggressively filtered for educational content and deduplicated, trained better models than datasets 10x larger. Quality beats quantity.
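Near-duplicate detection with MinHash can be sketched with the standard library alone. This is an illustrative toy, not the FineWeb implementation: the shingle size, the number of hash functions, and the example documents are all assumptions, and production systems typically add locality-sensitive hashing so that not every pair of documents has to be compared.

```python
import hashlib

def shingles(text, n=3):
    """Set of lowercase word n-grams in a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def _hash(seed, shingle):
    """64-bit hash of a shingle under one of the seeded hash functions."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "we use cookies to improve your experience on this site please accept"
doc_b = "we use cookies to improve your experience on our site please accept"
doc_c = "the chinchilla paper changed how labs allocate training compute budgets"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b))  # near-duplicate boilerplate
print(estimated_jaccard(sig_a, sig_c))  # unrelated text
```

Documents whose estimated similarity exceeds a chosen threshold would be clustered, and all but one copy per cluster dropped.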

Domain Composition

The mix of data domains dramatically affects what the model learns. Here's a typical composition for a modern LLM:

Domain	% of mix	Why this ratio
Web pages (cleaned)	~50%	Broad coverage of topics, styles, knowledge
Code (GitHub)	~15%	Reasoning ability, structured thinking, code generation
Books	~10%	Long-form coherence, deep knowledge, narrative
Academic papers	~8%	Technical accuracy, citations, formal reasoning
Wikipedia	~5%	Factual knowledge, neutral tone (upsampled 3-5x)
Conversations (forums)	~5%	Dialogue capability, informal language
Math/STEM data	~5%	Mathematical reasoning, problem solving
Other (news, legal, etc.)	~2%	Domain coverage

Notice that Wikipedia is only ~5% of the mix despite being very high quality. That's because there's only ~4B words in English Wikipedia. You can't make 5% of 15T tokens from 4B words without repeating it many times — and we just said repetition is bad. So Wikipedia is upsampled a modest 3-5x, not more.

The simulation below shows the full data pipeline as an interactive funnel, with the number of tokens surviving each stage. The pie chart shows the final domain composition.

[Interactive figure: Data Pipeline Funnel. Left panel: token counts and filtering ratios per pipeline stage. Right panel: domain composition of the final mix.]

Deduplication is not optional. FineWeb found 30%+ of Common Crawl is near-duplicate. Training on duplicated data wastes compute on memorization instead of generalization. The best models are trained on aggressively deduplicated, quality-filtered data — not the biggest possible dump of everything.

Data Scaling: The Practical Limits

Where does the data come from for 15T tokens? At roughly 0.75 English words per token (about 1.3 tokens per word), 15T tokens is about 11 trillion words. The entire English Wikipedia is ~4 billion words. All English books ever written are estimated at ~130 billion words. The entirety of Reddit is ~150 billion words. To reach 15T tokens, you MUST include web crawl data: there simply isn't enough high-quality curated text.

This creates a tension: you need web data for volume, but web data is lower quality. The art of pre-training data engineering is managing this tradeoff — using enough web data for scale while filtering aggressively enough to maintain quality. Llama 3's solution: process 100x more raw data than you use and keep only the best 1-5%.

Why is deduplication critical for pre-training data?

Duplicates cause memorization and skew the topic distribution, wasting model capacity
Duplicates make the dataset too large to download
Duplicates slow down the tokenizer

Chapter 6: Scaling Laws

You have a fixed GPU budget. Should you train a big model for fewer steps, or a small model for more steps? Before 2020, this was guesswork. Then Kaplan et al. at OpenAI discovered that language model performance follows predictable power laws in three variables: compute (C), parameters (N), and data (D).

The Power Law

The cross-entropy loss L of a language model scales as a power law with each variable, holding the others fixed:

L(N) ∼ (N_c / N)^{α_N},  L(D) ∼ (D_c / D)^{α_D},  L(C) ∼ (C_c / C)^{α_C}

On a log-log plot, these are straight lines. Double the compute, lose a fixed fraction of loss. Double it again, lose the same fraction. This is remarkable: it means we can predict how well a model will perform before training it, just from its size and data budget.

The exponents are approximately: α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.057 (Kaplan 2020). These are small numbers, meaning you need a LOT more of each resource for meaningful improvement. 10x more compute yields only about 12% lower loss.
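The 12% figure follows directly from the power law: multiplying a resource by k multiplies the loss by k^(−α). A short check, using the exponents quoted above:

```python
ALPHA_N, ALPHA_D, ALPHA_C = 0.076, 0.095, 0.057  # exponents quoted above

def loss_ratio(scale_factor, alpha):
    """L(k * X) / L(X) under the power law L(X) ~ (X_c / X)^alpha."""
    return scale_factor ** (-alpha)

for name, alpha in (("params", ALPHA_N), ("data", ALPHA_D), ("compute", ALPHA_C)):
    reduction = 1.0 - loss_ratio(10, alpha)
    print(f"10x {name}: loss falls by {reduction:.1%}")
```

The constants N_c, D_c, and C_c cancel in the ratio, which is why the relative improvement depends only on the exponent and the scale factor.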

Chinchilla: The Optimal Allocation

Given a fixed compute budget C, how should you split it between model size N and data D? Training a model with N parameters on D tokens costs approximately C ≈ 6ND FLOPs (6 FLOPs per token per parameter: about 2 for the forward pass and 4 for the backward pass).

Kaplan et al. (2020) suggested training large models on relatively little data. Their recommendation: scale N faster than D. GPT-3 followed this advice — 175B parameters trained on only 300B tokens.

Then Hoffmann et al. (2022) at DeepMind trained the Chinchilla model and showed Kaplan was wrong. The optimal allocation is to scale N and D proportionally:

N_opt ∝ C^0.5, D_opt ∝ C^0.5

In plain English: N and D should grow at the same rate, each as the square root of compute. Quadrupling compute doubles BOTH the model size AND the data; doubling compute scales each by about 1.4x. The optimal ratio is approximately 20 tokens per parameter. A 10B-parameter model should be trained on 200B tokens. A 70B model on 1.4T tokens.
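Combining C ≈ 6ND with the 20-tokens-per-parameter rule gives a closed form: N = sqrt(C / 120) and D = 20N. The sketch below is a back-of-the-envelope calculator under exactly those two assumptions; the function names are illustrative.

```python
import math

TOKENS_PER_PARAM = 20.0       # rule of thumb quoted above
FLOPS_PER_PARAM_TOKEN = 6.0   # from C ≈ 6 * N * D

def training_flops(n_params, n_tokens):
    """Approximate training cost in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def chinchilla_optimal(compute_flops):
    """Solve C = 6 * N * D with D = 20 * N for N and D."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

# Budget implied by a 70B-parameter model trained on 1.4T tokens.
budget = training_flops(70e9, 1.4e12)
n_opt, d_opt = chinchilla_optimal(budget)
print(f"C = {budget:.2e} FLOPs -> N = {n_opt:.2e} params, D = {d_opt:.2e} tokens")

# Because N and D each scale as C^0.5, a 4x budget doubles both.
n4, d4 = chinchilla_optimal(4 * budget)
print(n4 / n_opt, d4 / d_opt)
```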

GPT-3 Was 10x Undertrained

By Chinchilla's formula, GPT-3's 175B parameters should have been trained on ~3.5 trillion tokens. It was trained on only 300B — more than 10x undertrained. Chinchilla (70B params, 1.4T tokens) matched GPT-3's performance with less than half the parameters because it used its compute budget more efficiently.

Model	N (params)	D (tokens)	Tokens/param	Chinchilla-optimal?
GPT-3	175B	300B	1.7	No (10x undertrained)
Chinchilla	70B	1.4T	20	Yes (defined it)
Llama 1 (65B)	65B	1.4T	21.5	~Yes
Llama 2 (70B)	70B	2T	28.6	Overtrained (intentionally)
Llama 3 (70B)	70B	15T	214	Massively overtrained

Wait — Llama 3 trains 70B parameters on 15T tokens, 10x past Chinchilla-optimal. Why? Because Chinchilla optimality minimizes training compute. But Meta cares about inference compute. A smaller model trained longer is cheaper to deploy than a larger model trained optimally. Llama 3 intentionally overtrains to get a smaller, faster model that performs as well as a Chinchilla-optimal larger one.

GPT-3: 300B tokens for 175B params. Chinchilla optimal: ~3.5T tokens. GPT-3 was 10x undertrained. Chinchilla (70B) matched GPT-3's performance because it allocated compute efficiently: proportional scaling of parameters and data. This single paper changed how every lab trains models.

The simulation below lets you explore scaling laws interactively. As the compute budget changes, Chinchilla-optimal model size and data grow in lockstep. Actual models are plotted as reference points.

[Interactive figure: Scaling Law Explorer. Curves show how optimal model size (N) and data (D) scale with compute; dots show real models. Control: log compute (FLOPs), default 10^23.]

Emergent Abilities

Scaling laws predict smooth improvement in loss. But some capabilities appear to emerge suddenly at certain scales. Chain-of-thought reasoning, multi-step arithmetic, and instruction following seem absent in small models and present in large ones, with sharp transitions. Whether these are truly "emergent" (unpredictable) or just artifacts of how we measure (choosing metrics that show sharp transitions) is an active debate.

What's not debated: bigger models with more data consistently do better. The power law has held across 7+ orders of magnitude of compute, from 10¹⁸ to 10²⁵ FLOPs. No ceiling has been observed yet.

What was wrong with GPT-3's compute allocation according to Chinchilla scaling laws?

GPT-3 had too many layers
GPT-3 used the wrong optimizer
GPT-3 was ~10x undertrained: too many parameters, too few tokens

Chapter 7: Training Pipeline

Let's train a language model. Not conceptually — actually step through every operation that happens on every GPU, from raw text to weight update. Understanding this pipeline is the difference between reading about swimming and getting in the water.

Step 1: Tokenization

Raw text enters the pipeline as Unicode strings. The tokenizer converts these into integer token IDs. Modern LLMs use BPE (Byte Pair Encoding) or SentencePiece, which learn a vocabulary of subword units from the training data. Common words become single tokens ("the" → 1820). Rare words are split into pieces ("unfathomable" → ["un", "fath", "om", "able"]).
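The splitting of rare words into pieces can be illustrated with a toy tokenizer. Note that this sketch uses greedy longest-match lookup against a fixed vocabulary, which is a simpler stand-in and not the BPE merge procedure itself; the vocabulary, the token IDs, and the function names are all hypothetical.

```python
# Hypothetical vocabulary and IDs; real vocabularies hold 32K-128K entries.
VOCAB = {"the": 0, "un": 1, "fath": 2, "om": 3, "able": 4, "<unk>": 5}
UNK_ID = VOCAB["<unk>"]

def greedy_subword_tokenize(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right.

    Falls back to a single character when no longer piece matches.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces

def encode(word, vocab):
    """Map a word to integer token IDs, using UNK_ID for unknown pieces."""
    return [vocab.get(piece, UNK_ID) for piece in greedy_subword_tokenize(word, vocab)]

print(greedy_subword_tokenize("the", VOCAB), encode("the", VOCAB))
print(greedy_subword_tokenize("unfathomable", VOCAB), encode("unfathomable", VOCAB))
```

A common word maps to a single ID, while the rare word is covered by four shorter pieces that each appear in the vocabulary.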

The vocabulary size V is typically 32K-128K. Llama 3 uses a 128K vocab (up from 32K in Llama 2) because a larger vocab means fewer tokens per document, which means faster training and longer effective context.

Step 2: Batching

Tokenized documents are concatenated into a single long stream, then sliced into fixed-length sequences of T tokens (the context length). These sequences are grouped into batches of B sequences. Each training step processes a batch of shape [B, T].

Typical values: B = 1024-4096 sequences, T = 2048-8192 tokens. That's 2-32 million tokens per batch. With 15T tokens total, Llama 3's training takes roughly 500K-1M gradient steps.
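The concatenate-then-slice step and the step-count estimate can be sketched as follows. The helper name, the end-of-document token ID 0, and the choice to drop a trailing partial batch are assumptions made for illustration.

```python
def make_batches(token_stream, batch_size, seq_len):
    """Slice one long token stream into batches of shape [B, T].

    Trailing tokens that do not fill a complete batch are dropped, which is one
    common convention (an assumption here, not the only option).
    """
    tokens_per_batch = batch_size * seq_len
    n_batches = len(token_stream) // tokens_per_batch
    batches = []
    for b in range(n_batches):
        start = b * tokens_per_batch
        batch = [token_stream[start + i * seq_len : start + (i + 1) * seq_len]
                 for i in range(batch_size)]
        batches.append(batch)
    return batches

# Toy stream: three "documents" concatenated, with 0 as a hypothetical
# end-of-document token ID.
docs = [[5, 6, 7], [8, 9, 10, 11, 12], [13, 14, 15, 16]]
stream = [tok for doc in docs for tok in doc + [0]]
batches = make_batches(stream, batch_size=2, seq_len=3)
print(batches)

# Step-count arithmetic: total tokens divided by tokens per step.
# B = T = 4096 is one point inside the ranges quoted above.
steps = 15e12 / (4096 * 4096)
print(round(steps))
```

Note that sequences can start mid-document and span document boundaries; that is a direct consequence of slicing a single concatenated stream.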

Step 3: Embedding

Each token ID is looked up in an embedding table of shape [V, D], where D is the model dimension (8192 for Llama 3 70B). The batch of token IDs [B, T] becomes a tensor of embeddings [B, T, D]. Each token is now a D-dimensional vector that the Transformer can process.
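An embedding lookup is just row indexing. A minimal sketch with toy sizes follows; the Gaussian initialization with standard deviation 0.02 is an assumption for illustration, not a value taken from this lecture.

```python
import random

def embed(token_ids, table):
    """Map token IDs of shape [B, T] to vectors of shape [B, T, D] by row lookup."""
    return [[table[tok] for tok in seq] for seq in token_ids]

random.seed(0)
V, D = 10, 4   # toy sizes; real models use V around 128K and D in the thousands
table = [[random.gauss(0.0, 0.02) for _ in range(D)] for _ in range(V)]   # [V, D]

token_ids = [[1, 3, 3], [7, 0, 2]]   # [B=2, T=3]
x = embed(token_ids, table)          # [B=2, T=3, D=4]
print(len(x), len(x[0]), len(x[0][0]))
```

Every occurrence of the same token ID maps to the same vector at this stage; context only enters later, through attention.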

Step 4: Transformer Forward Pass

The embedding tensor [B, T, D] passes through L Transformer blocks, each applying the following in order:

RMSNorm: [B, T, D] → [B, T, D] (normalize activations)
Causal Self-Attention: [B, T, D] → Q, K, V → attention → [B, T, D], then add the residual
RMSNorm: [B, T, D] → [B, T, D]
SwiGLU FFN: [B, T, D] → [B, T, D_ff] → [B, T, D], then add the residual

Here D_ff is the FFN hidden dimension (28672, or 3.5 · D, for Llama 3 70B).

After L blocks (80 for Llama 3 70B), the output tensor is still [B, T, D].
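The part of the block that makes next-token training possible is the causal mask: position t may only look at positions 0..t. Below is a minimal single-head sketch. It assumes identity Q/K/V projections (Q = K = V = x) to stay short; real blocks use learned projection matrices, many heads, and the surrounding norm and residual steps.

```python
import math

def causal_self_attention(x):
    """Single-head causal attention over x of shape [T, D].

    Simplification (assumption): Q = K = V = x, i.e. identity projections.
    """
    T, D = len(x), len(x[0])
    out = []
    for t in range(T):
        # Scaled dot-product scores against positions 0..t only (the causal mask).
        scores = [sum(x[t][d] * x[j][d] for d in range(D)) / math.sqrt(D)
                  for j in range(t + 1)]
        top = max(scores)                               # for numerical stability
        weights = [math.exp(sc - top) for sc in scores]
        norm = sum(weights)
        out.append([sum(weights[j] / norm * x[j][d] for j in range(t + 1))
                    for d in range(D)])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = causal_self_attention(x)

# Changing the last token must not change the outputs at earlier positions.
x_changed = x[:2] + [[-3.0, 5.0]]
y_changed = causal_self_attention(x_changed)
print(y[:2] == y_changed[:2])
```

Because earlier outputs never depend on later tokens, all T next-token predictions in a sequence can be trained in one forward pass without leaking the answer.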

Step 5: Language Model Head

A final linear projection maps [B, T, D] → [B, T, V], producing logits (raw scores) over the vocabulary for each position. These are NOT probabilities yet: they are unnormalized log-probabilities, and a softmax over the V entries turns them into a distribution.
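For reference, the softmax that converts one position's logits into probabilities looks like this. It is a small sketch on a toy vocabulary of four tokens.

```python
import math

def softmax(logits):
    """Convert a vector of logits [V] into probabilities that sum to 1."""
    top = max(logits)                         # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1, -1.0]               # toy vocabulary of size V = 4
probs = softmax(logits)
print([round(p, 3) for p in probs], sum(probs))
```

Adding the same constant to every logit leaves the probabilities unchanged, which is why logits are only meaningful relative to each other.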

Step 6: Loss Computation

This is the crucial step that makes autoregressive training work. The targets are the SHIFTED input: for each position t, the target is the token at position t+1. If the input is ["The", "cat", "sat"], the targets are ["cat", "sat", next], where next is whatever token follows "sat" in the stream. We compute cross-entropy loss between the logits at position t and the target token at position t+1:

L = −(1/T) ∑_{t=0}^{T−1} log P(x_{t+1} | x_{≤t})

This is averaged over all positions in the batch. The loss measures how surprised the model is by the next token. A perfect model would have loss 0 (always predicts the correct next token with probability 1). In practice, natural language has inherent entropy: the best achievable loss is estimated at around 1.0-1.5 nats per token (the exact floor depends on the tokenizer and the data) because language is genuinely unpredictable.

Loss on SHIFTED targets: predict token t+1 from tokens 0..t. This single detail is why autoregressive models generate. During training, every position learns to predict the next position. During inference, we sample from the last position's predictions to extend the sequence. The training objective IS the generation mechanism.
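The shifted-target loss can be written out directly. This is a sketch for a single sequence. It assumes the sequence holds T input tokens and skips the last position because its target lies outside the sequence; pipelines that slice T + 1 tokens per sequence avoid that and match the sum over T terms in the formula.

```python
import math

def next_token_loss(logits, token_ids):
    """Mean cross-entropy (in nats) for next-token prediction on one sequence.

    logits:    [T, V] scores, where logits[t] is the prediction made at position t
    token_ids: [T] input tokens; the target for position t is token_ids[t + 1]
    """
    n = len(token_ids) - 1        # the last position has no target in this sequence
    total = 0.0
    for t in range(n):
        row = logits[t]
        top = max(row)
        log_z = top + math.log(sum(math.exp(z - top) for z in row))   # log-sum-exp
        total += log_z - row[token_ids[t + 1]]    # -log P(x_{t+1} | x_{<=t})
    return total / n

V = 5
tokens = [3, 1, 4, 1]
uniform = [[0.0] * V for _ in tokens]             # a model that knows nothing
print(next_token_loss(uniform, tokens), math.log(V))
```

A model that spreads probability uniformly over the vocabulary pays exactly log V nats per token, which is where training loss starts before the model has learned anything.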

Step 7: Backward Pass

PyTorch's autograd computes gradients of the loss with respect to every parameter. For a 70B-parameter model, this means 70 billion gradient values, each computed by the chain rule through up to 80 Transformer blocks. The backward pass takes roughly 2x the compute of the forward pass, because each matrix multiply in the forward pass needs two in the backward pass: one for the gradient with respect to its input and one for the gradient with respect to its weights.
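The 2x ratio leads to a handy estimate of total training compute. The sketch below assumes about 2 FLOPs per parameter per token in the forward pass (one multiply and one add) and ignores the attention-score FLOPs, so treat the result as an order-of-magnitude figure.

```python
def training_flops_per_token(n_params):
    """Approximate FLOPs per token, ignoring attention-score FLOPs.

    Forward: about 2 FLOPs per parameter (one multiply and one add).
    Backward: about twice the forward pass (input gradient plus weight gradient).
    """
    forward = 2 * n_params
    backward = 2 * forward
    return {"forward": forward, "backward": backward, "total": forward + backward}

per_token = training_flops_per_token(70e9)
total = per_token["total"] * 15e12          # 70B parameters, 15T tokens
print(f"{per_token['total']:.2e} FLOPs/token, {total:.2e} FLOPs overall")
```

This is the origin of the familiar 6 · N · D rule: 2 · N for the forward pass plus 4 · N for the backward pass, per token.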

Step 8: Optimizer Update (AdamW)

Modern LLMs use AdamW, which maintains two running statistics for each parameter:

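m_t = β₁ m_{t−1} + (1 − β₁) g_t (first moment / mean)

v_t = β₂ v_{t−1} + (1 − β₂) g_t² (second moment / uncentered variance)

m̂_t = m_t / (1 − β₁^t), v̂_t = v_t / (1 − β₂^t) (bias correction)

θ_t = θ_{t−1} − η (m̂_t / (√v̂_t + ε) + λ θ_{t−1})

Here g_t is the gradient, η the learning rate, λ the weight-decay coefficient, and ε a small constant for numerical stability.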

AdamW stores m and v for every parameter, so optimizer state is 2x the model size. A 70B model requires 140B floats of optimizer state — about 560GB in FP32. This is why training a 70B model requires multiple GPUs even before considering the model weights themselves.
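A single AdamW update on a flat list of parameters can be sketched as below. The hyperparameter defaults (β₁ = 0.9, β₂ = 0.95, λ = 0.1, ε = 1e-8) are illustrative assumptions, and the bias-correction terms m̂ and v̂ follow the standard Adam definition.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update on flat lists of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g          # first moment (mean)
        v_i = beta2 * v_i + (1 - beta2) * g * g      # second moment (uncentered variance)
        m_hat = m_i / (1 - beta1 ** t)               # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        # Decoupled weight decay: lambda * theta sits outside the adaptive ratio.
        p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
        new_theta.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

theta = [1.0, -2.0]
grad = [0.5, -0.001]          # very different gradient magnitudes
m = [0.0, 0.0]
v = [0.0, 0.0]
theta1, m1, v1 = adamw_step(theta, grad, m, v, t=1)
print(theta1)
```

On the first step the adaptive ratio is close to the sign of the gradient, so both parameters move by about the learning rate even though their gradients differ by a factor of 500. The two extra lists m and v are the optimizer state counted in the memory estimate above.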

Learning Rate Schedule

The learning rate follows a cosine schedule with linear warmup:

Warmup (first ~2000 steps): LR ramps from 0 to peak (e.g., 3e-4). Starting with a large LR would cause divergence because the model's random initial weights produce huge gradients.

Cosine decay (remaining steps): LR follows a cosine curve from peak to 10% of peak. This gentle decay lets the model make large updates early (when there's a lot to learn) and small, precise updates later (when it's refining).
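The two phases combine into one small function. This is a sketch using the values quoted above (2000 warmup steps, peak 3e-4, floor at 10% of peak); the function and argument names are hypothetical.

```python
import math

def learning_rate(step, total_steps, peak_lr=3e-4, warmup_steps=2000, min_ratio=0.1):
    """Linear warmup from 0 to peak_lr, then cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 down to 0
    min_lr = min_ratio * peak_lr
    return min_lr + (peak_lr - min_lr) * cosine

total = 100_000
for s in (0, 1000, 2000, 51_000, 100_000):
    print(s, f"{learning_rate(s, total):.2e}")
```

The schedule needs the total step count up front, which is one reason the token budget is fixed before training starts.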

The showcase simulation below animates the full pipeline. Watch raw text flow through tokenization, embedding, transformer blocks, and out as logits. A live loss curve tracks training progress. Adjust the learning rate to see how it affects convergence.

[Interactive simulation: Full Pre-training Pipeline. Each frame is one training step: text → tokens → embeddings → transformer → logits → loss → update. A learning-rate control (default 1.5e-4) and a live loss curve show training progress.]

Memory Budget

Training a 70B model requires approximately:

Component	Size (FP32)	Size (mixed FP16/32)
Model weights	280 GB	140 GB (FP16)
Gradients	280 GB	140 GB (FP16)
Optimizer state (m, v)	560 GB	560 GB (FP32 required)
Activations (per batch)	~200-500 GB	~200-500 GB
Total	~1.3-1.6 TB	~1.0-1.3 TB

An H100 GPU has 80 GB of memory, so training a 70B model needs at minimum 16-32 H100s, combining model (tensor) parallelism (splitting each layer's weights across GPUs) with pipeline parallelism (splitting the sequence of layers across stages). In practice far more GPUs are used so that training finishes in reasonable time: the Llama 3 405B run used 16,384 H100s. The parallelism strategy is its own engineering discipline.
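The table's totals follow from bytes-per-parameter arithmetic. The sketch below reproduces it, assuming 1 GB = 1e9 bytes and taking the low end of the activation range; the helper name is hypothetical, and the GPU count it prints is a floor that ignores communication buffers and fragmentation.

```python
import math

def training_memory_gb(n_params, weight_bytes, grad_bytes, activations_gb):
    """Rough training memory in GB (1 GB = 1e9 bytes), following the table's accounting.

    AdamW keeps m and v in FP32 (4 bytes each) for every parameter.
    """
    weights = n_params * weight_bytes / 1e9
    grads = n_params * grad_bytes / 1e9
    optimizer = n_params * 2 * 4 / 1e9
    return {"weights": weights, "grads": grads, "optimizer": optimizer,
            "total": weights + grads + optimizer + activations_gb}

N = 70e9
ACTIVATIONS_GB = 200          # low end of the table's 200-500 GB range
fp32 = training_memory_gb(N, 4, 4, ACTIVATIONS_GB)
mixed = training_memory_gb(N, 2, 2, ACTIVATIONS_GB)

H100_GB = 80
min_gpus = math.ceil(mixed["total"] / H100_GB)
print(fp32["total"], mixed["total"], min_gpus)
```

Optimizer state is the largest single item in both columns, which is why sharding it across GPUs is usually the first memory optimization applied.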

Chapter 8: Modern Pre-training — Llama 3

Llama 3 (Meta, 2024) showed you don't need a trillion parameters: you need the right architecture and enough data. The 405B model is competitive with GPT-4 on many benchmarks. The 70B model, accessible to researchers, is among the strongest open-weight models in its class. The architecture is a refined version of the original Transformer with four key upgrades: RMSNorm, SwiGLU, RoPE, and GQA.

RMSNorm (replacing LayerNorm)

The original Transformer uses LayerNorm, which normalizes by subtracting the mean and dividing by the standard deviation. RMSNorm (Root Mean Square Normalization) drops the mean-centering step:

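RMSNorm(x) = (x / √(mean(x²) + ε)) · γ

Here the mean is taken over the D features of each token, γ is a learned per-dimension scale, and ε is a small constant for numerical stability.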

Why drop the mean? Empirically, the re-centering in LayerNorm adds computation without improving performance. RMSNorm is simpler and ~10% faster. The gain is small per operation but compounds across billions of tokens and hundreds of layers.
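Side by side, the two normalizations differ by one step. This is a sketch on a single feature vector; the ε value of 1e-6 is an assumption.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm over one D-dimensional vector: gamma * x / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * scale for g, v in zip(gamma, x)]

def layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm for comparison: subtract the mean, then divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * scale + b for g, v, b in zip(gamma, x, beta)]

x = [2.0, 4.0, 6.0, 8.0]
ones, zeros = [1.0] * 4, [0.0] * 4
y_rms = rms_norm(x, ones)
y_ln = layer_norm(x, ones, zeros)
print([round(v, 3) for v in y_rms])
print([round(v, 3) for v in y_ln])
```

RMSNorm fixes the scale of the vector but leaves its mean where it was, and it needs one pass of statistics (the mean square) instead of two (mean and variance).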

SwiGLU (replacing ReLU FFN)

The original Transformer's FFN applies ReLU(xW₁)W₂. Llama uses SwiGLU, a gated activation:

SwiGLU(x) = (SiLU(xW_gate) ⊙ xW_up) W_down

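Where SiLU(z) = z · σ(z) (Swish activation) and ⊙ is element-wise multiplication. The "gate" decides which dimensions to pass through. This adds a third weight matrix (W_gate) but consistently improves quality across model sizes. To keep the total parameter count comparable despite the extra matrix, the FFN dimension is shrunk from the classic 4 · D to about 8/3 · D (rounded to a multiple of 256): three matrices of size D × 8/3 · D hold the same 8 · D² parameters as two matrices of size D × 4 · D. Llama 3 itself uses a wider FFN than this rule gives (14336 = 3.5 · D for the 8B model, as listed in the training-details table).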
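Both the gated computation and the parameter-count argument fit in a short sketch. The code below handles a single token vector with plain lists, and it assumes the 8/3 · D value is rounded up to the next multiple of 256; the helper names are hypothetical.

```python
import math

def silu(z):
    """SiLU / Swish: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, w):
    """Row vector x [n] times matrix w [n][m] -> [m]."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """(SiLU(x W_gate) * (x W_up)) W_down for a single token vector x."""
    gate = [silu(g) for g in matvec(x, w_gate)]
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]     # element-wise gating
    return matvec(hidden, w_down)

def ffn_dim(d_model, multiple_of=256):
    """8/3 * D rounded up to a multiple of `multiple_of` (rounding up is an assumption)."""
    return multiple_of * math.ceil((8 * d_model / 3) / multiple_of)

identity = [[1.0, 0.0], [0.0, 1.0]]
out = swiglu_ffn([1.0, 2.0], identity, identity, identity)

D = 4096
d_ff = ffn_dim(D)
swiglu_params = 3 * D * d_ff            # W_gate, W_up, W_down
relu_params = 2 * D * (4 * D)           # W1, W2 with the classic 4D hidden size
print(out, d_ff, swiglu_params / relu_params)
```

With D = 4096 the rule gives a hidden size of 11008, and the three-matrix SwiGLU FFN lands within about 1% of the two-matrix ReLU FFN's parameter count.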

RoPE (Rotary Position Embedding)

The original Transformer adds sinusoidal position embeddings to the input. Llama uses RoPE (Rotary Position Embeddings), which encodes position by rotating the query and key vectors in 2D subspaces:

RoPE(q, m) = R_m q, RoPE(k, n) = R_n k

Where R_m is a block-diagonal rotation matrix with rotation angles that increase with position m and decrease with dimension index. The dot product q^Tk then depends only on the relative position (m − n), not the absolute positions.

RoPE has two advantages over learned absolute embeddings: it can be extended to longer sequences (the rotation is defined for any position, though in practice long-context models also rescale the rotation frequencies) and it encodes relative position (which matters more than absolute position for language understanding).
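The relative-position property is easy to verify numerically. The sketch below assumes the common convention of rotating adjacent pairs (2i, 2i+1) with angle m · base^(−2i/d) and base 10000; implementations differ in how they pair dimensions and which base they use.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by angle position * base**(-2i/d)."""
    d = len(vec)
    out = list(vec)
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset m - n = 3 at two very different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
# A different offset (m - n = 1) gives a different score.
score_c = dot(rope(q, 5), rope(k, 4))
print(score_a, score_b, score_c)
```

The first two scores agree up to floating-point error, while the third differs: the attention score sees how far apart two tokens are, not where they sit in the sequence.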

GQA (Grouped-Query Attention)

Standard Multi-Head Attention (MHA) has separate Q, K, V projections per head. With 128 heads, that's 128 sets of K and V. During inference, all these K,V pairs must be cached (the KV cache) for fast autoregressive generation. With full MHA, a 405B-scale model (126 layers, 128 heads of dimension 128) at 128K context would need roughly 1 TB of KV cache per sequence in 16-bit precision.

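Grouped-Query Attention (GQA) shares K,V across groups of query heads. Llama 3 405B uses 8 KV heads for 128 query heads, so each KV head is shared by 16 query heads. This reduces the KV cache by 16x with minimal quality loss. The 70B and 8B models also use 8 KV heads, for 64 and 32 query heads respectively, giving 8x and 4x reductions.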

Method	Query heads	KV heads	KV cache size	Quality
MHA	128	128	1x (baseline)	Best
GQA (Llama 3)	128	8	1/16x	~Same
MQA	128	1	1/128x	Slight degradation
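The cache sizes behind this comparison can be estimated from the model shape. The sketch below uses the Llama 3 405B dimensions listed in the training-details table later in this chapter and assumes 2 bytes per cached value (16-bit precision); the helper names are hypothetical.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache for one sequence in GB (1e9 bytes): K and V per layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value / 1e9

def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the shared KV head that a given query head reads from."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# Llama 3 405B shapes: 126 layers, D = 16384, 128 query heads.
LAYERS, HEADS, HEAD_DIM = 126, 128, 16384 // 128
CONTEXT = 128 * 1024

mha = kv_cache_gb(LAYERS, 128, HEAD_DIM, CONTEXT)
gqa = kv_cache_gb(LAYERS, 8, HEAD_DIM, CONTEXT)
mqa = kv_cache_gb(LAYERS, 1, HEAD_DIM, CONTEXT)
print(f"MHA {mha:.0f} GB, GQA {gqa:.1f} GB, MQA {mqa:.1f} GB")

# Query heads 0-15 share KV head 0, heads 16-31 share KV head 1, and so on.
mapping = [kv_head_for(h, HEADS, 8) for h in (0, 15, 16, 127)]
print(mapping)
```

These are per-sequence figures at full context, so serving many long requests at once multiplies them by the batch size. That is why the number of KV heads matters so much for inference cost.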

The simulation below shows the Llama 3 block diagram. Click on each component to see its internals, data flow, and tensor shapes. Compare MHA vs GQA vs MQA attention patterns.

[Interactive figure: Llama 3 Architecture Explorer. Clicking a component shows its internals; a toggle compares MHA, GQA, and MQA attention.]

Llama 3 405B: GQA with 8 KV heads for 128 query heads. That's a 16x KV cache reduction. For a model serving millions of users, a cache that is 16x smaller means each serving instance can hold far more concurrent or longer sequences on the same hardware. GQA is now widely used in recent open-weight LLMs.

Llama 3 Training Details

Spec	Llama 3 8B	Llama 3 70B	Llama 3 405B
Layers	32	80	126
Hidden dim D	4096	8192	16384
Attention heads	32	64	128
KV heads (GQA)	8	8	8
FFN dim	14336	28672	53248
Context length	8192	8192	8192 (extended to 128K)
Vocab size	128K	128K	128K
Training tokens	15T	15T	15T
GPUs	~2K H100	~6K H100	16K H100

Notice: all three model sizes train on the same 15T tokens. This is the "overtrain small models" strategy from Chapter 6 — deliberately exceeding Chinchilla-optimal data ratios to produce smaller, faster models that perform above their parameter count. The 8B model, with 15T tokens (1875 tokens/param), is massively overtrained by Chinchilla standards but outperforms Chinchilla-optimal 30B models.

How does Grouped-Query Attention (GQA) reduce the KV cache compared to standard Multi-Head Attention?

It uses smaller hidden dimensions
It shares K,V projections across groups of query heads, so fewer K,V pairs need caching
It compresses the KV cache using quantization

Chapter 9: Connections

Pre-training is the foundation. Everything that makes modern LLMs useful — instruction following, safety, reasoning — is built on top of a pre-trained model. Understanding pre-training is understanding the engine; everything else is bodywork.

Papers

BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018) — Masked language modeling. The paper that proved pre-training transforms NLP.
Contextual Word Representations (Smith, 2019) — Survey of the transition from static to contextual embeddings.
The Llama 3 Herd of Models (Meta, 2024) — The definitive open-source LLM. Architecture, data, training, and post-training details.

BERT vs GPT vs Llama 3

Property	BERT (2018)	GPT-3 (2020)	Llama 3 405B (2024)
Architecture	Encoder-only	Decoder-only	Decoder-only
Attention	Bidirectional	Causal (left-only)	Causal + GQA
Objective	MLM (15% masked)	Next-token prediction	Next-token prediction
Parameters	340M (Large)	175B	405B
Training data	3.3B words	300B tokens	15T tokens
Vocab size	30K (WordPiece)	50K (BPE)	128K (BPE)
Position encoding	Learned absolute	Learned absolute	RoPE (relative)
Normalization	LayerNorm (post)	LayerNorm (pre)	RMSNorm (pre)
FFN activation	GELU	GELU	SwiGLU
Can generate?	No	Yes	Yes
Open weights?	Yes	API only	Yes

Where to Go Next

L05: The Transformer — The architecture that makes pre-training possible. Self-attention, multi-head, positional encoding from scratch.
L08: Post-training — What happens AFTER pre-training: SFT, RLHF, DPO. How raw pre-trained models become useful assistants.
GPT Deep Dive — Standalone lesson on GPT architecture, training, and generation with interactive simulations.
Transformer Deep Dive — Full Transformer lesson with builder simulation.

The Pre-training Revolution Timeline

Year	Model	Key Innovation
2013	Word2Vec	Static word embeddings from unlabeled text
2014	GloVe	Global co-occurrence statistics
2017	Transformer	Attention replaces recurrence
2018 (Feb)	ELMo	Contextual embeddings from biLSTM
2018 (Jun)	GPT-1	Decoder pre-training + fine-tuning
2018 (Oct)	BERT	Bidirectional encoder, MLM
2019	GPT-2	Zero-shot transfer, scale matters
2019	RoBERTa	Better BERT training (no NSP, more data)
2020	GPT-3	In-context learning at 175B scale
2020	T5	Text-to-text encoder-decoder unification
2022	Chinchilla	Optimal compute-data-parameters tradeoff
2023	Llama 1/2	Open-weight models match proprietary
2024	Llama 3	15T tokens, GQA, 405B open weights

The trajectory is clear: the field moved from static representations (Word2Vec) to contextual (ELMo) to pre-trained + fine-tuned (BERT/GPT) to scale-is-all-you-need (GPT-3/Chinchilla) to open, optimized foundation models (Llama 3). Each step built on the previous. Pre-training is the substrate on which all modern AI is built.

"The unreasonable effectiveness of data." — The defining insight of the pre-training era is that a simple objective (predict the next token) applied to enough data produces models that reason, code, translate, and create. No task-specific engineering required. Just data, compute, and the Transformer.