Train once on the internet, fine-tune for anything — how foundation models are built.
Training from scratch for every task is like rebuilding a car engine every time you drive somewhere new. Want to classify movie reviews? Train a Transformer on 10,000 labeled reviews from random initialization. Want to detect spam? Start from scratch again, another 10,000 labels. Want to answer questions? Throw away everything you learned and start over. Each model sees only its tiny labeled dataset and knows nothing about language in general.
This is wildly inefficient. A child who learns to read doesn't re-learn the alphabet for every new book. They build a general understanding of language — grammar, vocabulary, common sense — and then apply it to any new text. Can we give neural networks the same head start?
The answer is pre-training: train one massive model on billions of words of text, learning the statistical structure of language itself. This model doesn't solve any particular task — it learns representations that are useful for ALL tasks. Then, for each specific task, you take this pre-trained model and fine-tune it on a small labeled dataset. The model already knows what words mean, how grammar works, and what the world looks like through text. It just needs a nudge to solve your particular problem.
The economics are dramatic. Pre-training costs millions of dollars and takes months on thousands of GPUs. But you only do it once. Fine-tuning costs a few hundred dollars and takes hours on a single GPU. One expensive foundation supports hundreds of cheap adaptations.
Before pre-training, every NLP task trained from random initialization. The model started knowing nothing — every weight was a random number. All knowledge had to come from the task-specific labeled data. If you only had 5,000 labeled examples (common in NLP), the model could barely learn the basics before overfitting.
Pre-training flips this. The model starts with knowledge of language, learned from unlabeled text (which is essentially free — the internet produces trillions of words). Fine-tuning just steers this knowledge toward a specific task. Even with 100 labeled examples, a pre-trained model can perform well because it already understands language. The labeled data just teaches it the task format.
The simulation below shows this difference. On the left, N tasks each train from random initialization — slow convergence, each model learns in isolation. On the right, one model is pre-trained on massive data, then quickly fine-tuned into N specialized models. Click to toggle between paradigms and watch the cost bars animate.
Click "Toggle" to switch paradigms. Left: each task trains from scratch. Right: one pre-trained model forks into fine-tuned heads.
Three things converged in 2018 to make pre-training work for NLP:
| Factor | Before 2018 | After 2018 |
|---|---|---|
| Architecture | LSTMs (sequential, slow) | Transformers (parallel, scalable) |
| Compute | Single GPU, days | TPU pods / GPU clusters, weeks |
| Self-supervision | Word2Vec (static, shallow) | MLM / autoregressive (contextual, deep) |
The Transformer (2017) made it possible to train on massive data efficiently. TPU pods at Google made it affordable (barely). And the key insight — that predicting masked or next words provides a rich supervisory signal — made it work. ELMo (Feb 2018), GPT (Jun 2018), and BERT (Oct 2018) all arrived within months of each other. The race was on.
"Bank" means different things in "river bank" and "bank account" — but Word2Vec gives them the same vector. One fixed point in embedding space, regardless of context. This is called the polysemy problem: many words have multiple meanings, and a single static vector can't capture them all.
Word2Vec and GloVe were revolutionary in 2013-2014. They proved that you could learn useful word representations from unlabeled text. But they had a fundamental limitation: each word gets exactly ONE vector, computed once and frozen. The vector for "bank" is the average of all its uses — part financial institution, part riverbed, part pool-shot cushion. It's a compromise that doesn't fully represent any of them.
Consider two sentences, for example: "I deposited money at the bank." and "We sat on the river bank."
A model using Word2Vec embeddings sees identical inputs for these two completely different meanings. It must rely entirely on the downstream task model (an LSTM, a classifier) to figure out which meaning is intended. That's asking a lot of a small task-specific model.
What if the embedding itself depended on the surrounding words? Instead of one fixed vector per word type, we compute a different vector for each word token (each occurrence in context). "Bank" next to "money" and "deposited" should produce a finance-flavored vector. "Bank" next to "river" and "sat" should produce a geography-flavored vector.
This is what contextual embeddings do. A deep model (LSTM in ELMo, Transformer in BERT/GPT) reads the entire sentence and produces a representation for each token that depends on every other token. The same word gets different vectors in different contexts.
Formally:

$$h_i = f(x_1, x_2, \ldots, x_n)_i$$

The representation $h_i$ for position $i$ depends on ALL input tokens $x_1$ through $x_n$, not just $x_i$ alone. This is a function of the entire sequence, evaluated at position $i$.
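To make the token-versus-type distinction concrete, here is a minimal sketch that uses a single, parameter-free self-attention step as a stand-in for the deep model. The 2-d vectors in `STATIC` are hand-picked for illustration and are not taken from any real embedding table.

```python
import math

# Hypothetical 2-d static embeddings, chosen by hand for illustration only.
STATIC = {
    "river": [0.0, 1.0],
    "money": [1.0, 0.0],
    "bank":  [0.5, 0.5],
    "the":   [0.1, 0.1],
}

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextual(tokens, i):
    """h_i: attention-weighted average of every token vector in the sentence."""
    x = [STATIC[t] for t in tokens]
    scores = [sum(a * b for a, b in zip(x[i], xj)) for xj in x]
    weights = softmax(scores)
    return [sum(w * xj[d] for w, xj in zip(weights, x)) for d in range(2)]

# Same word type ("bank"), two different contexts, two different vectors.
h_river = contextual(["the", "river", "bank"], 2)
h_money = contextual(["the", "money", "bank"], 2)
```

The static lookup returns `[0.5, 0.5]` for "bank" both times, while `h_river` leans toward the "river" axis and `h_money` toward the "money" axis. Real models add learned projections and many layers on top of this idea.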
ELMo (Embeddings from Language Models) was the first widely successful contextual embedding method. It trained a bidirectional LSTM language model on 1 billion words, then used the internal hidden states as features. The key insight: different layers of the LSTM capture different types of information.
| Layer | What it captures | Useful for |
|---|---|---|
| Layer 0 (char CNN) | Morphology, spelling | POS tagging, NER |
| Layer 1 (LSTM) | Syntax, grammar | Parsing, chunking |
| Layer 2 (LSTM) | Semantics, meaning | Sentiment, QA |
ELMo's approach was simple: run the pre-trained LSTM, extract hidden states from all layers, and let the downstream task learn a weighted combination. Different tasks weight different layers: syntax-heavy tasks prefer Layer 1, semantics-heavy tasks prefer Layer 2. This was an early, influential demonstration that pre-trained contextual representations transfer across tasks.
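The learned weighted combination can be sketched as a softmax-normalized mix of the layer vectors, scaled by a single task-specific factor. The function name `scalar_mix` and the toy 2-d layer vectors are illustrative assumptions.

```python
import math

def scalar_mix(layer_states, layer_logits, gamma=1.0):
    """Combine per-layer vectors for one token: gamma * sum_j softmax(w)_j * h_j."""
    top = max(layer_logits)
    exps = [math.exp(w - top) for w in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_states[0])
    return [gamma * sum(w * h[d] for w, h in zip(weights, layer_states))
            for d in range(dim)]

# Three layers (char CNN, LSTM 1, LSTM 2), each giving a toy 2-d vector for one token.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = scalar_mix(layers, [0.0, 0.0, 0.0])     # equal weight on every layer
top_heavy = scalar_mix(layers, [0.0, 0.0, 5.0])   # a task that favors Layer 2
```

Only `layer_logits` and `gamma` are trained per task in this sketch; the layer vectors come from the frozen pre-trained model.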
The simulation below shows the polysemy problem and its solution. Select a word, then see its static embedding as a fixed point and its contextual embeddings as different points that move based on sentence context.
Click a word to select it. The orange dot is the static (Word2Vec) embedding. Teal dots are contextual embeddings — different for each sentence.
ELMo used the pre-trained model as a feature extractor — freeze the LSTM weights, extract hidden states, feed them to a task-specific model. This works but limits how much the representations can adapt to the task.
BERT and GPT went further: fine-tune the entire model. Update all the pre-trained weights on the task-specific data. This lets the representations themselves adapt to the task, not just the task head on top. Fine-tuning usually outperforms feature extraction because the model can reshape its internal representations to focus on what the task needs.
Read this sentence: "The cat sat on the ___." You use both sides — "The cat sat on the" and the period at the end — to guess the blank. You consider left AND right context simultaneously. That's exactly how BERT learns.
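BERT (Bidirectional Encoder Representations from Transformers) was published by Google in October 2018 and set new state-of-the-art results on eleven NLP tasks. Its key innovation: Masked Language Modeling (MLM). Take a sentence, randomly mask 15% of the tokens, and train the model to predict them from the surrounding context, both left and right.
Given an input sentence, BERT's training procedure is: (1) select 15% of the token positions at random; (2) replace each selected token with [MASK] 80% of the time, with a random token 10% of the time, and leave it unchanged 10% of the time; (3) run the corrupted sentence through the encoder; (4) predict the original token at each selected position and compute the loss only there.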
The 80/10/10 split is not arbitrary. It solves a subtle problem.
If BERT always replaced selected tokens with [MASK], it would learn to only pay attention when it sees [MASK]. But during fine-tuning, there are no [MASK] tokens — the model sees real text. This train-test mismatch would degrade performance.
The solution: 80% of the time, replace with [MASK] (the model must use context). 10% of the time, replace with a random word (the model must detect errors). 10% of the time, keep the original word (the model must learn that correct words are also possible answers). This way, the model doesn't rely on the special [MASK] token and learns robust representations even for unmasked positions.
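The corruption step can be sketched in a few lines. This is a simplified version under stated assumptions: each position is selected independently with probability 0.15 (rather than picking an exact 15% per sequence), and `mask_tokens`, the toy `vocab`, and the seeded `rng` are illustrative names, not BERT's actual code.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, {position: original token}) using the 80/10/10 rule."""
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                           # position not selected: no loss here
        targets[i] = tok                       # the model must predict the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, targets

rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 1000
corrupted, targets = mask_tokens(tokens, vocab, rng)
```

Note that `targets` records every selected position, including the ones left unchanged, because the loss is computed at all of them.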
The MLM loss is:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus i})$$

The loss sums over only the masked positions $\mathcal{M}$. $x_{\setminus i}$ means "all tokens except position i", which is bidirectional context. This is the key difference from GPT: BERT sees both left AND right context when predicting each masked token.
BERT also trained on Next Sentence Prediction (NSP): given two sentences A and B, predict whether B actually follows A in the original text or is a random sentence. The [CLS] token's output is used for this binary classification. Later work (RoBERTa) showed that removing NSP matches or slightly improves downstream performance, and dropped it. The key pre-training signal is MLM alone.
| Variant | Layers | Hidden | Heads | Params |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
BERT uses only the encoder half of the Transformer. No causal mask, no decoder. Every token attends to every other token in both directions. This makes BERT excellent at understanding tasks (classification, NER, QA) but incapable of generation — it can't produce text left to right because it was never trained that way.
The simulation below walks through MLM step by step. A sentence is shown, tokens are masked with the 80/10/10 strategy, bidirectional attention flows in both directions, and the model predicts the original tokens.
Click "Step" to advance through the MLM process. Watch tokens get masked, attention flow bidirectionally, and predictions appear.
Fine-tuning BERT is beautifully simple. Take the pre-trained model, add a small task-specific head on top, and train end-to-end on labeled data:
| Task | Input | Output from | Head |
|---|---|---|---|
| Classification | [CLS] sentence [SEP] | [CLS] vector | Linear → softmax |
| NER | [CLS] sentence [SEP] | All token vectors | Linear per token |
| QA (SQuAD) | [CLS] question [SEP] passage [SEP] | Each passage token | Predict start + end span |
| Sentence pair | [CLS] sent_A [SEP] sent_B [SEP] | [CLS] vector | Linear → softmax |
The same pre-trained BERT model serves all these tasks. Only the tiny head changes. Fine-tuning takes 1-3 epochs on the task-specific data — typically 30 minutes on a single GPU. Pre-training took 4 days on 16 TPUs. The asymmetry is staggering: 99.9% of the compute is shared across all tasks.
Your phone suggests the next word as you type. "I'm going to the..." and it suggests "store," "gym," "doctor." That's an autoregressive language model — and it's exactly how GPT learns everything it knows.
GPT (Generative Pre-trained Transformer) was published by OpenAI in June 2018, four months before BERT. Its approach is simpler: predict the next token, left to right, one at a time. No masking, no corruption, no special tokens. Just read everything so far and predict what comes next.
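An autoregressive model factors the probability of a sequence as a product of conditional probabilities:

$$P(x_1, \ldots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, \ldots, x_{t-1})$$

Each token is predicted from all PREVIOUS tokens. This is the chain rule of probability: mathematically exact, not an approximation. The training objective is to maximize the likelihood of the training data under this factorization, or equivalently to minimize the negative log-likelihood:

$$\mathcal{L}_{\text{AR}} = -\sum_{t=1}^{n} \log P(x_t \mid x_1, \ldots, x_{t-1})$$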
Compare this to BERT's objective: BERT predicts 15% of randomly masked tokens from bidirectional context. GPT predicts 100% of tokens from left-only context. GPT sees less context per prediction but trains on every single token — no waste.
To enforce left-to-right prediction, GPT uses a causal attention mask. This is a triangular matrix that prevents each position from attending to future positions. Token 5 can attend to tokens 1-5 but not tokens 6, 7, 8, etc. The mask is applied by setting future attention scores to −∞ before softmax, which drives their weights to zero.
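$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V$$

Where M is the causal mask: $M_{ij} = 0$ if $j \le i$, and $M_{ij} = -\infty$ if $j > i$. This single matrix is what separates GPT from BERT architecturally.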
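A minimal sketch of the mask in action, using plain lists instead of tensors. The helper names `causal_mask` and `masked_softmax` are illustrative, and the all-zero scores are a toy input.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """M[i][j] = 0 if j <= i, else -inf (position i may not see the future)."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax of (scores + mask); masked entries get weight 0."""
    out = []
    for row, mrow in zip(scores, mask):
        z = [s + m for s, m in zip(row, mrow)]
        top = max(z)                       # finite: the diagonal is never masked
        exps = [math.exp(v - top) for v in z]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0] * 4 for _ in range(4)]     # uniform toy attention scores
weights = masked_softmax(scores, causal_mask(4))
```

With uniform scores, row 0 puts all its weight on itself and row 2 splits its weight evenly over positions 0 to 2, with exactly zero on position 3.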
Because GPT was trained to predict the next token, it can generate text by repeated sampling. Start with a prompt, predict the next token, append it, predict the next token after that, append again, and so on. Each new token is sampled from the probability distribution over the vocabulary.
Temperature controls the randomness of generation. Temperature T scales the logits $z_i$ before softmax:

$$P(x_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
T = 1.0 gives the model's learned distribution. T < 1.0 makes the distribution sharper (more deterministic, picks the most likely token). T > 1.0 makes it flatter (more random, more creative). T → 0 is greedy decoding (always pick the highest-probability token).
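Here is a small sketch of temperature scaling and sampling over a toy 3-token vocabulary. The function names and the logits are made up for illustration; treating T = 0 as an explicit argmax is one common convention, since dividing by zero is undefined.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature, rng):
    """Greedy argmax when T == 0, otherwise sample from the tempered distribution."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]                   # toy scores for a 3-token vocabulary
sharp = softmax_with_temperature(logits, 0.5)
base = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
```

The probability of the top token falls as T rises: roughly 0.87 at T = 0.5, 0.67 at T = 1.0, and 0.51 at T = 2.0.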
The simulation below shows autoregressive generation with a causal mask. Step through token by token, watching each new token attend only to previous tokens. Adjust temperature to see how randomness affects generation.
Click "Generate" to produce the next token. The causal mask (triangle) shows which positions each token can attend to. Adjust temperature to control randomness.
| Model | Year | Params | Data | Key Contribution |
|---|---|---|---|---|
| GPT-1 | 2018 | 117M | BookCorpus (7K books) | Pre-train + fine-tune works for NLP |
| GPT-2 | 2019 | 1.5B | WebText (40GB) | Zero-shot transfer, no fine-tuning needed |
| GPT-3 | 2020 | 175B | 300B tokens | In-context learning, few-shot prompting |
| GPT-4 | 2023 | ~1.8T (est.) | ~13T tokens (est.) | Multimodal, near-human reasoning |
The core recipe barely changed across this timeline. GPT-4's architecture has not been published, but GPT-1 through GPT-3 share essentially the same design at dramatically different sizes: the same autoregressive objective, the same causal mask, the same next-token prediction. Scale was the secret ingredient.
It's 2019. You need an NLP model. BERT or GPT? The answer depends entirely on what you DO with it. These two models represent fundamentally different design philosophies, and understanding the tradeoff is essential to understanding modern AI.
BERT is an encoder. It reads the entire input at once, attends bidirectionally, and produces rich representations. It's designed to UNDERSTAND text — classify it, extract entities from it, answer questions about it. But it cannot generate new text because it was never trained to predict tokens left to right.
GPT is a decoder. It reads left to right with a causal mask and produces tokens one at a time. It's designed to GENERATE text — complete prompts, write stories, translate languages. It can also understand text, but less efficiently than BERT because it only sees left context.
| Property | BERT (Encoder) | GPT (Decoder) |
|---|---|---|
| Attention | Bidirectional (sees all tokens) | Causal (sees only left) |
| Training objective | Predict masked tokens (15%) | Predict next token (100%) |
| Token utilization | 15% (only masked positions) | 100% (every token is a target) |
| Generation | Cannot generate | Natural generation |
| Understanding | Strong (both sides) | Weaker (left-only context) |
| Fine-tuning | Add task head | Add task head OR prompt |
| Dominant era | 2019-2020 | 2020-present |
Different tasks naturally suit different architectures:
Classification (sentiment, spam, topic): BERT excels. It reads the full text bidirectionally, and the [CLS] token aggregates a holistic understanding. GPT can classify by generating the label word, but it's less efficient.
Named Entity Recognition (NER): BERT excels. Each token needs context from both sides to determine if "Washington" is a person, city, or state. GPT sees only left context, missing crucial right-side information.
Text generation (stories, code, dialogue): GPT only. BERT fundamentally cannot generate because it was trained with bidirectional attention. You can't produce tokens left to right when your model expects to see the future.
Question answering: Both work, differently. BERT reads the question and passage together with bidirectional attention and predicts answer span boundaries. GPT reads the question and passage, then generates the answer as text. BERT was initially better on extractive QA (SQuAD), but GPT-3+ became better on open-ended QA through scale.
The simulation below shows both architectures side by side. Toggle between tasks to see which model architecture lights up as the natural fit.
Click a task to see which architecture is the natural fit. Attention patterns and data flow are shown for each.
There's a third option: encoder-decoder models like T5 and BART. These have a bidirectional encoder (like BERT) that reads the input and a causal decoder (like GPT) that generates the output. The decoder cross-attends to the encoder's representations.
T5 (2020) unified ALL NLP tasks into a text-to-text format. Classification? Input: "classify: I love this movie." Output: "positive." Translation? Input: "translate English to French: Hello." Output: "Bonjour." Every task becomes sequence-to-sequence generation.
Encoder-decoder models were dominant in 2020-2021, but decoder-only models won the scaling race. The architectural simplicity of decoder-only — no encoder, no cross-attention, just one stack of causal Transformer blocks — made it easier to scale to hundreds of billions of parameters. Simplicity wins at scale.
Llama 3 trained on 15 trillion tokens. Where do you find 15 trillion tokens of quality text? You can't just point at the internet and press "download." Raw web crawl data is filthy — spam, porn, duplicates, boilerplate, HTML artifacts, malware. Turning raw crawl data into a high-quality training corpus is an engineering discipline in itself, and arguably as important as the model architecture.
Every modern LLM follows roughly the same data pipeline: raw crawl text is cleaned, filtered, deduplicated, and then mixed across domains. Each stage filters out more noise, and the numbers are striking.
The funnel is brutal. Starting from 100TB of raw text, you might end up with 5-15TB of cleaned, deduplicated, mixed training data. Between 85% and 95% is thrown away.
This deserves special attention because it's counterintuitive. Doesn't more data always help? No. Duplicate data causes two problems:
Memorization. If the model sees the same Wikipedia paragraph 100 times, it memorizes it verbatim instead of learning generalizable patterns. This inflates performance on benchmarks that overlap with training data (contamination) and wastes model capacity on rote memorization.
Distribution skew. Duplicates over-represent certain topics, writing styles, and domains. If 30% of your training data is near-duplicate boilerplate (cookie banners, privacy policies, navigation menus), the model allocates 30% of its capacity to modeling boilerplate. That's 30% less capacity for actual language understanding.
FineWeb (Hugging Face, 2024) found that 30%+ of Common Crawl is near-duplicate text. Their FineWeb-Edu dataset, aggressively filtered for educational content and deduplicated, trained better models than datasets 10x larger. Quality beats quantity.
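As a rough illustration of near-duplicate removal, the sketch below compares documents by the Jaccard overlap of their word 3-grams and keeps a document only if it is not too close to one already kept. The shingle size, the 0.8 threshold, and the quadratic comparison loop are assumptions chosen for readability; pipelines at web scale typically rely on approximate schemes such as MinHash instead.

```python
def shingles(text, k=3):
    """Set of k-word windows from a whitespace-tokenized, lowercased document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Overlap of two shingle sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def deduplicate(docs, threshold=0.8):
    """Keep a document only if it is not too similar to any already-kept one."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

docs = [
    "we use cookies to improve your experience on this site",
    "We use cookies to improve your experience on this site",
    "the chinchilla paper studies compute optimal training",
]
unique = deduplicate(docs)
```

The second cookie banner differs only in capitalization, so it is dropped and `unique` keeps two documents.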
The mix of data domains dramatically affects what the model learns. Here's a typical composition for a modern LLM:
| Domain | % of mix | Why this ratio |
|---|---|---|
| Web pages (cleaned) | ~50% | Broad coverage of topics, styles, knowledge |
| Code (GitHub) | ~15% | Reasoning ability, structured thinking, code generation |
| Books | ~10% | Long-form coherence, deep knowledge, narrative |
| Academic papers | ~8% | Technical accuracy, citations, formal reasoning |
| Wikipedia | ~5% | Factual knowledge, neutral tone (upsampled 3-5x) |
| Conversations (forums) | ~5% | Dialogue capability, informal language |
| Math/STEM data | ~5% | Mathematical reasoning, problem solving |
| Other (news, legal, etc.) | ~2% | Domain coverage |
Notice that Wikipedia is only ~5% of the mix despite being very high quality. That's because there are only ~4B words in English Wikipedia. Filling even 5% of a 15T-token corpus (750B tokens) from 4B words would mean repeating it more than a hundred times, and we just said repetition is bad. So Wikipedia is upsampled a modest 3-5x, not more, and its share of the very largest corpora ends up correspondingly smaller.
The simulation below shows the full data pipeline as an interactive funnel. Click each stage to see how many tokens survive. The pie chart shows the final domain composition.
Click each pipeline stage to see token counts and filtering ratios. The right panel shows domain composition of the final mix.
Where does the data come from for 15T tokens? At roughly 0.75 English words per token, 15T tokens is about 11 trillion words. The entire English Wikipedia is ~4 billion words. All English books ever written are estimated at ~130 billion words. The entirety of Reddit is ~150 billion words. To reach 15T tokens, you MUST include web crawl data, because there simply isn't enough high-quality curated text.
This creates a tension: you need web data for volume, but web data is lower quality. The art of pre-training data engineering is managing this tradeoff — using enough web data for scale while filtering aggressively enough to maintain quality. Llama 3's solution: process 100x more raw data than you use and keep only the best 1-5%.
You have a fixed GPU budget. Should you train a big model for fewer steps, or a small model for more steps? Before 2020, this was guesswork. Then Kaplan et al. at OpenAI discovered that language model performance follows predictable power laws in three variables: compute (C), parameters (N), and data (D).
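The cross-entropy loss L of a language model scales as a power law with each variable, holding the others fixed:

$$L(N) \propto N^{-\alpha_N}, \qquad L(D) \propto D^{-\alpha_D}, \qquad L(C) \propto C^{-\alpha_C}$$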
On a log-log plot, these are straight lines. Double the compute, lose a fixed fraction of loss. Double it again, lose the same fraction. This is remarkable: it means we can predict how well a model will perform before training it, just from its size and data budget.
The exponents are approximately: $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, $\alpha_C \approx 0.057$ (Kaplan 2020). These are small numbers, meaning you need a LOT more of each resource for meaningful improvement. 10x more compute yields only about 12% lower loss.
Given a fixed compute budget C, how should you split it between model size N and data D? Training a model with N parameters on D tokens costs approximately C ≈ 6ND FLOPs (6 FLOPs per token per parameter: 2 for the forward pass, 4 for the backward pass).
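The 12% figure follows directly from the compute exponent. A quick check, using only the exponents quoted above (the function name `loss_ratio` is illustrative):

```python
ALPHA_N, ALPHA_D, ALPHA_C = 0.076, 0.095, 0.057   # exponents quoted above

def loss_ratio(scale_factor, alpha):
    """Factor by which loss changes when a resource grows by scale_factor."""
    return scale_factor ** (-alpha)

reduction_10x_compute = 1 - loss_ratio(10, ALPHA_C)   # about 0.12
reduction_2x_compute = 1 - loss_ratio(2, ALPHA_C)     # about 0.04
```

Because the law is a power law, every further 10x in compute buys the same ~12% relative reduction, which is why the curve is a straight line on log-log axes.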
Kaplan et al. (2020) suggested training large models on relatively little data. Their recommendation: scale N faster than D. GPT-3 followed this advice — 175B parameters trained on only 300B tokens.
Then Hoffmann et al. (2022) at DeepMind trained the Chinchilla model and showed that Kaplan's allocation was suboptimal. The compute-optimal allocation scales N and D proportionally:

$$N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}$$

In plain English: model size and data should grow at the same rate. Because C ≈ 6ND, each doubling of compute multiplies both by about √2 ≈ 1.4, and doubling BOTH the model size AND the data takes 4x the compute. The optimal ratio is approximately 20 tokens per parameter. A 10B-parameter model should be trained on 200B tokens. A 70B model on 1.4T tokens.
By Chinchilla's formula, GPT-3's 175B parameters should have been trained on ~3.5 trillion tokens. It was trained on only 300B, more than 10x too few. Chinchilla (70B params, 1.4T tokens) outperformed GPT-3 with less than half the parameters because it spent its compute on data as well as on model size.
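Combining C ≈ 6ND with the 20-tokens-per-parameter rule of thumb gives a closed form: N = sqrt(C / 120) and D = 20N. The sketch below assumes exactly those two approximations; the function names are illustrative.

```python
import math

TOKENS_PER_PARAM = 20        # rule-of-thumb ratio quoted above
FLOPS_PER_PARAM_TOKEN = 6    # from C ≈ 6ND

def training_flops(params, tokens):
    return FLOPS_PER_PARAM_TOKEN * params * tokens

def chinchilla_optimal(compute_flops):
    """Solve C = 6 * N * D with D = 20 * N for N (params) and D (tokens)."""
    n = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n, TOKENS_PER_PARAM * n

c = training_flops(70e9, 1.4e12)         # Chinchilla's own budget, about 5.9e23 FLOPs
n_opt, d_opt = chinchilla_optimal(c)     # recovers roughly 70B params, 1.4T tokens
n_4x, d_4x = chinchilla_optimal(4 * c)   # 4x the compute doubles both N and D
gpt3_tokens = TOKENS_PER_PARAM * 175e9   # 3.5e12 tokens for a 175B model
```

The square root is the important part: under these assumptions, parameters and tokens each grow with the square root of the compute budget.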
| Model | N (params) | D (tokens) | Tokens/param | Chinchilla-optimal? |
|---|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 | No (10x undertrained) |
| Chinchilla | 70B | 1.4T | 20 | Yes (defined it) |
| Llama 1 (65B) | 65B | 1.4T | 21.5 | ~Yes |
| Llama 2 (70B) | 70B | 2T | 28.6 | Overtrained (intentionally) |
| Llama 3 (70B) | 70B | 15T | 214 | Massively overtrained |
Wait — Llama 3 trains 70B parameters on 15T tokens, 10x past Chinchilla-optimal. Why? Because Chinchilla optimality minimizes training compute. But Meta cares about inference compute. A smaller model trained longer is cheaper to deploy than a larger model trained optimally. Llama 3 intentionally overtrains to get a smaller, faster model that performs as well as a Chinchilla-optimal larger one.
The simulation below lets you explore scaling laws interactively. Set a compute budget and watch how Chinchilla-optimal model size and data grow in lockstep. Actual models are plotted as reference points.
Drag the compute slider to change the budget. The curves show how optimal model size (N) and data (D) scale. Dots show real models.
Scaling laws predict smooth improvement in loss. But some capabilities appear to emerge suddenly at certain scales. Chain-of-thought reasoning, multi-step arithmetic, and instruction following seem absent in small models and present in large ones, with sharp transitions. Whether these are truly "emergent" (unpredictable) or just artifacts of how we measure (choosing metrics that show sharp transitions) is an active debate.
What's not debated: bigger models with more data consistently do better. The power law has held across 7+ orders of magnitude of compute, from $10^{18}$ to $10^{25}$ FLOPs. No ceiling has been observed yet.
Let's train a language model. Not conceptually — actually step through every operation that happens on every GPU, from raw text to weight update. Understanding this pipeline is the difference between reading about swimming and getting in the water.
Raw text enters the pipeline as Unicode strings. The tokenizer converts these into integer token IDs. Modern LLMs use BPE (Byte Pair Encoding) or SentencePiece, which learn a vocabulary of subword units from the training data. Common words become single tokens ("the" → 1820). Rare words are split into pieces ("unfathomable" → ["un", "fath", "om", "able"]).
The vocabulary size V is typically 32K-128K. Llama 3 uses a 128K vocab (up from 32K in Llama 2) because a larger vocab means fewer tokens per document, which means faster training and longer effective context.
Tokenized documents are concatenated into a single long stream, then sliced into fixed-length sequences of T tokens (the context length). These sequences are grouped into batches of B sequences. Each training step processes a batch of shape [B, T].
Typical values: B = 1024-4096 sequences, T = 2048-8192 tokens. That's 2-32 million tokens per batch. With 15T tokens total, Llama 3's training takes roughly 500K-1M gradient steps.
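The slicing and the step count can both be sketched with plain lists. `make_batches` is an illustrative helper, the 50-token stream is a stand-in for real token IDs, and B = 2048 with T = 8192 is one arbitrary choice within the ranges quoted above.

```python
def make_batches(token_stream, batch_size, seq_len):
    """Slice one long token stream into batches of shape [B, T]; drop the ragged tail."""
    tokens_per_batch = batch_size * seq_len
    batches = []
    for start in range(0, len(token_stream) - tokens_per_batch + 1, tokens_per_batch):
        chunk = token_stream[start:start + tokens_per_batch]
        batches.append([chunk[r * seq_len:(r + 1) * seq_len] for r in range(batch_size)])
    return batches

stream = list(range(50))                   # stand-in for concatenated token IDs
batches = make_batches(stream, batch_size=2, seq_len=8)

# Step count at scale: total tokens divided by tokens per batch.
steps = 15e12 / (2048 * 8192)              # roughly 894K steps
```

With 16 tokens per batch, the 50-token stream yields three full batches and two leftover tokens are dropped.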
Each token ID is looked up in an embedding table of shape [V, D], where D is the model dimension (8192 for Llama 3 70B). The batch of token IDs [B, T] becomes a tensor of embeddings [B, T, D]. Each token is now a D-dimensional vector that the Transformer can process.
The embedding tensor [B, T, D] passes through L Transformer blocks, each applying a self-attention sublayer followed by a feed-forward sublayer, with normalization and a residual connection around each.
After L blocks (80 for Llama 3 70B), the output tensor is still [B, T, D].
A final linear projection maps [B, T, D] → [B, T, V], producing logits (raw scores) over the vocabulary for each position. These are NOT probabilities yet — they're unnormalized log-odds.
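This is the crucial step that makes autoregressive training work. The targets are the SHIFTED input: for each position t, the target is the token at position t+1. If the input is ["The", "cat", "sat"], the targets are "cat", "sat", and whatever token follows "sat" in the stream. The loss is the cross-entropy between the predicted distribution and the true next token:

$$\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log P(x_{t+1} \mid x_1, \ldots, x_t)$$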
This is averaged over all positions in the batch. The loss measures how surprised the model is by the next token. A perfect model would have loss 0 (always predicts the correct next token with probability 1). In practice, natural language has inherent entropy — the best possible loss is around 1.0-1.5 nats because language is genuinely unpredictable.
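A minimal sketch of the shift and the loss, in plain Python rather than tensors. The token IDs and the 8-word vocabulary are toy values, and `shift` and `cross_entropy` are illustrative names.

```python
import math

def shift(token_ids):
    """Inputs are tokens 0..n-2; targets are the same sequence moved one step left."""
    return token_ids[:-1], token_ids[1:]

def cross_entropy(logits_per_position, targets):
    """Mean of -log softmax(logits)[target] over positions, in nats."""
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        top = max(logits)
        log_z = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_z - logits[target]
    return total / len(targets)

inputs, targets = shift([5, 2, 7, 1])            # toy token IDs
uniform_logits = [[0.0] * 8 for _ in inputs]     # a model that knows nothing, V = 8
loss = cross_entropy(uniform_logits, targets)    # equals ln(8), about 2.08 nats
```

A model that spreads probability evenly over V tokens scores ln(V), which is a useful sanity check for the loss at the very start of training.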
PyTorch's autograd computes gradients of the loss with respect to every parameter. For a 70B-parameter model, this means 70 billion gradient values, each computed by the chain rule through potentially 80 Transformer blocks. The backward pass takes roughly 2x the compute of the forward pass (because it must compute gradients through every operation).
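Modern LLMs use AdamW, which maintains two running statistics for each parameter, both computed from the gradient $g_t$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

Here m is a running mean of the gradients and v is a running mean of their squares.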
AdamW stores m and v for every parameter, so optimizer state is 2x the model size. A 70B model requires 140B floats of optimizer state — about 560GB in FP32. This is why training a 70B model requires multiple GPUs even before considering the model weights themselves.
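One AdamW update for a single scalar parameter looks like the sketch below. The hyperparameter defaults (`beta2=0.95`, `weight_decay=0.1`, and so on) are assumed values chosen for illustration, not settings confirmed for any particular model.

```python
import math

def adamw_step(w, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-8, weight_decay=0.1):
    """One AdamW update for a single scalar parameter; returns (w, m, v)."""
    m = beta1 * m + (1 - beta1) * g            # running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g        # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for step t >= 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adamw_step(w, g=0.5, m=m, v=v, t=1)

# Optimizer-state footprint: two FP32 values (m and v) per parameter.
optimizer_state_gb = 70e9 * 2 * 4 / 1e9        # 560.0
```

On the first step the bias-corrected ratio is about 1 regardless of the gradient's size, so the parameter moves by roughly `lr` plus the decoupled weight-decay term.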
The learning rate follows a cosine schedule with linear warmup:
Warmup (first ~2000 steps): LR ramps from 0 to peak (e.g., 3e-4). Starting with a large LR would cause divergence because the model's random initial weights produce huge gradients.
Cosine decay (remaining steps): LR follows a cosine curve from peak to 10% of peak. This gentle decay lets the model make large updates early (when there's a lot to learn) and small, precise updates later (when it's refining).
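The two phases combine into one small function. The peak of 3e-4, the 2000 warmup steps, and the 10% floor come from the text above; the run length of 100,000 steps is a hypothetical value for the example.

```python
import math

def learning_rate(step, total_steps, peak_lr=3e-4, warmup_steps=2000, min_ratio=0.1):
    """Linear warmup from 0 to peak_lr, then cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    cosine = 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))
    return peak_lr * (min_ratio + (1 - min_ratio) * cosine)

total = 100_000                            # hypothetical run length
lr_start = learning_rate(0, total)         # 0.0
lr_peak = learning_rate(2000, total)       # 3e-4
lr_end = learning_rate(total, total)       # 3e-5
```

Halfway through the decay phase the cosine term is 0.5, so the rate sits at 55% of peak rather than at the midpoint of a straight line to zero.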
The showcase simulation below animates the full pipeline. Watch raw text flow through tokenization, embedding, transformer blocks, and out as logits. A live loss curve tracks training progress. Adjust the learning rate to see how it affects convergence.
Click "Play" to animate the pipeline. Each frame is one training step: text → tokens → embeddings → transformer → logits → loss → update. The loss curve shows progress.
Training a 70B model requires approximately:
| Component | Size (FP32) | Size (mixed FP16/32) |
|---|---|---|
| Model weights | 280 GB | 140 GB (FP16) |
| Gradients | 280 GB | 140 GB (FP16) |
| Optimizer state (m, v) | 560 GB | 560 GB (FP32 required) |
| Activations (per batch) | ~200-500 GB | ~200-500 GB |
| Total | ~1.3-1.6 TB | ~1.0-1.3 TB |
An H100 GPU has 80 GB of memory. Training 70B requires at minimum 16-32 H100s using model parallelism (splitting the model across GPUs) and pipeline parallelism (splitting the sequence of layers across stages). Llama 3 used 16,384 H100 GPUs. The parallelism strategy is its own engineering discipline.
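The mixed-precision column of the table can be reproduced with a few lines of arithmetic. The byte counts per parameter (2 for FP16 weights, 2 for FP16 gradients, 8 for FP32 m and v) are the table's own assumptions, and the GPU count is a memory-only lower bound that ignores communication buffers and fragmentation.

```python
import math

def training_memory_gb(params, activations_gb, weight_bytes=2, grad_bytes=2, optim_bytes=8):
    """Weights + gradients + AdamW state (m and v in FP32) + activations, in GB."""
    per_param = weight_bytes + grad_bytes + optim_bytes
    return params * per_param / 1e9 + activations_gb

low = training_memory_gb(70e9, activations_gb=200)     # 1040.0 GB
high = training_memory_gb(70e9, activations_gb=500)    # 1340.0 GB
min_gpus = math.ceil(low / 80)                         # 13 H100s if memory were the only limit
```

Practical setups need headroom beyond this bound, which is consistent with the 16-32 GPU minimum quoted above.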
Llama 3 (Meta, 2024) showed you don't need a trillion parameters: you need the right architecture and enough data. The 405B model is competitive with GPT-4 on many benchmarks. The 70B model, accessible to researchers, is among the strongest open-weight models in its class. The architecture is a refined version of the original Transformer with four key upgrades: RMSNorm, SwiGLU, RoPE, and GQA.
The original Transformer uses LayerNorm, which normalizes by subtracting the mean and dividing by the standard deviation. RMSNorm (Root Mean Square Normalization) drops the mean-centering step:

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{D}\sum_{i=1}^{D} x_i^2 + \epsilon}} \odot \gamma$$

where $\gamma$ is a learned per-dimension gain and $\epsilon$ is a small constant for numerical stability.
Why drop the mean? Empirically, the re-centering in LayerNorm adds computation without improving performance. RMSNorm is simpler and ~10% faster. The gain is small per operation but compounds across billions of tokens and hundreds of layers.
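Side by side, the two normalizations differ by a single subtraction. This is a sketch on plain lists; the `eps` value and the function names are illustrative choices.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """x / sqrt(mean(x^2) + eps) * gain, with no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

def layer_norm(x, gain, bias, eps=1e-6):
    """(x - mean) / sqrt(var + eps) * gain + bias, for comparison."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gain, bias)]

x = [3.0, 4.0]
y_rms = rms_norm(x, gain=[1.0, 1.0])                      # about [0.849, 1.131]
y_ln = layer_norm(x, gain=[1.0, 1.0], bias=[0.0, 0.0])    # about [-1.0, 1.0]
```

RMSNorm rescales the vector to unit root-mean-square but keeps its direction, while LayerNorm also shifts it to zero mean.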
The original Transformer's FFN applies $\text{ReLU}(xW_1)W_2$. Llama uses SwiGLU, a gated activation:

$$\text{FFN}(x) = \big(\text{SiLU}(xW_{\text{gate}}) \odot xW_1\big)\,W_2$$

Where $\text{SiLU}(z) = z \cdot \sigma(z)$ (Swish activation) and ⊙ is element-wise multiplication. The "gate" decides which dimensions to pass through. This adds a third weight matrix ($W_{\text{gate}}$) but consistently improves quality across model sizes. The usual convention sets the FFN dimension to 8/3 · D (rounded to a multiple of 256), so that three matrices cost about as many parameters as the two matrices of a standard 4D-wide FFN. Llama 3 goes wider than that: its 8B model pairs D = 4096 with an FFN dimension of 14336, which is 3.5 · D.
The original Transformer adds sinusoidal position embeddings to the input. Llama uses RoPE (Rotary Position Embeddings), which encodes position by rotating the query and key vectors in 2D subspaces:

$$\tilde{q}_m = R_m q_m, \qquad \tilde{k}_n = R_n k_n, \qquad \tilde{q}_m^{\top}\tilde{k}_n = q_m^{\top} R_{n-m}\, k_n$$

Where $R_m$ is a block-diagonal rotation matrix whose rotation angles increase with position m and decrease with dimension index. The dot product $\tilde{q}_m^{\top}\tilde{k}_n$ then depends only on the relative position (m − n), not the absolute positions.
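The gated FFN can be sketched with plain lists for a single token. The toy weights and shapes (model dim 2, FFN dim 3) are arbitrary, and rounding the 8/3 · D rule *up* to the next multiple of 256 is an assumption about how the rounding is done.

```python
import math

def silu(z):
    """SiLU(z) = z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def swiglu_ffn(x, w_gate, w1, w2):
    """(SiLU(x W_gate) * (x W_1)) W_2, with * taken element-wise."""
    gate = [silu(g) for g in matvec(x, w_gate)]
    up = matvec(x, w1)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, w2)

def ffn_dim(d_model, multiple_of=256):
    """8/3 * D rounded up to a multiple of `multiple_of`."""
    return multiple_of * math.ceil(8 * d_model / 3 / multiple_of)

# Toy shapes: model dim 2, FFN dim 3. Weights are arbitrary illustrative numbers.
w_gate = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
w1 = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = swiglu_ffn([1.0, 2.0], w_gate, w1, w2)
```

The 8/3 factor comes from parameter parity: two D×4D matrices hold 8D² weights, and three D×h matrices hold the same number when h = 8D/3.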
RoPE has two advantages over learned absolute embeddings: it naturally generalizes to longer sequences (just keep rotating) and it encodes relative position (which matters more than absolute position for language understanding).
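The relative-position property is easy to verify numerically. The sketch below rotates consecutive pairs of dimensions by position-dependent angles; the `base=10000.0` frequency schedule is a commonly used default assumed here, and the query and key vectors are arbitrary.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions; later pairs rotate more slowly."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)    # angle grows with position, shrinks with i
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))

# A different offset (1) gives a different score.
score_c = dot(rope(q, 5), rope(k, 4))
```

`score_a` and `score_b` agree up to floating-point error even though the absolute positions differ by 100, while `score_c` is clearly different.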
Standard Multi-Head Attention (MHA) has separate Q, K, V projections per head. With 128 heads, that's 128 sets of K and V. During inference, all these K,V pairs must be cached (the KV cache) for fast autoregressive generation. For a 405B model with 128 heads and 128K context, the KV cache can exceed 100 GB.
Grouped-Query Attention (GQA) shares K,V across groups of query heads. Llama 3 uses 8 KV heads for 128 query heads — each KV head is shared by 16 query heads. This reduces the KV cache by 16x with minimal quality loss.
| Method | Query heads | KV heads | KV cache size | Quality |
|---|---|---|---|---|
| MHA | 128 | 128 | 1x (baseline) | Best |
| GQA (Llama 3) | 128 | 8 | 1/16x | ~Same |
| MQA | 128 | 1 | 1/128x | Slight degradation |
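The cache sizes behind this table can be estimated directly. The sketch assumes an FP16 cache (2 bytes per value), a single sequence, and the Llama 3 405B shape of 126 layers, hidden size 16384, and 128 query heads; `kv_cache_gb` is an illustrative helper.

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_value=2, batch=1):
    """Keys and values for every layer, KV head, and cached position, in GB."""
    values = 2 * layers * kv_heads * head_dim * context_len * batch   # 2 = K and V
    return values * bytes_per_value / 1e9

head_dim = 16384 // 128          # 128 dimensions per head
ctx = 128 * 1024                 # 128K-token context
mha = kv_cache_gb(126, kv_heads=128, head_dim=head_dim, context_len=ctx)
gqa = kv_cache_gb(126, kv_heads=8, head_dim=head_dim, context_len=ctx)
mqa = kv_cache_gb(126, kv_heads=1, head_dim=head_dim, context_len=ctx)
```

Under these assumptions full MHA would need roughly 1.1 TB per sequence, GQA with 8 KV heads about 68 GB, and MQA about 8.5 GB.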
The simulation below shows the Llama 3 block diagram. Click on each component to see its internals, data flow, and tensor shapes. Compare MHA vs GQA vs MQA attention patterns.
Click a component to see its internals. Toggle attention mode to compare MHA, GQA, and MQA.
| Spec | Llama 3 8B | Llama 3 70B | Llama 3 405B |
|---|---|---|---|
| Layers | 32 | 80 | 126 |
| Hidden dim D | 4096 | 8192 | 16384 |
| Attention heads | 32 | 64 | 128 |
| KV heads (GQA) | 8 | 8 | 8 |
| FFN dim | 14336 | 28672 | 53248 |
| Context length | 8192 | 8192 | 8192 (extended to 128K) |
| Vocab size | 128K | 128K | 128K |
| Training tokens | 15T | 15T | 15T |
| GPUs | ~2K H100 | ~6K H100 | 16K H100 |
Notice: all three model sizes train on the same 15T tokens. This is the "overtrain small models" strategy from Chapter 6 — deliberately exceeding Chinchilla-optimal data ratios to produce smaller, faster models that perform above their parameter count. The 8B model, with 15T tokens (1875 tokens/param), is massively overtrained by Chinchilla standards but outperforms Chinchilla-optimal 30B models.
Pre-training is the foundation. Everything that makes modern LLMs useful — instruction following, safety, reasoning — is built on top of a pre-trained model. Understanding pre-training is understanding the engine; everything else is bodywork.
| Property | BERT (2018) | GPT-3 (2020) | Llama 3 405B (2024) |
|---|---|---|---|
| Architecture | Encoder-only | Decoder-only | Decoder-only |
| Attention | Bidirectional | Causal (left-only) | Causal + GQA |
| Objective | MLM (15% masked) | Next-token prediction | Next-token prediction |
| Parameters | 340M (Large) | 175B | 405B |
| Training data | 3.3B words | 300B tokens | 15T tokens |
| Vocab size | 30K (WordPiece) | 50K (BPE) | 128K (BPE) |
| Position encoding | Learned absolute | Learned absolute | RoPE (relative) |
| Normalization | LayerNorm (post) | LayerNorm (pre) | RMSNorm (pre) |
| FFN activation | GELU | GELU | SwiGLU |
| Can generate? | No | Yes | Yes |
| Open weights? | Yes | API only | Yes |
| Year | Model | Key Innovation |
|---|---|---|
| 2013 | Word2Vec | Static word embeddings from unlabeled text |
| 2014 | GloVe | Global co-occurrence statistics |
| 2017 | Transformer | Attention replaces recurrence |
| 2018 (Feb) | ELMo | Contextual embeddings from biLSTM |
| 2018 (Jun) | GPT-1 | Decoder pre-training + fine-tuning |
| 2018 (Oct) | BERT | Bidirectional encoder, MLM |
| 2019 | GPT-2 | Zero-shot transfer, scale matters |
| 2019 | RoBERTa | Better BERT training (no NSP, more data) |
| 2020 | GPT-3 | In-context learning at 175B scale |
| 2020 | T5 | Text-to-text encoder-decoder unification |
| 2022 | Chinchilla | Optimal compute-data-parameters tradeoff |
| 2023 | Llama 1/2 | Open-weight models match proprietary |
| 2024 | Llama 3 | 15T tokens, GQA, 405B open weights |
The trajectory is clear: the field moved from static representations (Word2Vec) to contextual (ELMo) to pre-trained + fine-tuned (BERT/GPT) to scale-is-all-you-need (GPT-3/Chinchilla) to open, optimized foundation models (Llama 3). Each step built on the previous. Pre-training is the substrate on which all modern AI is built.