GPT-3 (Brown et al., 2020)

Chapter 0: Why Scale?

Imagine you hire two translators. Translator A studied French for one year in school — they know the grammar rules, the common vocab, but when you hand them a legal contract, they stumble. Translator B grew up bilingual, read thousands of books, lived in Paris for a decade — you hand them the same contract and they nail it without any special preparation. What's the difference? Translator B has so much experience that novel tasks become easy.

Before GPT-3, the standard recipe for NLP was: pre-train a language model on lots of text, then fine-tune it on a labeled dataset for each specific task. Want sentiment analysis? Fine-tune on sentiment labels. Want translation? Fine-tune on parallel corpora. Want question answering? Fine-tune on QA pairs. Each task required its own labeled dataset, its own training run, and its own deployed model.

This works, but it has three fundamental problems:

Problem	Why It Hurts
Data dependency	Every new task needs thousands of labeled examples. For rare or specialized tasks, these may not exist.
Narrow generalization	A model fine-tuned on sentiment can't do translation. You need a separate model (and separate compute) for each task.
Spurious correlations	Fine-tuning on small datasets lets models exploit shortcuts — learning "negation words = negative sentiment" rather than understanding meaning.

GPT-3 asked a radical question: what if we just make the model big enough that it can do tasks without any fine-tuning at all?

The core bet of GPT-3: If a language model is large enough and trained on enough text, it will learn to perform tasks from just a few examples shown in its prompt — no gradient updates, no task-specific training. This ability is called in-context learning, and it emerges only at sufficient scale. GPT-2 (1.5B parameters) showed flickers of it. GPT-3 (175B parameters) made it work.

The result was startling. GPT-3, with no task-specific training at all, could translate languages, answer trivia, write code, solve analogies, and generate coherent articles — just from being shown a few examples in the prompt. On some benchmarks, this "few-shot" approach matched or beat models that had been explicitly fine-tuned on thousands of labeled examples.

This wasn't just an incremental improvement. It was a paradigm shift. Instead of "collect data → train model → deploy model" for every task, you could just write a prompt. The model became a general-purpose tool that could be steered with language itself.

The Scale–Capability Frontier

Drag the slider to scale up model parameters. Watch how capabilities emerge at different thresholds. Small models can barely do anything without fine-tuning. As you approach 175B, few-shot performance on diverse tasks suddenly becomes viable.

What is the fundamental limitation of the "pre-train then fine-tune" paradigm that GPT-3 aimed to overcome?

1. Every new task requires its own labeled dataset and dedicated fine-tuning run, creating data dependency and narrow specialization: the model can't generalize across tasks without task-specific training

2. Fine-tuned models are too slow at inference time

3. Pre-training is too expensive to be practical

Chapter 1: In-Context Learning

GPT-3's most important contribution isn't its size — it's the discovery that large language models can learn tasks from examples embedded in the prompt, without any weight updates. This is in-context learning (ICL).

Here's how it works. Instead of fine-tuning the model on a labeled dataset, you construct a prompt that demonstrates the task with a few examples, then ask the model to continue the pattern:

prompt
# Few-shot prompt for sentiment classification
Review: "This movie was absolutely fantastic!"
Sentiment: Positive

Review: "Terrible acting, boring plot."
Sentiment: Negative

Review: "A masterpiece of modern cinema."
Sentiment:

The model has never been trained on sentiment labels. It has never seen a loss function that says "positive" or "negative." Yet it completes the prompt with "Positive" — because it has learned the pattern from the examples in the prompt and can continue it.

In-context learning is NOT learning in the traditional sense. The model's weights don't change. There are no gradient updates. The model is simply doing next-token prediction, as it always does. But because the prompt contains examples, the model's attention mechanism can identify the pattern (input → output mapping) and apply it to new inputs. Think of it as "learning by analogy" rather than "learning by training."

Brown et al. defined three levels of in-context learning:

Setting	Examples in Prompt	Description
Zero-shot	0	Just the task instruction: "Translate English to French: cheese →"
One-shot	1	One example + the new input
Few-shot	10–100	Multiple examples + the new input (limited only by context window)

The critical finding: performance scales smoothly with both model size and number of examples. More examples help, and bigger models benefit more from additional examples. A 175B-parameter model with 50 examples can outperform a 1.3B model with the same examples by a large margin — the larger model is simply better at identifying and applying the pattern.
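
To make the three settings concrete, here is a minimal sketch of how such prompts might be assembled. The helper name `build_prompt`, the ` => ` separator, and the demonstration pairs are illustrative assumptions in the style of the translation example, not a format the model requires.

```python
def build_prompt(instruction, demos, query, sep=" => "):
    """Assemble a zero-, one-, or few-shot prompt.

    demos is a list of (input, output) pairs; an empty list gives zero-shot.
    """
    lines = [instruction]
    for source, target in demos:
        lines.append(f"{source}{sep}{target}")
    # The final line leaves the output blank for the model to complete.
    lines.append(f"{query}{sep}".rstrip())
    return "\n".join(lines)


demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

zero_shot = build_prompt("Translate English to French:", [], "cheese")
one_shot = build_prompt("Translate English to French:", demos[:1], "cheese")
few_shot = build_prompt("Translate English to French:", demos, "cheese")

print(few_shot)
```

The only difference between the three settings is the length of `demos`; the model and its weights are identical in every case.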

Why does ICL work?

This is still an active research question. The leading hypotheses:

1. Meta-learning during pre-training. During pre-training on internet text, the model encounters many naturally occurring "few-shot" patterns. Blog posts that define a term and then use it. Translation examples in language-learning websites. Q&A forums where multiple questions are answered in sequence. The model implicitly learns to recognize and continue these patterns.

2. Induction heads. Olsson et al. (2022) identified specific attention patterns called induction heads — pairs of attention heads that implement a "match and copy" behavior. One head finds a previous occurrence of the current token, another copies the token that followed. This mechanism can identify patterns in few-shot examples and apply them.

3. Implicit gradient descent. Akyürek et al. (2023) showed that, at least in simplified settings such as linear regression, Transformers can implement gradient descent internally: the forward pass through the attention layers can perform optimization steps on the in-context examples. On this view, the model is effectively "training itself" on your examples during inference.
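
The "match and copy" behavior from hypothesis 2 can be sketched as ordinary code. This is a toy analogy operating on a list of strings, not how attention heads are actually implemented; the function name `induction_predict` is hypothetical.

```python
def induction_predict(tokens):
    """Toy 'match and copy': predict the next token from an earlier repeat.

    Finds the most recent earlier occurrence of the last token and returns
    whatever followed it. Returns None if there is no earlier match.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions (excluding the last token itself).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


context = ["Mr", "Dursley", "said", "hello", ".", "Mr"]
print(induction_predict(context))
```

Literal copying like this only handles exact repeats. Applying a few-shot pattern to a new input requires matching on something more abstract than token identity, which is part of why the mechanism behind ICL remains an open question.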

In-Context Learning Demo

Watch how adding more examples in the prompt improves GPT-3's accuracy on a task. Each bar shows model accuracy. Drag the slider to add examples and see performance improve. Notice how larger models benefit more from additional examples.

The prompt is the new program

ICL fundamentally changes the interface between humans and AI. Instead of writing code or collecting labeled data, you write prompts. The prompt is the program, and the model is the interpreter. This insight — that language itself can serve as a programming language for neural networks — is arguably the most consequential idea in GPT-3.

python
# Traditional ML: collect data, train model
dataset = load_dataset("sentiment")          # need labeled data
model = train(base_model, dataset, epochs=3)  # need GPU hours
result = model.predict("Great movie!")       # single task

# GPT-3 ICL: just write a prompt
prompt = """Review: "Loved it!" → Positive
Review: "Hated it." → Negative
Review: "Great movie!" →"""
result = gpt3.complete(prompt)                 # no training needed
# Returns: " Positive"

In GPT-3's in-context learning, what happens to the model's weights when it processes the few-shot examples in the prompt?

1. Nothing: the weights are completely frozen. The model performs standard next-token prediction; the "learning" is just the attention mechanism recognizing patterns in the prompt and continuing them

2. The weights are updated with a small learning rate to adapt to the examples

3. A special adapter layer is trained on the examples before generating the output

Chapter 2: Few-Shot vs Fine-Tuning

To understand why few-shot learning matters, you need to see exactly how it compares to the alternatives. Brown et al. set up a systematic comparison across dozens of benchmarks.

The key comparison is between three regimes:

Fine-Tuning (FT)

Update all model weights on task-specific labeled data. Best performance, but requires data + compute per task. One model per task.

↓

Few-Shot (FS)

No weight updates. Show K=10-100 examples in the prompt. One model for all tasks. Performance limited by context window.

↓

Zero-Shot (ZS)

No weight updates, no examples. Just a task description: "Translate to French." Hardest for the model.

The headline result: GPT-3 few-shot sometimes matches or exceeds fine-tuned BERT-Large, despite having never been trained on the task-specific data. This is remarkable because BERT-Large was explicitly fine-tuned on thousands of labeled examples for each task.

Task	BERT-Large FT	GPT-3 Few-Shot	GPT-3 Zero-Shot
TriviaQA	—	71.2%	64.3%
LAMBADA	—	86.4%	76.2%
StoryCloze	87.4%	87.7%	83.2%
SuperGLUE	69.0%	71.8%	58.9%

The crossover point: On some closed-book knowledge tasks (notably TriviaQA), GPT-3 few-shot is exceptionally strong because the model's 175B parameters serve as a vast knowledge store, learned from its 300B-token training set. On others, such as NaturalQuestions, it still trails fine-tuned models. For tasks requiring precise format following or narrow domain expertise, fine-tuning still wins. The point where few-shot overtakes fine-tuning depends on both the task and the model size.

When does each approach win?

Few-shot wins when:

1. The task is common in internet text (translation, Q&A, summarization).

2. The label space is simple (positive/negative, yes/no, a named entity).

3. You have very little labeled data (< 100 examples).

4. You need to switch tasks rapidly (many tasks, single model).

Fine-tuning wins when:

1. You have large labeled datasets (> 10K examples).

2. The task requires precise, structured outputs (NER, parsing).

3. Maximum accuracy is critical (medical, legal applications).

4. The domain is specialized and rare in pre-training data.

Few-Shot vs Fine-Tuning Tradeoff

Drag the slider to change the number of available labeled examples. The chart shows when few-shot (orange) overtakes fine-tuning (teal) as labeled data decreases. At very few examples, fine-tuning overfits and few-shot wins.

Data contamination concern

One critical concern: is GPT-3's test performance genuine or did it see the test data during pre-training? With 300 billion tokens of training data scraped from the internet, some benchmark test sets inevitably leaked into the training corpus.

Brown et al. studied this by measuring overlap between their training data and benchmark test sets. They found contamination on some benchmarks but argued that removing contaminated examples had minimal effect on most results. Still, this concern foreshadowed a major issue in the field: as training corpora grow, maintaining clean evaluation becomes increasingly difficult.

python
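# Data contamination check: n-gram overlap between train and test
def get_ngrams(text, n):
    """Return all word-level n-grams of text as tuples."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def check_contamination(train_docs, test_example, n=13, threshold=0.7):
    """Flag test_example if most of its n-grams appear in a single training doc."""
    test_ngrams = set(get_ngrams(test_example, n))
    if not test_ngrams:
        return False  # example shorter than n words: nothing to compare
    for doc in train_docs:
        overlap = test_ngrams & set(get_ngrams(doc, n))
        if len(overlap) / len(test_ngrams) > threshold:
            return True  # likely contaminated
    return False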

In what scenario does GPT-3 few-shot learning typically outperform fine-tuning?

1. When you have millions of labeled examples

2. When labeled data is scarce (under ~100 examples), because fine-tuning on too few examples leads to overfitting, while few-shot relies on the model's pre-trained knowledge

3. When the task requires very long outputs like full documents

Chapter 3: Architecture at 175B

GPT-3 is architecturally simple — it's the same decoder-only Transformer as GPT-2, just scaled up massively. The core insight is that the architecture doesn't need to change; you just need more of it.

The full model family spans three orders of magnitude, from 125M to 175B parameters:

Model	Params	Layers	d_model	Heads	d_head	Context
GPT-3 Small	125M	12	768	12	64	2048
GPT-3 Medium	350M	24	1024	16	64	2048
GPT-3 Large	760M	24	1536	16	96	2048
GPT-3 XL	1.3B	24	2048	24	128	2048
GPT-3 2.7B	2.7B	32	2560	32	80	2048
GPT-3 6.7B	6.7B	32	4096	32	128	2048
GPT-3 13B	13B	40	5140	40	128	2048
GPT-3 175B	175B	96	12288	96	128	2048

175 billion parameters — where do they live? At 175B, the model is ~700 GB in float32 (~350 GB in float16). It cannot fit on a single GPU. GPT-3 required model parallelism across multiple GPUs, splitting the model's layers and attention heads across machines. This engineering challenge — distributing a model too large for any single device — defined the era.
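
The memory figures above follow from bytes-per-parameter arithmetic. A minimal sketch, assuming 1 GB = 10^9 bytes and the 32 GB variant of the V100 (the GPU memory size is an assumption for illustration):

```python
import math

def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 175e9
GPU_MEMORY_GB = 32  # assumption: the 32 GB variant of the V100

fp32_gb = weight_memory_gb(N_PARAMS, 4)  # float32 = 4 bytes per parameter
fp16_gb = weight_memory_gb(N_PARAMS, 2)  # float16 = 2 bytes per parameter
min_gpus_fp16 = math.ceil(fp16_gb / GPU_MEMORY_GB)

print(f"float32: {fp32_gb:.0f} GB, float16: {fp16_gb:.0f} GB")
print(f"GPUs needed for float16 weights alone: {min_gpus_fp16}")
```

This counts weights only. Training needs several times more memory for gradients, optimizer state, and activations, so the real device count is far higher.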

Architecture details

Each of the 96 layers follows the standard GPT-2 pattern with one key modification:

Masked Multi-Head Self-Attention

96 heads, d_k=128, causal mask (can only attend to left context). Alternating dense and locally banded sparse attention in different layers.

↓ + residual + LayerNorm (pre-norm)

Feed-Forward Network

12288 → 49152 → 12288 (4x expansion). GELU activation.

↓ + residual + LayerNorm (pre-norm)

GPT-3 uses pre-norm (LayerNorm before each sub-layer) rather than post-norm (LayerNorm after). This was found to be more stable for training very deep networks — with post-norm, gradients in the first layers can explode or vanish when the network is 96 layers deep.
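
The difference between the two orderings is easiest to see in code. The sketch below uses plain Python lists and a LayerNorm without learned scale or bias; `zero_sublayer` is a hypothetical stand-in for attention or the FFN, chosen so the effect on the residual path is visible.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_block(x, sublayer):
    """Pre-norm: x + sublayer(LayerNorm(x)). The residual path is untouched."""
    out = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, out)]

def post_norm_block(x, sublayer):
    """Post-norm: LayerNorm(x + sublayer(x)). The residual path is renormalized."""
    out = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, out)])

def zero_sublayer(x):
    """Hypothetical stand-in for attention/FFN that outputs all zeros."""
    return [0.0] * len(x)

x = [1.0, 2.0, 4.0]
print(pre_norm_block(x, zero_sublayer))   # input passes through unchanged
print(post_norm_block(x, zero_sublayer))  # input is rescaled by LayerNorm
```

With pre-norm there is an identity path from the input of the first layer to the output of the last, which is one common explanation for why deep stacks train more stably in this arrangement.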

The sparse attention is another key architectural detail. Every other layer uses a locally banded attention pattern — each token attends only to its nearby neighbors (within a window) rather than all previous tokens. This reduces the quadratic cost of attention while still allowing long-range information to propagate through the dense layers.
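
As a rough illustration of the two patterns, the following sketch builds a dense causal mask and a locally banded one as boolean grids. The sequence length and window size are toy values picked for readability, not GPT-3's actual settings.

```python
def causal_mask(seq_len):
    """Dense causal mask: position i may attend to every position j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def banded_causal_mask(seq_len, window):
    """Locally banded causal mask: position i attends to the last `window` positions."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]

def show(mask):
    """Render a mask as text: 'x' = attention allowed, '.' = masked out."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)

SEQ_LEN, WINDOW = 8, 3  # toy sizes; the real window size is an assumption here

dense = causal_mask(SEQ_LEN)
banded = banded_causal_mask(SEQ_LEN, WINDOW)

print(show(dense))
print()
print(show(banded))
print(sum(map(sum, dense)), "vs", sum(map(sum, banded)), "allowed pairs")
```

The dense mask grows quadratically with sequence length while the banded one grows linearly, which is where the cost saving comes from.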

Parameter count breakdown

python
# Where do 175B parameters come from?
d = 12288       # hidden dimension
L = 96          # layers
V = 50257       # vocab size (BPE tokens)
ctx = 2048      # context length

# Token embeddings: V × d
emb = V * d                        # ≈ 617M
# Position embeddings: ctx × d
pos = ctx * d                      # ≈ 25M
# Per-layer attention: 4 × d² (Q, K, V, output projections)
attn_per_layer = 4 * d * d         # ≈ 604M
# Per-layer FFN: 2 × d × 4d (up + down projections)
ffn_per_layer = 2 * d * (4 * d)   # ≈ 1.21B
# Per-layer LayerNorm: 2 × 2d (2 norms, each with scale + bias)
ln_per_layer = 2 * 2 * d           # ≈ 49K (negligible)

total = emb + pos + L * (attn_per_layer + ffn_per_layer + ln_per_layer)
print(f"Total: {total/1e9:.1f}B")     # ≈ 174.6B ✓

The FFN dominates. Each layer's FFN (1.21B params) has 2x the attention parameters (604M). Across 96 layers, FFN accounts for ~116B of the 175B total, about two-thirds. Some researchers also describe the FFN as the "memory" of the Transformer, since it is thought to be where much of the model's factual knowledge is stored.

GPT-3 Architecture Visualizer

Click through the model's layers. Each layer shows attention (teal) and FFN (orange) with their parameter counts. The bar on the right shows total parameter distribution. Click a layer to see its tensor shapes.

In GPT-3's 175B parameter model, which component accounts for the most parameters?

1. The token embeddings (50257 × 12288)

2. The feed-forward networks across all 96 layers: each FFN has ~1.21B parameters (12288 → 49152 → 12288), totaling ~116B or about two-thirds of all parameters

3. The attention heads across all 96 layers

Chapter 4: Training Data

A 175B-parameter model is only as good as the data it learns from. GPT-3 was trained on approximately 300 billion tokens from a carefully curated mix of internet data.

Dataset	Tokens (B)	Weight in Mix	Epochs	Description
Common Crawl (filtered)	410	60%	0.44	Filtered web text — 45 TB raw → 570 GB after quality filtering
WebText2	19	22%	2.9	Expanded GPT-2 training set — Reddit links with ≥3 upvotes
Books1	12	8%	1.9	Internet-based books corpus
Books2	55	8%	0.43	Internet-based books corpus
Wikipedia	3	3%	3.4	English Wikipedia

Sampling weights ≠ data sizes. Notice that WebText2 is only 19B tokens but gets 22% of the sampling weight, while Common Crawl is 410B tokens but only 60%. This means WebText2 examples are seen ~2.9 times on average (2.9 epochs) while Common Crawl examples are seen less than once (0.44 epochs). The reasoning: higher-quality data (curated links) should be seen more often than noisy web crawls.
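
One way to check the relationship between dataset size and epochs is to multiply them: the tokens actually drawn from a source equal its size times the number of passes over it. The sketch below uses the sizes and epoch counts from the table above. The resulting shares will not match the listed weights exactly, presumably because the published figures are rounded.

```python
TOTAL_TRAINING_TOKENS_B = 300

# (source, size in billions of tokens, epochs over the training run), from the table above
sources = [
    ("Common Crawl (filtered)", 410, 0.44),
    ("WebText2", 19, 2.9),
    ("Books1", 12, 1.9),
    ("Books2", 55, 0.43),
    ("Wikipedia", 3, 3.4),
]

# Tokens actually drawn from a source = its size x the number of passes over it.
tokens_seen = {name: size * epochs for name, size, epochs in sources}
total_seen = sum(tokens_seen.values())
total_raw = sum(size for _, size, _ in sources)

for name, size, _ in sources:
    share_of_raw_data = size / total_raw
    share_of_training = tokens_seen[name] / total_seen
    print(f"{name:24s} raw {share_of_raw_data:5.1%}  training {share_of_training:5.1%}")

print(f"Total tokens seen: {total_seen:.0f}B of ~{TOTAL_TRAINING_TOKENS_B}B")
```

WebText2 makes up only a few percent of the raw data but a much larger share of what the model actually trains on, which is the upsampling effect described above.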

Common Crawl filtering

Raw Common Crawl is extremely noisy — full of spam, boilerplate, duplicates, and garbage. OpenAI applied a multi-step filtering pipeline:

Step 1: Quality classifier

Train a logistic regression on {WebText = positive, raw CC = negative}. Keep CC documents with high "quality" scores. This is essentially asking: "does this document look like something Reddit would upvote?"

↓

Step 2: Fuzzy deduplication

Use MinHash + LSH to detect near-duplicate documents. Remove duplicates to prevent memorization and reduce training cost.

↓

Step 3: Benchmark decontamination

Remove documents with high n-gram overlap with any benchmark test set. (Imperfect — some contamination remained.)

The quality classifier is a simple but effective trick. By training a binary classifier to distinguish "curated web text" (Reddit-upvoted links) from "random web text" (Common Crawl), OpenAI created an automatic quality filter. Documents that look like something a human found worth sharing get higher scores. This filtered Common Crawl from 45 TB of raw text down to 570 GB — keeping only about 1.3%.
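
The fuzzy deduplication in Step 2 can be illustrated with a small MinHash sketch. This is a simplified stand-in, not OpenAI's pipeline: it compares signatures directly and omits the LSH bucketing that makes the approach scale, and the shingle size, hash function (SHA-1), and number of hashes are all assumptions chosen for illustration.

```python
import hashlib

def shingles(text, k=3):
    """Set of word-level k-grams ('shingles') for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(items, num_hashes=64):
    """For each of num_hashes salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_hashes):
        salt = str(seed).encode()
        signature.append(min(
            int.from_bytes(hashlib.sha1(salt + item.encode()).digest()[:8], "big")
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank tonight"
doc_c = "scaling laws relate model size data and compute to the final training loss"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b))  # high: near-duplicates
print(estimated_jaccard(sig_a, sig_c))  # low: unrelated documents
```

Because signatures are short and fixed-length, near-duplicates can be found without comparing full documents pairwise, which matters at the scale of a web crawl.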

Tokenization

GPT-3 uses the same byte-level BPE (Byte Pair Encoding) tokenizer as GPT-2, with a vocabulary of 50,257 tokens. The tokenizer works at the byte level, so it can represent any text (including Unicode) without unknown tokens.

python
# GPT-3 tokenization example
import tiktoken
enc = tiktoken.get_encoding("gpt2")  # GPT-3 uses same tokenizer

text = "Language models are few-shot learners."
tokens = enc.encode(text)
print(tokens)                              # a list of integer token IDs
print([enc.decode([t]) for t in tokens])   # the text piece behind each ID
print(len(tokens))                         # usually a bit more than the word count

# Each token = integer ID in [0, 50256]
# Context window = 2048 tokens ≈ 1500 words
# Few-shot examples must fit within this window!

The context window bottleneck: GPT-3's 2048-token context window limits how many few-shot examples can fit in the prompt. Each example uses maybe 20-50 tokens, so you can fit roughly 40-100 examples. This is why "few-shot" means 10-100, not 10,000. Later models (GPT-4, Claude) expanded to 8K-128K tokens, enabling more examples and longer documents.
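
The estimate above is simple integer division over the token budget. A minimal sketch, where the per-example and reserved token counts are hypothetical round numbers rather than measurements:

```python
CONTEXT_WINDOW = 2048

def max_examples(tokens_per_example, instruction_tokens=0, query_tokens=0,
                 completion_tokens=0, context_window=CONTEXT_WINDOW):
    """How many demonstrations fit once the fixed parts of the prompt are reserved."""
    budget = context_window - instruction_tokens - query_tokens - completion_tokens
    return max(0, budget // tokens_per_example)

for per_example in (20, 50):
    print(per_example, "tokens/example ->", max_examples(per_example), "examples")

# Reserving room for an instruction, the query, and the answer shrinks the budget.
print(max_examples(50, instruction_tokens=30, query_tokens=50, completion_tokens=20))
```

In practice the output must also fit inside the same window, so tasks with long answers leave room for fewer demonstrations.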

Training Data Composition

This visualization shows the proportion of each data source in GPT-3's training mixture. The left bars show raw token counts; the right bars show sampling weights. Notice how quality sources are upsampled relative to their size.

Why does GPT-3 sample from WebText2 (19B tokens) at 22% weight but from Common Crawl (410B tokens) at only 60% weight, even though CC is 20x larger?

1. Because WebText2 is higher quality (curated Reddit links), so it deserves more training passes. Higher-quality data should be seen more often (2.9 epochs) while noisy web crawl data should be seen less than once (0.44 epochs)

2. Because Common Crawl is in a different language

3. Because the model's context window can't handle Common Crawl documents

Chapter 5: Scaling Laws

Perhaps the most impactful finding from the GPT-3 work (and its predecessor, Kaplan et al. 2020) is that language model performance follows remarkably smooth power laws as you scale up model size, data, and compute.

A power law takes the form:

L(N) = (N_c / N)^α

Where L is the loss, N is the number of parameters, N_c is a constant, and α is the scaling exponent. When you plot loss vs parameters on a log-log scale, you get a straight line. This means every 10x increase in parameters gives a predictable decrease in loss.
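
Both properties, the fixed improvement per 10x and the straight line on log-log axes, can be checked numerically. The sketch below uses the parameter exponent 0.076 reported by Kaplan et al. (2020); the value of `N_C` is an arbitrary placeholder, since neither result depends on it.

```python
import math

def loss_ratio(scale_factor, alpha):
    """Factor by which loss changes when N grows by scale_factor, for L(N) = (N_c / N)^alpha."""
    return scale_factor ** (-alpha)

ALPHA_N = 0.076  # parameter-scaling exponent from Kaplan et al. (2020)

ratio = loss_ratio(10, ALPHA_N)
print(f"10x parameters -> loss multiplied by {ratio:.3f} ({1 - ratio:.1%} lower)")

# On log-log axes the relationship is a straight line with slope -alpha.
N_C = 1e14  # arbitrary placeholder; the slope does not depend on it

def loss(n_params):
    return (N_C / n_params) ** ALPHA_N

n1, n2 = 125e6, 175e9
slope = (math.log(loss(n2)) - math.log(loss(n1))) / (math.log(n2) - math.log(n1))
print(f"log-log slope: {slope:.3f}")
```

The same ratio applies to every 10x step, whether from 125M to 1.25B or from 17.5B to 175B, which is what makes extrapolation from small models possible.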

Scaling laws are the reason GPT-3 was built. Before you spend millions of dollars training a 175B model, you want to know: will it be worth it? The scaling laws let you predict the performance of a 175B model from experiments at 125M, 350M, and 1.3B. OpenAI trained the small models, verified the power law fit, extrapolated to 175B, and committed to the full training run based on those predictions. The laws held.

Three axes of scaling

Kaplan et al. (2020) found three independent scaling relationships:

Scaling Axis	Power Law	Exponent α	Implication
Parameters (N)	L(N) ∝ N^-0.076	0.076	10x more params → loss drops by ~16%
Data (D)	L(D) ∝ D^-0.095	0.095	10x more data → loss drops by ~20%
Compute (C)	L(C) ∝ C^-0.050	0.050	10x more compute → loss drops by ~11%

The key insight from Kaplan et al.: as compute grows, most of the extra budget should go to a larger model rather than more data. For a 10x increase in compute, their fits suggest growing the model roughly 5x and the training data only about 2x. (Later, the Chinchilla paper challenged this, arguing that parameters and data should be scaled in roughly equal proportion.)

Compute-optimal training

Given a fixed compute budget C (measured in FLOPs), how should you divide it between model size N and training data D? The relationship is:

C ≈ 6 · N · D

Where C is in FLOPs, N is parameters, and D is training tokens. The factor of 6 comes from: ~2 FLOPs per parameter per token for the forward pass (one multiply and one add), and ~4 FLOPs per parameter per token for the backward pass (2 for gradients with respect to the activations, 2 for gradients with respect to the weights).

python
# Compute budget for GPT-3 training
N = 175e9      # 175 billion parameters
D = 300e9      # 300 billion tokens
C = 6 * N * D   # ≈ 3.15 × 10²³ FLOPs

# In GPU-hours (V100 at ~120 TFLOPS, close to its peak mixed-precision rate)
v100_flops = 120e12  # FLOPs per second, assuming near-100% utilization
hours = C / v100_flops / 3600
print(f"{hours/1000:.0f}K V100-hours")  # ≈ 729K V100-hours

# At ~$1/GPU-hour, that's ~$729K in compute as a lower bound;
# real utilization is well below peak, so the actual cost is several times higher.
# On a cluster of ~1000 V100s with model parallelism, wall-clock time ≈ 1 month

Scaling Laws Visualizer

This log-log plot shows loss vs parameters for GPT-3's model family. The straight line on a log-log plot means a power law. Each dot is one of the 8 model sizes. Drag the slider to extend the line and predict performance at even larger scales.

Emergent abilities

One of the most fascinating aspects of scaling is the emergence of qualitatively new abilities at specific scale thresholds. These aren't predicted by the smooth power law of loss — they appear suddenly:

1. In-context learning: Barely works at 1.3B, works well at 13B, works reliably at 175B.

2. Arithmetic: Two-digit addition at 6.7B, three-digit at 175B.

3. Word unscrambling: Impossible below 6.7B, reasonable at 175B.

4. Code generation: Barely functional below 13B, surprisingly capable at 175B.

The emergence mystery: Smooth scaling of loss doesn't imply smooth scaling of capabilities. A capability might require the model to compose multiple skills (e.g., arithmetic requires digit manipulation + carry propagation + ordering). Each sub-skill improves smoothly with scale, but the composite capability only "works" when ALL sub-skills cross a usability threshold simultaneously. This creates the appearance of sudden emergence.
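
The composition argument can be illustrated with a toy model. Everything numeric here is made up for illustration: the logistic curve, its midpoint and steepness, and the assumption that a task needs ten independent steps to all succeed.

```python
import math

def subskill_accuracy(log10_params, midpoint=9.0, steepness=2.0):
    """Hypothetical smooth (logistic) per-step accuracy as a function of log10(parameters)."""
    return 1 / (1 + math.exp(-steepness * (log10_params - midpoint)))

def composite_accuracy(log10_params, num_steps):
    """A task succeeds only if every one of num_steps independent steps succeeds."""
    return subskill_accuracy(log10_params) ** num_steps

for n_params in (125e6, 1.3e9, 13e9, 175e9):
    x = math.log10(n_params)
    print(f"{n_params:>8.2e}  one step: {subskill_accuracy(x):.2f}  "
          f"ten steps: {composite_accuracy(x, 10):.3f}")
```

The single-step curve rises gradually across the four sizes, while the ten-step curve stays near zero for the smaller models and then climbs steeply, even though nothing discontinuous happens underneath.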

What is a "scaling law" in the context of GPT-3, and why was it critical for OpenAI's decision to train a 175B model?

1. A power law relationship (L ∝ N^-α) between model size and loss that produces a straight line on log-log plots. OpenAI used small model experiments to extrapolate and predict that 175B would achieve a specific loss, justifying the multi-million-dollar training investment

2. A rule that says larger models are always better, so OpenAI trained the largest model they could afford

3. A law governing how fast the model can generate tokens at different sizes

Chapter 6: Benchmark Results

GPT-3 was evaluated on more than two dozen NLP datasets, plus several novel tasks, spanning language modeling, question answering, translation, reading comprehension, common sense reasoning, and more. The results told a consistent story: few-shot performance improves dramatically with scale.

Language modeling

On the LAMBADA benchmark (predict the last word of a passage requiring long-range context), GPT-3 set a new state of the art:

Model	LAMBADA Acc	Method
GPT-2	63.2%	Zero-shot
BERT-Large	52.0%	Fine-tuned
GPT-3 175B	86.4%	Few-shot
GPT-3 175B	76.2%	Zero-shot

The LAMBADA result is particularly striking because GPT-3's few-shot approach requires no task-specific training at all — just a handful of examples in the prompt.

Question answering

On closed-book QA (no external documents, the model must answer from memory), GPT-3 was competitive with systems that had access to retrieval databases:

Benchmark	Previous SOTA	GPT-3 Few-Shot	GPT-3 Zero-Shot
TriviaQA	68.0% (open-book)	71.2%	64.3%
NaturalQuestions	36.6% (closed-book)	29.9%	14.6%
WebQuestions	37.4% (closed-book)	41.5%	14.4%

The model as a knowledge base. On TriviaQA, GPT-3's few-shot score (71.2%) surpasses the previous best system that used a full retrieval pipeline (68.0%). This means the model's 175B parameters contain enough factual knowledge from training to outperform a dedicated search system. The weights themselves are the database.

Translation

GPT-3 can translate between languages despite being trained predominantly on English text. Its few-shot translation performance varies by language pair:

Direction	GPT-3 Few-Shot BLEU	Supervised SOTA BLEU
Fr → En	39.2	35.0
De → En	40.6	40.2
Ro → En	39.5	39.9
En → Fr	32.6	45.6
En → De	29.7	41.2

Two patterns emerge: (1) translation INTO English is much stronger than OUT of English (because the model saw far more English text), roughly matching or exceeding the supervised results listed here, and (2) out of English there is still a large gap to supervised systems that are trained specifically on parallel corpora. That an English-predominant LM with no dedicated translation training gets this close at all is remarkable.

Arithmetic

GPT-3 can perform arithmetic — poorly, but measurably. This is notable because the model was never explicitly trained on arithmetic:

Task	GPT-3 175B Accuracy
2-digit addition	100%
3-digit addition	80%
4-digit addition	25%
5-digit addition	9%
2-digit subtraction	98%
2-digit multiplication	29%

Arithmetic is learned, not programmed. The model has no calculator module. It "computes" by pattern-matching against arithmetic examples seen during pre-training. This explains the accuracy drop with more digits — 5-digit addition requires carrying through 5 positions, which the attention mechanism struggles with. Later work (Chain-of-Thought prompting) found that asking the model to show its work dramatically improves arithmetic accuracy.

GPT-3 Benchmark Dashboard

Click a benchmark category to see GPT-3's performance across model sizes. Watch how accuracy improves with scale — sometimes smoothly, sometimes with sharp jumps. The dashed line shows prior SOTA.

On TriviaQA, GPT-3's few-shot score (71.2%) surpassed a previous best system that used retrieval (68.0%). What does this tell us about what GPT-3's parameters contain?

1. The 175B parameters store enough factual knowledge from the 300B-token training set to function as a knowledge base: the model retrieves facts from its weights rather than from an external database

2. GPT-3 has a hidden retrieval module that searches the internet

3. GPT-3 memorized the TriviaQA test set during pre-training

Chapter 7: Limitations & Biases

GPT-3 is powerful, but its failures are just as instructive as its successes. Brown et al. were remarkably transparent about the model's shortcomings.

Repetition and coherence

When generating long text, GPT-3 tends to repeat itself, cycling back to phrases, ideas, or even exact sentences it generated earlier. This is a known weakness of autoregressive generation: each token is produced one at a time, conditioned only on the text to its left, and there's no global planning or outline.

At the document level, GPT-3 often loses the thread. It can write excellent individual paragraphs, but a 10-paragraph essay might contradict itself, change topics without transition, or repeat the introduction. The model has no concept of "what's my overall argument?" — it just predicts the next token.

Reasoning failures

Despite impressive benchmark results, GPT-3 fails at tasks requiring systematic reasoning:

Task Type	Failure Pattern	Example
Common sense physics	Ignores physical constraints	"If I put cheese in the fridge, where is the cheese?" → sometimes wrong
Causal reasoning	Confuses correlation with causation	Can't distinguish "A causes B" from "A correlates with B"
Negation	Ignores "not" in prompts	"Write a story that does NOT involve dragons" → writes about dragons
Multi-step logic	Loses track of steps	5-step deductions fail even when 2-step versions succeed

Social biases

GPT-3 inherits biases from its training data. Brown et al. conducted extensive bias analyses:

Gender bias: The model strongly associates certain occupations with specific genders. When completing "The nurse said that [pronoun]…", the model overwhelmingly uses "she." For "The CEO said…", it predominantly uses "he." These reflect statistical patterns in the training data, not reality.

Racial bias: In sentence completions about different racial groups, the model produces more negative sentiment for some groups than others. The sentiment gap is real and measurable.

Religious bias: The word "Muslim" in a prompt produces disproportionately more violent or terrorism-related completions compared to other religious groups. This reflects the biased coverage in internet text, not any ground truth.

Bias is not a bug — it's a mirror. GPT-3's biases reflect the statistical patterns in 300 billion tokens of internet text. The internet is not a neutral record of the world; it overrepresents certain viewpoints, demographics, and stereotypes. A model that perfectly learns these statistics will perfectly reproduce these biases. This is why post-training alignment (RLHF, constitutional AI) became critical in later models.

Environmental cost

Training GPT-3 consumed an estimated 3.15 × 10²³ FLOPs. Translated to energy:

python
# Environmental cost estimate (a lower bound: assumes peak GPU throughput
# and ignores cooling and other datacenter overhead)
flops = 3.15e23
gpu_hours = flops / 120e12 / 3600       # ≈ 729K V100-hours at ~120 TFLOPS
watts_per_gpu = 300                     # V100 TDP
kwh = gpu_hours * watts_per_gpu / 1000  # ≈ 219 MWh
co2_kg = kwh * 0.429                    # kg CO₂ per kWh, US average grid intensity
print(f"{co2_kg/1000:.0f} metric tons CO₂")  # ≈ 94 tons
# For comparison: one transatlantic flight ≈ 1.6 tons per passenger
# GPT-3 training ≈ 59 transatlantic flights

Bias Detection Simulator

This simulator shows how GPT-3's occupation completions differ by gender. Each bar shows the probability of a gendered pronoun following "The [occupation] said that [pronoun]...". Notice how strongly certain occupations are associated with specific genders in the model's learned distribution.

nurse — she: 88%, he: 12%

What GPT-3 can't do

Brown et al. explicitly identified several failure modes:

1. Cannot learn new tasks from natural language instructions alone (zero-shot is unreliable).

2. Cannot perform well on tasks requiring precise symbolic manipulation (long arithmetic, formal logic).

3. Cannot reliably refuse to generate harmful content.

4. Cannot cite sources — it generates plausible-sounding text that may be factually wrong.

5. Cannot update its knowledge — its knowledge is frozen at training time.

Why does GPT-3 exhibit gender bias in occupation-related completions (e.g., "nurse" → "she", "CEO" → "he")?

1. Because GPT-3 was programmed with gender stereotypes

2. Because the model learned statistical correlations from 300B tokens of internet text, which overrepresents certain gender-occupation associations. The model mirrors training data biases, not ground truth

3. Because the model is too small to learn unbiased representations

Chapter 8: GPT-3 Explorer

Time to put it all together. This interactive explorer lets you experience GPT-3's in-context learning, see how scale affects capability, and experiment with different prompt strategies.

In-Context Learning Playground

Build a few-shot prompt step by step. Add examples (each one is a demonstration), then see how the simulated model's confidence changes. Toggle between model sizes to see how scale affects in-context learning. The attention visualization shows which examples the model attends to most when making its prediction.

As you add more examples, notice three things:

1. Confidence increases — the model becomes more certain about the pattern with more demonstrations.

2. Larger models benefit more — the gap between 175B and 1.3B widens as you add examples. Small models can't effectively use the additional context.

3. Attention concentrates — the model's attention focuses on the most relevant examples, especially those closest to the test input in terms of surface features.
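The prompt the playground assembles can be sketched as plain string concatenation: each demonstration is an input/label pair, and the test input uses the same template with the label left blank. This is a minimal sketch, not the playground's actual code. The function name `build_few_shot_prompt`, the `Review:` / `Sentiment:` template, and the example reviews are all assumptions made for illustration.

```python
def build_few_shot_prompt(demos, query, instruction=""):
    """Join (input, label) demonstrations and an unlabeled query into one prompt."""
    blocks = [instruction] if instruction else []
    for text, label in demos:
        blocks.append(f"Review: {text}\nSentiment: {label}")
    # The query repeats the same template but leaves the label blank,
    # so the model's continuation is its prediction.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

# Hypothetical demonstrations; adding more pairs here is "adding examples".
demos = [
    ("A warm, funny, beautifully shot film.", "positive"),
    ("Two hours I will never get back.", "negative"),
]
prompt = build_few_shot_prompt(demos, "The plot dragged but the acting was superb.")
print(prompt)
```

Nothing about the model changes between zero-shot, one-shot, and few-shot use; only the length of `demos` does. All of the "learning" happens in the forward pass over this string.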

Scaling Across Tasks

This multi-task dashboard shows how GPT-3 performs across different capability categories as model size increases. Click tasks to add/remove them from the plot. Watch for emergent abilities — capabilities that suddenly appear at specific scale thresholds.

The key takeaway from this explorer: In-context learning is not magic — it's pattern recognition at scale. Small models can recognize simple patterns (sentiment). Large models can recognize complex patterns (multi-step reasoning, code generation). The "intelligence" emerges from the interaction between massive parametric knowledge and the pattern structure of the prompt.

As you add more few-shot examples to a GPT-3 prompt, what happens to the performance gap between the 175B and 1.3B model?

The gap shrinks because both models learn the same pattern

The gap widens: larger models benefit MORE from additional examples because they can better identify and apply complex patterns from in-context demonstrations

The gap stays constant regardless of example count

Chapter 9: Connections

GPT-3 sits at a pivotal point in the history of NLP and AI. Understanding its connections to prior and subsequent work illuminates why it mattered.

What came before

Paper	Contribution	Relationship to GPT-3
Attention Is All You Need (2017)	The Transformer architecture	GPT-3's architecture — scaled to 96 layers
BERT (2018)	Bidirectional pre-training	The "fine-tune everything" paradigm GPT-3 challenged
GPT-2 (2019)	Showed language models can multitask	GPT-3 is GPT-2 scaled up over 100x in parameters (1.5B to 175B), with rigorous evaluation
Scaling Laws (Kaplan 2020)	Power law predictions	Justified the investment in 175B parameters

What came after

Paper	How It Extended GPT-3
Chain-of-Thought (Wei 2022)	Improved GPT-3's reasoning by prompting it to "show its work"
InstructGPT (Ouyang 2022)	Fine-tuned GPT-3 with RLHF — the foundation for ChatGPT
Chinchilla (Hoffmann 2022)	Challenged GPT-3's scaling allocation — argued for more data, fewer params
LoRA (Hu 2022)	Efficient fine-tuning of GPT-3-scale models
PaLM (Chowdhery 2022)	Google's 540B model — continued the scaling direction
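The Chinchilla row is easier to appreciate with a quick ratio of training tokens to parameters. The sketch below uses the GPT-3 figures from this text (175B parameters, about 300B tokens) and, as an outside assumption, the headline Chinchilla configuration reported by Hoffmann et al. (70B parameters, about 1.4T tokens). The helper name `tokens_per_param` is made up for this example.

```python
def tokens_per_param(train_tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return train_tokens / params

# GPT-3: 175B parameters, about 300B training tokens (figures from this text).
gpt3_ratio = tokens_per_param(300e9, 175e9)
# Chinchilla: 70B parameters, about 1.4T training tokens (Hoffmann et al. 2022).
chinchilla_ratio = tokens_per_param(1.4e12, 70e9)

print(f"GPT-3:      {gpt3_ratio:.1f} tokens per parameter")
print(f"Chinchilla: {chinchilla_ratio:.1f} tokens per parameter")
```

On these numbers GPT-3 saw fewer than 2 tokens per parameter while Chinchilla saw about 20, which is the sense in which Chinchilla argued for more data and fewer parameters at a given compute budget.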

The paradigm shift

GPT-3 catalyzed a fundamental shift in how we think about AI systems:

Before GPT-3:

One model per task. Collect data → train → deploy → repeat.

ML engineers design features.

Users can't interact with models directly.

After GPT-3:

One model, many tasks. Write a prompt → get an answer.

Users design prompts.

Natural language is the interface.

The GPT-3 legacy: GPT-3 proved that scale is a path to generality. By making a model large enough and training it on enough data, you don't need to engineer task-specific solutions — general capabilities emerge. This insight drove the entire field toward larger models, more data, and better training methods. Every "foundation model" since — PaLM, LLaMA, Claude, Gemini — traces its lineage to GPT-3's demonstration that scale works.

"What I cannot predict from small-scale experiments, I do not understand well enough to scale." — paraphrasing Feynman, via Kaplan et al.

What is the most lasting paradigm shift introduced by GPT-3?

The shift from "one model per task" (collect data, fine-tune, deploy) to "one model, many tasks" (write a prompt), making natural language itself the programming interface for AI

The use of Transformer architecture for NLP tasks

The use of web scraping for training data collection