Language Models are Few-Shot Learners — scaling to 175 billion parameters unlocks in-context learning: just show examples in the prompt, no gradient updates needed.
Imagine you hire two translators. Translator A studied French for one year in school — they know the grammar rules, the common vocab, but when you hand them a legal contract, they stumble. Translator B grew up bilingual, read thousands of books, lived in Paris for a decade — you hand them the same contract and they nail it without any special preparation. What's the difference? Translator B has so much experience that novel tasks become easy.
Before GPT-3, the standard recipe for NLP was: pre-train a language model on lots of text, then fine-tune it on a labeled dataset for each specific task. Want sentiment analysis? Fine-tune on sentiment labels. Want translation? Fine-tune on parallel corpora. Want question answering? Fine-tune on QA pairs. Each task required its own labeled dataset, its own training run, and its own deployed model.
This works, but it has three fundamental problems:
| Problem | Why It Hurts |
|---|---|
| Data dependency | Every new task needs thousands of labeled examples. For rare or specialized tasks, these may not exist. |
| Narrow generalization | A model fine-tuned on sentiment can't do translation. You need a separate model (and separate compute) for each task. |
| Spurious correlations | Fine-tuning on small datasets lets models exploit shortcuts — learning "negation words = negative sentiment" rather than understanding meaning. |
GPT-3 asked a radical question: what if we just make the model big enough that it can do tasks without any fine-tuning at all?
The result was startling. GPT-3, with no task-specific training at all, could translate languages, answer trivia, write code, solve analogies, and generate coherent articles — just from being shown a few examples in the prompt. On some benchmarks, this "few-shot" approach matched or beat models that had been explicitly fine-tuned on thousands of labeled examples.
This wasn't just an incremental improvement. It was a paradigm shift. Instead of "collect data → train model → deploy model" for every task, you could just write a prompt. The model became a general-purpose tool that could be steered with language itself.
Drag the slider to scale up model parameters. Watch how capabilities emerge at different thresholds. Small models can barely do anything without fine-tuning. As you approach 175B, few-shot performance on diverse tasks suddenly becomes viable.
GPT-3's most important contribution isn't its size — it's the discovery that large language models can learn tasks from examples embedded in the prompt, without any weight updates. This is in-context learning (ICL).
Here's how it works. Instead of fine-tuning the model on a labeled dataset, you construct a prompt that demonstrates the task with a few examples, then ask the model to continue the pattern:
```
# Few-shot prompt for sentiment classification
Review: "This movie was absolutely fantastic!"
Sentiment: Positive

Review: "Terrible acting, boring plot."
Sentiment: Negative

Review: "A masterpiece of modern cinema."
Sentiment:
```
The model has never been trained on sentiment labels. It has never seen a loss function that says "positive" or "negative." Yet it completes the prompt with "Positive" — because it has learned the pattern from the examples in the prompt and can continue it.
Brown et al. defined three levels of in-context learning:
| Setting | Examples in Prompt | Description |
|---|---|---|
| Zero-shot | 0 | Just the task instruction: "Translate English to French: cheese →" |
| One-shot | 1 | One example + the new input |
| Few-shot | 10–100 | Multiple examples + the new input (limited only by context window) |
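The three settings differ only in how many demonstrations are placed before the new input, which a small helper can make concrete. This is a minimal sketch: the `build_prompt` function, the `=>` separator, and the demonstration pairs are illustrative choices, not a format GPT-3 requires.

```python
def build_prompt(instruction, demonstrations, query, sep=" => "):
    """Assemble a zero-, one-, or few-shot prompt as plain text.

    demonstrations is a list of (input, output) pairs: an empty list gives
    the zero-shot setting, one pair gives one-shot, several give few-shot.
    """
    lines = [instruction]
    for source, target in demonstrations:
        lines.append(f"{source}{sep}{target}")
    # The final line is left open for the model to complete.
    lines.append(f"{query}{sep}".rstrip())
    return "\n".join(lines)


demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]
instruction = "Translate English to French:"

zero_shot = build_prompt(instruction, [], "cheese")
one_shot = build_prompt(instruction, demos[:1], "cheese")
few_shot = build_prompt(instruction, demos, "cheese")
print(few_shot)
```

In every case the model weights stay fixed; only the text of the prompt changes.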
The critical finding: performance scales smoothly with both model size and number of examples. More examples help, and bigger models benefit more from additional examples. A 175B-parameter model with 50 examples can outperform a 1.3B model with the same examples by a large margin — the larger model is simply better at identifying and applying the pattern.
Why in-context learning works at all is still an active research question. The leading hypotheses:
1. Meta-learning during pre-training. During pre-training on internet text, the model encounters many naturally occurring "few-shot" patterns. Blog posts that define a term and then use it. Translation examples in language-learning websites. Q&A forums where multiple questions are answered in sequence. The model implicitly learns to recognize and continue these patterns.
2. Induction heads. Olsson et al. (2022) identified specific attention patterns called induction heads — pairs of attention heads that implement a "match and copy" behavior. One head finds a previous occurrence of the current token, another copies the token that followed. This mechanism can identify patterns in few-shot examples and apply them.
3. Implicit gradient descent. Akyürek et al. (2023) showed that Transformers can implement gradient descent internally — the forward pass through the attention layers implicitly performs optimization steps on the in-context examples. The model is effectively "training itself" on your examples during inference.
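The "match and copy" behavior attributed to induction heads can be imitated in a few lines of ordinary code. This is only an analogy for the behavior, written as a plain list lookup; it is not how attention layers compute it, and the function name `induction_predict` is made up for this sketch.

```python
def induction_predict(tokens):
    """Toy 'match and copy' rule.

    Find the most recent earlier occurrence of the last token and return the
    token that followed it. Returns None when the last token has not appeared
    before.
    """
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the last token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


sequence = ["A", "B", "C", "D", "A"]
print(induction_predict(sequence))  # B
```

Applied to a few-shot prompt, the same rule explains why a label that followed `Sentiment:` earlier in the context is a strong candidate for what follows `Sentiment:` at the end.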
Watch how adding more examples in the prompt improves GPT-3's accuracy on a task. Each bar shows model accuracy. Drag the slider to add examples and see performance improve. Notice how larger models benefit more from additional examples.
ICL fundamentally changes the interface between humans and AI. Instead of writing code or collecting labeled data, you write prompts. The prompt is the program, and the model is the interpreter. This insight — that language itself can serve as a programming language for neural networks — is arguably the most consequential idea in GPT-3.
```python
# Traditional ML: collect data, train a model
# (load_dataset, train, base_model, and gpt3 are placeholders, not a real API)
dataset = load_dataset("sentiment")            # need labeled data
model = train(base_model, dataset, epochs=3)   # need GPU hours
result = model.predict("Great movie!")         # single task

# GPT-3 ICL: just write a prompt
prompt = """Review: "Loved it!" → Positive
Review: "Hated it." → Negative
Review: "Great movie!" →"""
result = gpt3.complete(prompt)                 # no training needed
# Returns: " Positive"
```
To understand why few-shot learning matters, you need to see exactly how it compares to the alternatives. Brown et al. set up a systematic comparison across dozens of benchmarks.
The key comparison is between three regimes: fine-tuning on task-specific labels, few-shot prompting, and zero-shot prompting.
The headline result: GPT-3 few-shot sometimes matches or exceeds fine-tuned BERT-Large, despite having never been trained on the task-specific data. This is remarkable because BERT-Large was explicitly fine-tuned on thousands of labeled examples for each task.
| Task | BERT-Large FT | GPT-3 Few-Shot | GPT-3 Zero-Shot |
|---|---|---|---|
| TriviaQA | — | 71.2% | 64.3% |
| LAMBADA | — | 86.4% | 76.2% |
| StoryCloze | 87.4% | 87.7% | 83.2% |
| SuperGLUE | 69.0% | 71.8% | 58.9% |
Few-shot wins when:
1. The task is common in internet text (translation, Q&A, summarization).
2. The label space is simple (positive/negative, yes/no, a named entity).
3. You have very little labeled data (< 100 examples).
4. You need to switch tasks rapidly (many tasks, single model).
Fine-tuning wins when:
1. You have large labeled datasets (> 10K examples).
2. The task requires precise, structured outputs (NER, parsing).
3. Maximum accuracy is critical (medical, legal applications).
4. The domain is specialized and rare in pre-training data.
Drag the slider to change the number of available labeled examples. The chart shows when few-shot (orange) overtakes fine-tuning (teal) as labeled data decreases. At very few examples, fine-tuning overfits and few-shot wins.
One critical concern: is GPT-3's test performance genuine or did it see the test data during pre-training? With 300 billion tokens of training data scraped from the internet, some benchmark test sets inevitably leaked into the training corpus.
Brown et al. studied this by measuring overlap between their training data and benchmark test sets. They found contamination on some benchmarks but argued that removing contaminated examples had minimal effect on most results. Still, this concern foreshadowed a major issue in the field: as training corpora grow, maintaining clean evaluation becomes increasingly difficult.
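```python
# Data contamination check: n-gram overlap between train and test
def get_ngrams(text, n):
    """Return all word-level n-grams in text as tuples."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def check_contamination(train_docs, test_example, n=13):
    """Flag test_example if most of its n-grams also appear in one training doc."""
    test_ngrams = set(get_ngrams(test_example, n))
    if not test_ngrams:  # example shorter than n words: nothing to compare
        return False
    for doc in train_docs:
        train_ngrams = set(get_ngrams(doc, n))
        overlap = test_ngrams & train_ngrams
        if len(overlap) / len(test_ngrams) > 0.7:
            return True  # likely contaminated
    return False
```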
GPT-3 is architecturally simple — it's the same decoder-only Transformer as GPT-2, just scaled up massively. The core insight is that the architecture doesn't need to change; you just need more of it.
The full model family spans three orders of magnitude:
| Model | Params | Layers | `d_model` | Heads | `d_head` | Context |
|---|---|---|---|---|---|---|
| GPT-3 Small | 125M | 12 | 768 | 12 | 64 | 2048 |
| GPT-3 Medium | 350M | 24 | 1024 | 16 | 64 | 2048 |
| GPT-3 Large | 760M | 24 | 1536 | 16 | 96 | 2048 |
| GPT-3 XL | 1.3B | 24 | 2048 | 24 | 128 | 2048 |
| GPT-3 2.7B | 2.7B | 32 | 2560 | 32 | 80 | 2048 |
| GPT-3 6.7B | 6.7B | 32 | 4096 | 32 | 128 | 2048 |
| GPT-3 13B | 13B | 40 | 5140 | 40 | 128 | 2048 |
| GPT-3 175B | 175B | 96 | 12288 | 96 | 128 | 2048 |
Each of the 96 layers follows the standard GPT-2 block structure, with one key modification to the attention pattern.
Like GPT-2, GPT-3 uses pre-norm (LayerNorm before each sub-layer) rather than the post-norm (LayerNorm after) of the original Transformer. Pre-norm is generally more stable for training very deep networks: with post-norm, gradients in the first layers can explode or vanish when the network is 96 layers deep.
Sparse attention is the change from GPT-2. Every other layer uses a locally banded attention pattern, in which each token attends only to its nearby neighbors (within a window) rather than to all previous tokens. This reduces the quadratic cost of attention in those layers while still allowing long-range information to propagate through the dense layers.
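To make the difference between dense and locally banded attention concrete, the sketch below builds both masks for a tiny sequence. The window size of 3 and the sequence length of 8 are arbitrary values chosen for readability, not GPT-3's settings, and the function names are hypothetical.

```python
def causal_mask(n):
    """Dense causal mask: position i may attend to every position j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def banded_causal_mask(n, window):
    """Locally banded causal mask: position i attends only to the `window`
    most recent positions, itself included."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


n, window = 8, 3
dense = causal_mask(n)
banded = banded_causal_mask(n, window)

print("dense pairs: ", sum(map(sum, dense)))   # grows like n^2 / 2
print("banded pairs:", sum(map(sum, banded)))  # grows like n * window
for row in banded:
    print("".join("x" if allowed else "." for allowed in row))
```

The number of allowed query-key pairs in the banded mask grows linearly with sequence length for a fixed window, which is where the cost saving comes from.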
```python
# Where do 175B parameters come from?
d = 12288    # hidden dimension
L = 96       # layers
V = 50257    # vocab size (BPE tokens)
ctx = 2048   # context length

# Token embeddings: V × d
emb = V * d                        # ≈ 617M
# Position embeddings: ctx × d
pos = ctx * d                      # ≈ 25M
# Per-layer attention: 4 × d² (Q, K, V, output projections)
attn_per_layer = 4 * d * d         # ≈ 604M
# Per-layer FFN: 2 × d × 4d (up + down projections)
ffn_per_layer = 2 * d * (4 * d)    # ≈ 1.21B
# Per-layer LayerNorm: 2 × 2d (2 norms, each with scale + bias)
ln_per_layer = 2 * 2 * d           # ≈ 49K (negligible)

total = emb + pos + L * (attn_per_layer + ffn_per_layer + ln_per_layer)
print(f"Total: {total/1e9:.1f}B")  # ≈ 174.6B ✓
```
Click through the model's layers. Each layer shows attention (teal) and FFN (orange) with their parameter counts. The bar on the right shows total parameter distribution. Click a layer to see its tensor shapes.
A 175B-parameter model is only as good as the data it learns from. GPT-3 was trained on approximately 300 billion tokens from a carefully curated mix of internet data.
| Dataset | Tokens (B) | Weight in Mix | Epochs | Description |
|---|---|---|---|---|
| Common Crawl (filtered) | 410 | 60% | 0.44 | Filtered web text — 45 TB raw → 570 GB after quality filtering |
| WebText2 | 19 | 22% | 2.9 | Expanded GPT-2 training set — Reddit links with ≥3 upvotes |
| Books1 | 12 | 8% | 1.9 | Internet-based books corpus |
| Books2 | 55 | 8% | 0.43 | Internet-based books corpus |
| Wikipedia | 3 | 3% | 3.4 | English Wikipedia |
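Raw Common Crawl is extremely noisy, full of spam, boilerplate, duplicates, and garbage. OpenAI applied a multi-step filtering pipeline: quality filtering with a learned classifier, fuzzy deduplication at the document level, and the addition of known high-quality corpora to the training mix.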
The quality classifier is a simple but effective trick. By training a binary classifier to distinguish "curated web text" (Reddit-upvoted links) from "random web text" (Common Crawl), OpenAI created an automatic quality filter. Documents that look like something a human found worth sharing get higher scores. This filtered Common Crawl from 45 TB of raw text down to 570 GB — keeping only about 1.3%.
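A filter like this does not have to be a hard threshold. As I read the paper's appendix, documents were kept stochastically, with a rule of the form `pareto(α) > 1 − score` and α = 9, so that high-scoring documents are usually kept and low-scoring ones occasionally slip through. The sketch below imitates that rule with the standard library; the scores are made up, and `keep_document` is a hypothetical name.

```python
import random

ALPHA = 9  # shape parameter; value as recalled from the paper's appendix


def keep_document(score, rng):
    """Stochastic keep rule for a classifier quality score in [0, 1].

    rng.paretovariate(a) returns values >= 1, so subtracting 1 gives the
    zero-based Pareto distribution that numpy.random.pareto samples.
    """
    return (rng.paretovariate(ALPHA) - 1.0) > (1.0 - score)


rng = random.Random(0)
trials = 10_000
for score in (0.1, 0.5, 0.9):
    kept = sum(keep_document(score, rng) for _ in range(trials))
    print(f"score={score:.1f}: kept {kept / trials:.1%} of documents")
```

Keeping a small fraction of low-scoring documents trades a little noise for coverage of text that the classifier considers out of distribution.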
GPT-3 uses the same byte-level BPE (Byte Pair Encoding) tokenizer as GPT-2, with a vocabulary of 50,257 tokens. The tokenizer works at the byte level, so it can represent any text (including Unicode) without unknown tokens.
```python
# GPT-3 tokenization example
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-3 uses the same tokenizer
text = "Language models are few-shot learners."
tokens = enc.encode(text)
print(tokens)                             # a list of integer token IDs
print([enc.decode([t]) for t in tokens])  # the text piece behind each ID ('Language', ' models', ...)
print(len(tokens))                        # about 8 tokens for 6 words

# Each token = integer ID in [0, 50256]
# Context window = 2048 tokens ≈ 1500 words
# Few-shot examples must fit within this window!
```
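The context limit puts a hard cap on how many demonstrations a prompt can carry. The sketch below estimates that cap without a tokenizer, using the rough ratio of 2048 tokens to 1500 words. The word-count heuristic, the 16-token reserve for the answer, and the function names are all assumptions for illustration; a real budget should count tokens with the actual tokenizer.

```python
CONTEXT_TOKENS = 2048
WORDS_PER_TOKEN = 1500 / 2048  # rough ratio: 2048 tokens is about 1500 words


def estimate_tokens(text):
    """Very rough token estimate from a whitespace word count."""
    return max(1, round(len(text.split()) / WORDS_PER_TOKEN))


def max_demonstrations(demo, query, reserve_for_answer=16):
    """How many demonstrations of typical size `demo` fit before `query`."""
    budget = CONTEXT_TOKENS - estimate_tokens(query) - reserve_for_answer
    return max(0, budget // estimate_tokens(demo))


demo = 'Review: "Terrible acting, boring plot." Sentiment: Negative'
query = 'Review: "A masterpiece of modern cinema." Sentiment:'
print(max_demonstrations(demo, query))
```

Short classification examples fit by the hundreds, while tasks with paragraph-length inputs may only allow a handful of demonstrations.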
This visualization shows the proportion of each data source in GPT-3's training mixture. The left bars show raw token counts; the right bars show sampling weights. Notice how quality sources are upsampled relative to their size.
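The link between sampling weight and epochs can be checked directly: the number of passes over a dataset is the number of tokens drawn from it divided by the number of tokens it contains. The sketch below uses the token counts and weights from the table above and the 300 billion token total. Because the published weights are rounded, the results only approximately reproduce the Epochs column.

```python
TOTAL_TRAINING_TOKENS = 300e9

# (dataset, tokens in billions, sampling weight), taken from the table above
MIXTURE = [
    ("Common Crawl (filtered)", 410, 0.60),
    ("WebText2", 19, 0.22),
    ("Books1", 12, 0.08),
    ("Books2", 55, 0.08),
    ("Wikipedia", 3, 0.03),
]


def epochs_seen(tokens_billions, weight, total=TOTAL_TRAINING_TOKENS):
    """Passes over a dataset = tokens drawn from it / tokens it contains."""
    return weight * total / (tokens_billions * 1e9)


for name, size, weight in MIXTURE:
    print(f"{name:<24} {epochs_seen(size, weight):.2f} epochs")
```

Small curated sources are seen several times while Common Crawl is seen less than once, which is what upsampling by quality means in practice.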
Perhaps the most impactful finding from the GPT-3 work (and its predecessor, Kaplan et al. 2020) is that language model performance follows remarkably smooth power laws as you scale up model size, data, and compute.
A power law takes the form:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha}$$

where $L$ is the loss, $N$ is the number of parameters, $N_c$ is a constant, and $\alpha$ is the scaling exponent. When you plot loss vs parameters on a log-log scale, you get a straight line with slope $-\alpha$. This means every 10x increase in parameters multiplies the loss by the same constant factor, $10^{-\alpha}$.
Kaplan et al. (2020) found three independent scaling relationships:
| Scaling Axis | Power Law | Exponent α | Implication |
|---|---|---|---|
| Parameters (N) | $L(N) \propto N^{-0.076}$ | 0.076 | 10x more params → loss drops by ~16% |
| Data (D) | $L(D) \propto D^{-0.095}$ | 0.095 | 10x more data → loss drops by ~20% |
| Compute (C) | $L(C) \propto C^{-0.050}$ | 0.050 | 10x more compute → loss drops by ~11% |
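The "Implication" column follows directly from the exponents: if loss is proportional to $x^{-\alpha}$, scaling $x$ by a factor $k$ multiplies the loss by $k^{-\alpha}$. A short sketch of that arithmetic, where the function name `loss_ratio` is just a label for this example:

```python
EXPONENTS = {"parameters": 0.076, "data": 0.095, "compute": 0.050}


def loss_ratio(scale_factor, alpha):
    """Ratio new_loss / old_loss when the scaled quantity grows by
    scale_factor, assuming loss is proportional to x ** -alpha."""
    return scale_factor ** -alpha


for axis, alpha in EXPONENTS.items():
    drop = 1 - loss_ratio(10, alpha)
    print(f"10x {axis}: loss falls by about {drop:.0%}")
```

The same function shows how slowly the returns accumulate: a 1000x increase in parameters gives `loss_ratio(1000, 0.076)`, a reduction of roughly 40%, not 3 × 16%.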
The key insight concerns how to spend extra compute. Kaplan et al. concluded that most of any increase should go into model size rather than data: for a 10x larger compute budget, they recommended growing the model about 5.5x and the training data only about 1.8x. (Later, the Chinchilla paper challenged this, arguing for roughly equal scaling.)
Given a fixed compute budget C (measured in FLOPs), how should you divide it between model size N and training data D? The relationship is:

$$C \approx 6ND$$

where C is in FLOPs, N is parameters, and D is training tokens. The factor of 6 comes from ~2 FLOPs per parameter per token for the forward pass (one multiply and one add) and ~4 FLOPs per parameter per token for the backward pass (about 2 for gradients with respect to the activations and 2 for gradients with respect to the weights).
```python
# Compute budget for GPT-3 training
N = 175e9          # 175 billion parameters
D = 300e9          # 300 billion tokens
C = 6 * N * D      # ≈ 3.15 × 10²³ FLOPs

# In GPU-hours (V100 at ~120 TFLOPS mixed precision, i.e. near peak throughput)
v100_flops = 120e12  # floating-point operations per second
hours = C / v100_flops / 3600
print(f"{hours/1000:.0f}K V100-hours")  # ≈ 729K V100-hours

# At ~$1/GPU-hour, that's ~$729K in compute
# Spread across a cluster of ~1000 V100s, wall-clock time = ~1 month
# (assuming peak utilization; real training runs achieve well below peak)
```
This log-log plot shows loss vs parameters for GPT-3's model family. The straight line on a log-log plot means a power law. Each dot is one of the 8 model sizes. Drag the slider to extend the line and predict performance at even larger scales.
One of the most fascinating aspects of scaling is the emergence of qualitatively new abilities at specific scale thresholds. These aren't predicted by the smooth power law of loss — they appear suddenly:
1. In-context learning: Barely works at 1.3B, works well at 13B, works reliably at 175B.
2. Arithmetic: Two-digit addition at 6.7B, three-digit at 175B.
3. Word unscrambling: Impossible below 6.7B, reasonable at 175B.
4. Code generation: Barely functional below 13B, surprisingly capable at 175B.
GPT-3 was evaluated on over 30 benchmarks spanning language modeling, question answering, translation, reading comprehension, common sense reasoning, and more. The results told a consistent story: few-shot performance improves dramatically with scale.
On the LAMBADA benchmark (predict the last word of a passage requiring long-range context), GPT-3 set a new state of the art:
| Model | LAMBADA Acc | Method |
|---|---|---|
| GPT-2 | 63.2% | Zero-shot |
| BERT-Large | 52.0% | Fine-tuned |
| GPT-3 175B | 86.4% | Few-shot |
| GPT-3 175B | 76.2% | Zero-shot |
The LAMBADA result is particularly striking because GPT-3's few-shot approach requires no task-specific training at all — just a handful of examples in the prompt.
On closed-book QA (no external documents, the model must answer from memory), GPT-3 was competitive with systems that had access to retrieval databases:
| Benchmark | Previous SOTA | GPT-3 Few-Shot | GPT-3 Zero-Shot |
|---|---|---|---|
| TriviaQA | 68.0% (open-book) | 71.2% | 64.3% |
| NaturalQuestions | 36.6% (closed-book) | 29.9% | 14.6% |
| WebQuestions | 37.4% (closed-book) | 41.5% | 14.4% |
GPT-3 can translate between languages despite being trained predominantly on English text. Its few-shot translation performance varies by language pair:
| Direction | GPT-3 Few-Shot BLEU | Supervised SOTA BLEU |
|---|---|---|
| Fr → En | 39.2 | 35.0 |
| De → En | 40.6 | 40.2 |
| Ro → En | 39.5 | 39.9 |
| En → Fr | 32.6 | 45.6 |
| En → De | 29.7 | 41.2 |
| En → Ro | 21.0 | 38.5 |
Two patterns emerge: (1) translation INTO English is much stronger than OUT of English (because the model saw far more English text), roughly matching or beating the supervised baselines listed here, and (2) when translating out of English there's still a large gap to supervised systems that are trained specifically on parallel corpora. But the fact that any translation works at all from an English-predominant LM is remarkable.
GPT-3 can perform arithmetic — poorly, but measurably. This is notable because the model was never explicitly trained on arithmetic:
| Task | GPT-3 175B Accuracy |
|---|---|
| 2-digit addition | 100% |
| 3-digit addition | 80% |
| 4-digit addition | 25% |
| 5-digit addition | 9% |
| 2-digit subtraction | 98% |
| 2-digit multiplication | 29% |
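Numbers like these come from posing many randomly generated problems as text and scoring the completions by exact match. The sketch below shows the shape of such an evaluation. The question wording follows the paper's "What is 48 plus 76?" style, but the helper names are hypothetical and the "predictions" are hard-coded stand-ins, since no model is called here.

```python
import random


def make_addition_problem(digits, rng):
    """Return (prompt, answer) for adding two numbers of up to `digits` digits."""
    a = rng.randrange(10 ** digits)
    b = rng.randrange(10 ** digits)
    return f"Q: What is {a} plus {b}?\nA:", str(a + b)


def exact_match_accuracy(predictions, answers):
    """Fraction of predictions equal to the reference after stripping spaces."""
    hits = sum(p.strip() == a for p, a in zip(predictions, answers))
    return hits / len(answers)


rng = random.Random(0)
problems = [make_addition_problem(2, rng) for _ in range(5)]
answers = [answer for _, answer in problems]

# Stand-in for model output: four correct completions and one wrong one.
fake_predictions = [" " + a for a in answers[:4]] + [" " + answers[-1] + "9"]
print(exact_match_accuracy(fake_predictions, answers))  # 0.8
```

Exact match is unforgiving: a completion with one wrong digit scores zero, which is part of why accuracy falls so steeply as the number of digits grows.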
Click a benchmark category to see GPT-3's performance across model sizes. Watch how accuracy improves with scale — sometimes smoothly, sometimes with sharp jumps. The dashed line shows prior SOTA.
GPT-3 is powerful, but its failures are just as instructive as its successes. Brown et al. were remarkably transparent about the model's shortcomings.
When generating long text, GPT-3 tends to repeat itself, cycling back to phrases, ideas, or even exact sentences it generated earlier. This is a fundamental limitation of autoregressive models: each token is generated conditioned only on the text to its left, and there's no global planning or outline.
At the document level, GPT-3 often loses the thread. It can write excellent individual paragraphs, but a 10-paragraph essay might contradict itself, change topics without transition, or repeat the introduction. The model has no concept of "what's my overall argument?" — it just predicts the next token.
Despite impressive benchmark results, GPT-3 fails at tasks requiring systematic reasoning:
| Task Type | Failure Pattern | Example |
|---|---|---|
| Common sense physics | Ignores physical constraints | "If I put cheese into the fridge, will it melt?" → sometimes wrong |
| Causal reasoning | Confuses correlation with causation | Can't distinguish "A causes B" from "A correlates with B" |
| Negation | Ignores "not" in prompts | "Write a story that does NOT involve dragons" → writes about dragons |
| Multi-step logic | Loses track of steps | 5-step deductions fail even when 2-step versions succeed |
GPT-3 inherits biases from its training data. Brown et al. conducted extensive bias analyses:
Gender bias: The model strongly associates certain occupations with specific genders. When completing "The nurse said that [pronoun]…", the model overwhelmingly uses "she." For "The CEO said…", it predominantly uses "he." These reflect statistical patterns in the training data, not reality.
Racial bias: In sentence completions about different racial groups, the model produces more negative sentiment for some groups than others. The sentiment gap is real and measurable.
Religious bias: The word "Muslim" in a prompt produces disproportionately more violent or terrorism-related completions compared to other religious groups. This reflects the biased coverage in internet text, not any ground truth.
Training GPT-3 consumed an estimated 3.15 × 10²³ FLOPs. Translated to energy:
```python
# Environmental cost estimate
flops = 3.15e23
gpu_hours = 729_000       # V100-equivalent hours
watts_per_gpu = 300       # V100 TDP
kwh = gpu_hours * watts_per_gpu / 1000       # ≈ 219 MWh
co2_kg = kwh * 0.429      # US average grid intensity (kg CO₂ per kWh)
print(f"{co2_kg/1000:.0f} metric tons CO₂")  # ≈ 94 tons

# For comparison: one transatlantic flight ≈ 1.6 tons per passenger
# GPT-3 training ≈ 59 transatlantic flights
```
This simulator shows how GPT-3's occupation completions differ by gender. Each bar shows the probability of a gendered pronoun following "The [occupation] said that [pronoun]...". Notice how strongly certain occupations are associated with specific genders in the model's learned distribution.
Brown et al. explicitly identified several failure modes:
1. Cannot learn new tasks from natural language instructions alone (zero-shot is unreliable).
2. Cannot perform well on tasks requiring precise symbolic manipulation (long arithmetic, formal logic).
3. Cannot reliably refuse to generate harmful content.
4. Cannot cite sources — it generates plausible-sounding text that may be factually wrong.
5. Cannot update its knowledge — its knowledge is frozen at training time.
Time to put it all together. This interactive explorer lets you experience GPT-3's in-context learning, see how scale affects capability, and experiment with different prompt strategies.
Build a few-shot prompt step by step. Add examples (each one is a demonstration), then see how the simulated model's confidence changes. Toggle between model sizes to see how scale affects in-context learning. The attention visualization shows which examples the model attends to most when making its prediction.
As you add more examples, notice three things:
1. Confidence increases — the model becomes more certain about the pattern with more demonstrations.
2. Larger models benefit more — the gap between 175B and 1.3B widens as you add examples. Small models can't effectively use the additional context.
3. Attention concentrates — the model's attention focuses on the most relevant examples, especially those closest to the test input in terms of surface features.
This multi-task dashboard shows how GPT-3 performs across different capability categories as model size increases. Click tasks to add/remove them from the plot. Watch for emergent abilities — capabilities that suddenly appear at specific scale thresholds.
GPT-3 sits at a pivotal point in the history of NLP and AI. Understanding its connections to prior and subsequent work illuminates why it mattered.
| Paper | Contribution | Relationship to GPT-3 |
|---|---|---|
| Attention Is All You Need (2017) | The Transformer architecture | GPT-3's architecture — scaled to 96 layers |
| BERT (2018) | Bidirectional pre-training | The "fine-tune everything" paradigm GPT-3 challenged |
| GPT-2 (2019) | Showed language models can multitask | GPT-3 is GPT-2 scaled 100x with rigorous evaluation |
| Scaling Laws (Kaplan 2020) | Power law predictions | Justified the investment in 175B parameters |
| Paper | How It Extended GPT-3 |
|---|---|
| Chain-of-Thought (Wei 2022) | Improved GPT-3's reasoning by prompting it to "show its work" |
| InstructGPT (Ouyang 2022) | Fine-tuned GPT-3 with RLHF — the foundation for ChatGPT |
| Chinchilla (Hoffmann 2022) | Challenged GPT-3's scaling allocation — argued for more data, fewer params |
| LoRA (Hu 2022) | Efficient fine-tuning of GPT-3-scale models |
| PaLM (Chowdhery 2022) | Google's 540B model — continued the scaling direction |
GPT-3 catalyzed a fundamental shift in how we think about AI systems:
Before GPT-3:
One model per task. Collect data → train → deploy → repeat.
ML engineers design features.
Users can't interact with models directly.
After GPT-3:
One model, many tasks. Write a prompt → get an answer.
Users design prompts.
Natural language is the interface.
"What I cannot predict from small-scale experiments, I do not understand well enough to scale." — paraphrasing Feynman, via Kaplan et al.