From Cold War codebreakers to ChatGPT — the 75-year journey from word lookups to foundation models.
You speak about 16,000 words per day. Each one carries meaning, context, ambiguity. Right now, reading this sentence, your brain is performing dozens of operations simultaneously: parsing syntax, resolving pronoun references, inferring tone, predicting what comes next. You do this effortlessly, without thinking about it, thousands of times before lunch.
Now imagine building a machine that handles all of that.
Language is humanity's superpower. Christopher Manning calls it the technology that lets us "network human brains together." A single sentence like "The bank refused the loan because it was too risky" requires you to know that "it" refers to the loan, not the bank — and that "bank" here means a financial institution, not a riverbank. This kind of reasoning, which children master by age five, has stumped AI researchers for over seventy years.
Consider the word "bank." In isolation, it could mean a financial institution, a river's edge, a pool shot, or the act of tilting an airplane. Humans disambiguate instantly from context. Machines? For decades, this single problem — word sense disambiguation — was an entire subfield of NLP research.
Or take the sentence "I saw her duck." Is she crouching, or do you see her pet waterfowl? The ambiguity isn't just about words. It's structural: the sentence has two completely valid grammatical parses, each producing a different meaning. This is called syntactic ambiguity, and it haunts every NLP system ever built.
This lesson traces the 75-year arc of NLP — from naive word-by-word translation in the 1950s, through hand-built rule systems in the 1970s, statistical methods in the 1990s, and the neural revolution that produced GPT and BERT. Along the way, we'll grapple with a question that Manning poses directly: do these models actually understand language, or are they just very sophisticated pattern matchers?
Click a sentence to see how context determines word meaning. The highlighted word is the same in every sentence — but context changes everything.
January 7, 1954. Georgetown University and IBM publicly demonstrate a machine that translates Russian into English. The system handles 60 sentences using a vocabulary of just 250 words and 6 grammar rules. The press goes wild. The New York Times reports that fully automatic translation is three to five years away.
It wasn't. It was decades away. But let's understand why that demo was so electrifying.
The Cold War is at its peak. The Soviet Union is producing scientific literature at a staggering rate — papers on physics, chemistry, rocketry — and almost none of it is being read in the West because there aren't enough Russian-English translators. Warren Weaver, a Rockefeller Foundation director, writes a famous memo in 1949 proposing that translation is fundamentally a problem of cryptography: Russian text is just English "written in a strange cipher." If we broke Enigma, surely we can break Russian.
The Georgetown-IBM system was essentially a dictionary lookup with morphological rules. For each Russian word, find the English equivalent. Apply a few reordering rules because Russian and English put words in different orders. Output the result.
This approach is called direct translation or word-by-word translation. For some sentences, it works tolerably well. "The student reads the book" translates word-for-word between many language pairs. But language is not that cooperative.
Consider translating the German sentence "Er hat den Mann, der den Hund geschlagen hat, gesehen" word-by-word into English. You get something like: "He has the man who the dog beaten has seen." The actual meaning is: "He saw the man who beat the dog." The words are all there, but the structure is mangled. German puts verbs at the end of subordinate clauses. English doesn't. No amount of dictionary lookup can fix this — you need to understand sentence structure.
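To make the failure concrete, here is a minimal sketch of word-by-word translation applied to the German sentence above. The `LEXICON` is a hypothetical toy dictionary written for this illustration, not a reconstruction of the Georgetown-IBM system, and it gives every word a single gloss regardless of context.

```python
# Toy German-to-English lexicon (hypothetical): one gloss per word, no context.
LEXICON = {
    "er": "he", "hat": "has", "den": "the", "mann": "man",
    "der": "who", "hund": "dog", "geschlagen": "beaten", "gesehen": "seen",
}

def translate_word_by_word(sentence: str) -> str:
    """Replace each word with its dictionary gloss, keeping the source word order."""
    words = sentence.lower().replace(",", "").rstrip(".").split()
    # Unknown words are passed through in angle brackets.
    return " ".join(LEXICON.get(w, f"<{w}>") for w in words)

print(translate_word_by_word("Er hat den Mann, der den Hund geschlagen hat, gesehen"))
# he has the man who the dog beaten has seen
```

Every word is translated, yet nothing in the lookup knows that the German verbs need to move. Fixing the output requires a representation of sentence structure, which a dictionary does not have.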
Or consider idioms. Translating the French "il pleut des cordes" word-by-word gives "it rains ropes." The meaning: "it's raining cats and dogs." Languages encode the same ideas using completely different metaphors. A dictionary can't bridge that gap.
By the early 1960s, machine translation research had consumed millions of dollars with underwhelming results. In 1964, the National Academy of Sciences convened the Automatic Language Processing Advisory Committee (ALPAC) to review the field. Its 1966 report concluded that machine translation was "not practical" and that funding should be redirected to basic linguistic research. Translation funding was slashed almost overnight, and the field entered a decade-long winter.
See how word-by-word translation fails. Click an example to compare literal (word-level) translation with the actual meaning. Notice how word order, idioms, and structure destroy the naive approach.
After the ALPAC crash, researchers pivoted. Instead of trying to translate entire languages, they asked a narrower question: can a computer understand language within a limited domain? Can it follow instructions, answer questions, carry out tasks — as long as we carefully define the world it operates in?
The answer was yes. And the most famous demonstration was Terry Winograd's SHRDLU (1972).
SHRDLU lived in a simulated world of colored blocks on a table. You could type natural English commands: "Pick up a big red block." "Put the blue pyramid on the block in the box." "Does the box contain anything?" And SHRDLU would do it, responding in English, tracking the state of its world, even answering questions about why it did something.
The system was breathtaking. It parsed complex, nested sentences. It resolved pronoun references ("put it on..." — which object is "it"?). It reasoned about spatial relationships. For a brief moment, it looked like the language understanding problem was nearly solved.
It wasn't. SHRDLU worked because its world had exactly one table, a handful of blocks, and about 50 vocabulary words. The linguistic knowledge was painstakingly hand-coded as rules: "If the user says 'pick up X,' find the object matching X, check if the gripper is free, move the gripper to X, close the gripper." Every possible sentence pattern required a rule. Every new object required new rules. Every new domain required starting from scratch.
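The "pick up X" rule quoted above can be sketched in a few lines. This is an illustrative toy, not SHRDLU's actual implementation: the `WORLD` contents, the single regular-expression rule, and the reply strings are all assumptions made for the example.

```python
import re

# Hypothetical toy world: object ids mapped to their attributes.
WORLD = {
    "b1": {"size": "big", "color": "red", "shape": "block"},
    "b2": {"size": "small", "color": "blue", "shape": "pyramid"},
}

def find_object(description):
    """Return the single object whose attributes include every word given."""
    wanted = set(description.split())
    matches = [oid for oid, attrs in WORLD.items() if wanted <= set(attrs.values())]
    return matches[0] if len(matches) == 1 else None

def handle(command, state):
    """Apply the one hand-written rule: 'pick up a/the X'."""
    match = re.fullmatch(r"pick up (?:a|the) (.+)", command.lower().rstrip("."))
    if match is None:
        return "I don't understand."          # no rule covers this phrasing
    target = find_object(match.group(1))
    if target is None:
        return "I can't tell which object you mean."
    if state["holding"] is not None:
        return "My gripper is full."
    state["holding"] = target                 # update the world state
    return "OK."

state = {"holding": None}
print(handle("Pick up a big red block.", state))   # OK.
print(handle("Grab the blue pyramid.", state))     # I don't understand.
```

The second command fails only because nobody wrote a rule for "grab." That is the brittleness in miniature: coverage grows one hand-written pattern at a time.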
Around the same time, William Woods built LUNAR (first demonstrated in 1971, with a widely cited description published in 1978), a system that answered natural language questions about the chemical composition of lunar rock samples brought back by Apollo missions. "What is the average concentration of aluminum in high-silica rocks?" LUNAR would parse the question, convert it to a database query, execute it, and return the answer in English.
LUNAR worked well within its domain — about 90% accuracy on the questions scientists actually asked about moon rocks. But like SHRDLU, it was hand-built for exactly one dataset. Ask it about Mars rocks and it would crash, not because the chemistry was different, but because nobody had written rules for Mars-related vocabulary.
These systems shared a design philosophy that Manning emphasizes: separate declarative linguistic knowledge (grammar rules, word meanings, world facts) from procedural processing (parsing algorithms, inference engines). The idea was that if you got the grammar right, you could swap in different grammars for different languages or domains.
This pipeline — parse, interpret, act — was elegant. But it required human experts to write every rule, every word meaning, every grammar pattern. Manning notes that progress was real but "agonizingly slow." By the early 1990s, the biggest hand-built systems had thousands of rules and still couldn't handle open-domain text.
A tiny blocks world. Click a command to watch the system parse and execute it. Notice how even simple commands require parsing, reference resolution, and world-state tracking.
In the early 1990s, something shifted. Linguists had spent decades hand-crafting rules. Computer scientists had spent decades building narrow expert systems. And then a generation of researchers said: what if we stop trying to encode human knowledge, and instead let the data tell us the patterns?
The raw material was suddenly available. The internet was exploding. The Penn Treebank (1993) — a corpus of Wall Street Journal articles where every sentence had been manually annotated with its grammatical structure — gave researchers a shared benchmark. Digitized books, news archives, and web crawls provided billions of words of raw text. For the first time, NLP had data at scale.
The core idea of statistical NLP: instead of writing rules, count patterns. How often does the word "the" appear before "cat"? How often does "cat" appear as a noun vs. a verb? If you've seen "the cat sat on the ___," what word is most likely to fill the blank?
These questions can be answered by counting. Given a large enough corpus, you can estimate the probability of any word following any other word. This is called a language model — specifically, an n-gram language model, where n is the number of words you look at for context.
A bigram model predicts the next word using only the previous word. Simple, but surprisingly powerful. "I ate ___" — a bigram model trained on English text will assign high probability to "lunch," "dinner," "breakfast," and low probability to "elephant" or "theorem." It has learned something about how English works, without a single hand-written rule.
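Counting really is all it takes. The sketch below estimates bigram probabilities from a tiny made-up corpus (an assumption for illustration; real models use millions of sentences) using the plain maximum-likelihood estimate count(prev, next) / count(prev), with no smoothing.

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus, already tokenized by whitespace.
corpus = "i ate lunch . i ate dinner . i ate lunch early . the cat ate fish .".split()

# bigram_counts[prev][nxt] = how often `nxt` directly follows `prev`.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def prob(nxt, prev):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 if the pair was never seen."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(prob("lunch", "ate"))     # 0.5
print(prob("elephant", "ate"))  # 0.0
```

The zero for "elephant" shows why practical n-gram models need smoothing: an unseen pair is not an impossible one.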
Statistical NLP produced a toolkit of models that dominated the field for 20 years:
| Model | Task | Key Idea |
|---|---|---|
| HMM (Hidden Markov Model) | Part-of-speech tagging | Words are observations; POS tags are hidden states |
| CRF (Conditional Random Field) | Named entity recognition | Globally optimal tag sequence, not greedy |
| MaxEnt (Maximum Entropy) | Text classification | Make no assumptions beyond observed features |
| PCFG (Probabilistic Context-Free Grammar) | Syntactic parsing | Grammar rules with probabilities |
All of these models shared a paradigm: a human expert designs features (is this word capitalized? does it end in "-tion"? is the previous word a verb?), and the algorithm learns the weights of those features from data. The features are hand-crafted; the weights are learned. Manning calls this the shift from "hand-craft rules" to "hand-craft features."
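The division of labor (humans write the features, the algorithm learns the weights) can be sketched with a perceptron, used here only because it is the simplest learner that fits the pattern; it is not one of the models in the table. The feature names, the toy sentences, and the name-versus-not-a-name task are all assumptions for illustration.

```python
def features(words, i):
    """Hand-crafted features for the word at position i."""
    word = words[i]
    prev = words[i - 1] if i > 0 else "<s>"
    return {
        "bias": 1.0,
        "is_capitalized": float(word[0].isupper()),
        "ends_in_tion": float(word.endswith("tion")),
        "prev_is_title": float(prev in {"Dr.", "Mr.", "Ms."}),
    }

def score(weights, feats):
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def predict(weights, words, i):
    return 1 if score(weights, features(words, i)) > 0 else -1

def train_perceptron(examples, epochs=10):
    """Learn one weight per feature; label +1 means 'person name', -1 means not."""
    weights = {}
    for _ in range(epochs):
        for words, i, label in examples:
            if predict(weights, words, i) != label:
                # On a mistake, nudge each active feature toward the right label.
                for name, value in features(words, i).items():
                    weights[name] = weights.get(name, 0.0) + label * value
    return weights

s1 = "Dr. Manning gave the presentation".split()
s2 = "The station closed".split()
s3 = "Ms. Lee spoke".split()
examples = [(s1, 1, 1), (s1, 4, -1), (s1, 2, -1), (s2, 0, -1), (s2, 1, -1), (s3, 1, 1)]

weights = train_perceptron(examples)
print(weights)
```

On this toy data the learner ends up leaning on `prev_is_title` rather than `is_capitalized`, because the sentence-initial "The" is capitalized but is not a name. The human decided which cues were available; the data decided how much each one counts.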
Statistical models dominated NLP benchmarks throughout the 2000s. Part-of-speech tagging accuracy reached 97%. Named entity recognition was commercially deployed (finding person names, company names, locations in news text). Machine translation improved steadily through phrase-based statistical MT, where the system learned to translate short phrases rather than individual words.
But statistical models had a fundamental limitation: they couldn't capture long-range dependencies. A bigram model doesn't know that "the cat that the dog that the rat bit chased ran away" is about a cat running. It can only see one word back. Even trigrams and 5-grams couldn't bridge the gap between "the cat" and "ran" when many words intervened.
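The one-word memory is easy to see by sampling from a bigram model. This sketch uses a small made-up corpus and a seeded random generator (both assumptions for illustration); every adjacent pair it emits was seen in training, but nothing ties the sentence together beyond that.

```python
import random
from collections import Counter, defaultdict

# Tiny hypothetical training corpus.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat chased the dog .").split()

# successors[prev][nxt] = how often `nxt` directly follows `prev`.
successors = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    successors[prev][nxt] += 1

def generate(start, length, rng):
    """Sample up to `length` words, each conditioned only on the previous word."""
    words = [start]
    for _ in range(length - 1):
        options = successors.get(words[-1])
        if not options:
            break                              # dead end: word never seen as a prefix
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        words.append(nxt)
    return words

print(" ".join(generate("the", 12, random.Random(0))))
```

Because each choice looks only at the previous word, the model can happily produce "the cat sat on the dog sat on the mat": each step is plausible, and the whole is not.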
Watch a bigram model generate text word by word. Each word is chosen based on the probability P(next | current). Click "Generate" to produce text, and notice how it's locally coherent but globally nonsensical.
In 2013, a paper from Google changed NLP forever. Tomas Mikolov and colleagues published Word2Vec, a method that represented every word as a dense vector in a continuous space — and these vectors captured meaning in ways nobody expected.
The idea itself wasn't new. Yoshua Bengio had proposed neural language models in 2003. Collobert and Weston had shown that neural networks could learn useful word representations in 2008. But Word2Vec was fast enough to train on billions of words, and the resulting vectors had a remarkable property.
Word2Vec learned a 300-dimensional vector for each word by training on a simple task: predict a word from its context (or predict context from a word). The resulting vectors captured semantic relationships as geometric relationships. Words with similar meanings clustered together. And relationships between words could be expressed as vector arithmetic:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
This was astonishing. The model had never been told that kings and queens are related, or that gender is a dimension of meaning. It learned these relationships purely from patterns in text — from the fact that "king" and "queen" appear in similar contexts, modified by the same gender-related context shifts as "man" and "woman."
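The arithmetic behind the king and queen example can be sketched with hand-picked vectors. These 2-D vectors are hypothetical and chosen so the geometry is easy to see (roughly "royalty" and "gender" axes); real Word2Vec vectors are learned from text, and their dimensions are not individually interpretable like this.

```python
import math

# Hypothetical hand-made vectors: (royalty, gender). Not learned from data.
VECTORS = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.1,  1.0),
    "woman": (0.1, -1.0),
    "apple": (-0.5, 0.0),
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    candidates = [w for w in VECTORS if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))

print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates is the usual convention when evaluating analogies; without it, the nearest vector is often one of the inputs.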
Once you have word vectors, you can feed them into neural networks. The field moved rapidly through several architectures:
| Year | Architecture | Key Innovation |
|---|---|---|
| 2013 | Word2Vec | Dense word vectors from context prediction |
| 2014 | GloVe | Global co-occurrence statistics + embedding |
| 2014 | Seq2Seq | Encoder-decoder for variable-length sequences |
| 2015 | Attention | Focus on relevant parts of input dynamically |
| 2017 | Transformer | Self-attention replaces recurrence entirely |
Recurrent Neural Networks (RNNs) could process sequences of any length by maintaining a hidden state that updated at each time step. But they struggled with long sequences — the signal from early words faded as it passed through many time steps (the vanishing gradient problem). LSTMs (Long Short-Term Memory networks) partially solved this with gating mechanisms, but were slow to train because each step depended on the previous one.
The Transformer (Vaswani et al., 2017) eliminated recurrence entirely. Instead of processing words one by one, it processed all words in parallel, using self-attention to let every word look at every other word directly. This was faster to train (parallelizable on GPUs) and captured long-range dependencies naturally. The Transformer became the foundation for everything that followed.
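Here is a stripped-down sketch of self-attention, where every token scores every other token and takes a weighted average. It is a simplification under stated assumptions: the learned query, key, and value projections and the multiple heads of a real Transformer are omitted, so each token vector acts as its own query, key, and value. The input vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """X is a list of token vectors. Returns (outputs, attention_weights)."""
    d = len(X[0])
    outputs, weights = [], []
    for query in X:
        # Scaled dot product between this token and every token, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in X]
        row = softmax(scores)
        weights.append(row)
        # New representation: attention-weighted average of all token vectors.
        outputs.append([sum(w * value[j] for w, value in zip(row, X)) for j in range(d)])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # three hypothetical token vectors
outputs, weights = self_attention(X)
for row in weights:
    print([round(w, 2) for w in row])
```

Nothing in the loop over `X` depends on an earlier iteration, which is why the real computation can run as one matrix multiplication on a GPU. The distance between two tokens also plays no role in the score, so a word 20 positions back is as reachable as the previous one.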
A simplified 2D view of word embeddings. Words with similar meanings cluster together. Click an analogy to see how vector arithmetic captures relationships between words.
2018 was a watershed year for NLP. Two models arrived that changed the field's trajectory permanently: BERT (from Google) and GPT (from OpenAI). Both were built on Transformers. Both were trained on massive amounts of text. But their key innovation wasn't architectural — it was the training paradigm.
Statistical and early neural NLP systems needed labeled data — sentences manually annotated with parts of speech, named entities, sentiment, or translations. Creating this data was expensive. The Penn Treebank, which took years to build, contained about 1 million words. Medical NER datasets might have only 10,000 labeled sentences. Each new task required a new labeled dataset.
The breakthrough insight: the text itself is the supervision.
BERT's training objective is beautifully simple. Take a sentence. Randomly select about 15% of its tokens and hide them, in most cases by replacing each with a special [MASK] token. Train the model to predict the original tokens. No human annotation is needed, because the original text provides the correct answers.
To predict the masked word correctly, BERT must learn grammar ("sat" requires a noun before it), semantics ("the ___ sat on the mat" suggests an animal), and world knowledge ("cats sit on mats more often than elephants do"). All from predicting masked words.
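Building a masked training example takes only a few lines, which is the point: the labels come for free from the text. This is a simplified sketch. It always substitutes `[MASK]`, whereas the published BERT recipe sometimes swaps in a random token or leaves the chosen token unchanged, and it splits on whitespace rather than using a subword tokenizer.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels hold the original token at masked positions."""
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(token)      # the model must recover this
        else:
            masked.append(token)
            labels.append(None)       # no loss is computed at this position
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, rng=random.Random(0))
print(masked)
print(labels)
```

The sketch uses `mask_prob=0.3` only so that a six-word sentence is likely to show at least one mask; with 15% and a sentence this short, many draws would mask nothing.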
GPT takes an even simpler approach: given all previous words, predict the next one. This is exactly the language modeling objective from Era 3 — but with a Transformer instead of an n-gram model, and trained on orders of magnitude more data.
The key difference from bigram models: the Transformer can attend to the entire preceding context, not just the last one or two words. It can learn that "The president of France, who recently visited Germany and spoke with the chancellor about trade policy, ___" should be continued with a verb whose subject is "president." That dependency spans about 15 words, which is out of reach for any practical n-gram model but natural for a Transformer.
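The next-word objective turns any stretch of raw text into training examples. The sketch below shows only the data side, pairing every prefix with the word that follows it; the function name is hypothetical, and real systems operate on subword tokens and batch these pairs efficiently rather than materializing them one by one.

```python
def next_word_examples(tokens):
    """Pair each prefix of the text with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in next_word_examples("the cat sat on the mat".split()):
    print(context, "->", target)
```

A six-word sentence yields five examples, and the context in each one grows all the way back to the first word. An n-gram model would truncate that context to the last n-1 words; the Transformer gets to use all of it.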
Before BERT, each NLP task had its own model. Sentiment analysis? Train a classifier. Named entity recognition? Train a sequence labeler. Question answering? Train an entirely different model. Each from scratch.
BERT popularized the pretrain-finetune paradigm: pretrain one large model on self-supervised objectives, then fine-tune it on each specific task with a small amount of labeled data. The pretrained model already "knows" English; fine-tuning just teaches it the specific format of the task. This is the birth of the foundation model: one model that serves as the foundation for many downstream tasks.
Click a word to mask it. The model shows its top-5 predictions for the masked position. Notice how context from both sides informs the prediction.
Here is the question that keeps AI researchers up at night: when GPT-4 writes a coherent essay about quantum mechanics, does it understand quantum mechanics? Or is it doing something more like a very sophisticated parrot — reproducing patterns from its training data without any genuine comprehension?
Manning addresses this directly in his Daedalus essay, and his answer is more nuanced than either side of the debate usually admits.
Manning argues that meaning is not a binary — either you understand or you don't. Meaning is "the network of connections between a linguistic form and other things." A word's meaning is its connections to other words, to facts about the world, to sensory experiences, to actions you can perform.
Consider the word "shehnai" (a type of oboe played in South Asian music). If you've never heard this word before, you have zero connections. After reading this parenthetical definition, you have a few: it's an instrument, it's like an oboe, it's associated with South Asian music. After hearing one played, you have more: its sound, its appearance, the feeling of the music. After learning to play one, you have even more: the feel of the reed, the fingering patterns, the breath control.
LLMs have an extraordinarily rich linguistic connection network. They've seen billions of sentences using every word in thousands of contexts. They can correctly complete "The shehnai is a type of ___" (musical instrument). They can tell you it's associated with weddings. They can distinguish it from a sitar or a tabla.
What they lack is grounding — connection to the physical world. An LLM has never heard a shehnai. It has never felt wind or tasted food or seen a sunset. Its entire connection network is text-to-text. Manning calls this "partial meaning" — real understanding of linguistic relationships, but no embodied experience.
This distinction matters practically. LLMs excel at tasks where linguistic knowledge is sufficient: translation, summarization, question answering about text. They struggle at tasks requiring physical intuition: "If I put a ball on a slanted table, which direction does it roll?" An LLM might get this right from having read physics texts, but it doesn't have the intuitive understanding that a toddler gets from playing with balls.
John Searle's famous Chinese Room argument (1980) claims that a system following rules to manipulate Chinese characters doesn't "understand" Chinese, even if its outputs are perfect. Manning's counter: understanding IS the network of connections. If the system has rich enough connections between linguistic forms and other knowledge, that constitutes a form of understanding — regardless of the substrate (neurons vs. silicon).
This is a live debate. It's not settled. But Manning's framework gives us a useful way to think about it: don't ask "does it understand?" Ask "how rich is its connection network, and what domains of connection does it have?"
Click a word to see its connection network — how it links to other concepts, facts, and associations. Richer networks = deeper understanding. Compare what an LLM "knows" (text connections) vs. what a human knows (text + sensory + motor).
We've traced four eras of NLP. Now let's see the full picture — every major milestone from 1950 to 2025, color-coded by era, interactive, and detailed. This is the payoff: the entire history of language AI in one scrollable, clickable timeline.
The timeline below includes 25 milestones spanning 75 years. Click any milestone to see details: what it was, why it mattered, and how it connects to the era's themes. Drag or use the controls to scroll through time. Watch how the pace of progress picks up: arguably more happened between 2017 and 2023 than in the entire preceding half-century.
A model that generates fluent, confident, well-structured text is dangerous precisely because it is fluent. If it's wrong, it's wrong persuasively. Manning identifies several critical limitations and risks that the NLP community is actively grappling with.
LLMs generate text by repeatedly predicting a likely next token. They are not retrieving facts from a database. They are not reasoning from first principles. They are completing patterns. This means they will sometimes state, with full confidence, things that are entirely false, a phenomenon called hallucination.
Ask GPT-3 to write a biography of a minor historical figure, and it may invent publications, misattribute quotes, or fabricate entire events — all in perfectly grammatical, authoritative prose. The fluency makes it harder to detect errors, not easier.
LLMs can do something that looks like logical reasoning: "All men are mortal. Socrates is a man. Therefore, Socrates is mortal." But Manning argues this is pattern completion, not logical inference. When the reasoning chain is unfamiliar or requires novel steps, LLMs falter. Multi-step math, complex planning, and counterfactual reasoning remain weak points.
Models trained on human text learn human biases. In early word embeddings, "doctor" was closer to "man" than to "woman" in vector space, while "nurse" was closer to "woman." These biases weren't inserted deliberately — they reflect statistical patterns in the training data, which reflects historical and societal inequalities.
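One simple way to probe this kind of bias is to compare how close an occupation word sits to "man" versus "woman." The vectors below are invented for illustration and are not measurements from any real embedding model; only the method (a difference of cosine similarities) is the point.

```python
import math

# Hypothetical 2-D vectors invented for illustration; not taken from a real model.
VECTORS = {
    "man":    (1.0, 0.2),
    "woman":  (-1.0, 0.2),
    "doctor": (0.3, 0.9),
    "nurse":  (-0.4, 0.9),
    "table":  (0.0, 1.0),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def gender_lean(word):
    """Positive: closer to 'man'. Negative: closer to 'woman'. Zero: equidistant."""
    v = VECTORS[word]
    return cosine(v, VECTORS["man"]) - cosine(v, VECTORS["woman"])

for word in ("doctor", "nurse", "table"):
    print(word, round(gender_lean(word), 3))
```

Run against real pretrained vectors, a probe like this gives one number per word that can be tracked before and after a debiasing step. A score near zero on this one axis does not show that a model is free of bias in general.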
This matters because NLP models are increasingly used for hiring, content moderation, medical triage, and legal analysis. A biased model applied at scale amplifies the biases in its training data — potentially affecting millions of people.
Training a frontier LLM costs tens of millions of dollars in compute. This means only a handful of organizations — OpenAI, Google, Meta, Anthropic, a few others — can build foundation models. Manning raises the concern that this concentrates enormous power: whoever controls the foundation model controls the applications built on top of it.
Openly released models (LLaMA, Mistral, OLMo) are a partial answer, but the compute needed to train them remains a barrier. The gap between "can fine-tune" and "can pretrain from scratch" is enormous.
Perhaps the deepest risk: we have built systems that sound like they understand, and this creates a mismatch of expectations. Users attribute reasoning, knowledge, and intent to models that are doing pattern completion. This leads to over-reliance on model outputs, under-scrutiny of model errors, and inappropriate trust in model "judgments."
See how word embeddings encode societal biases. The distances shown reflect real patterns found in early embedding models. Use the slider to compare old (biased) vs. modern (debiased) embeddings.
Lecture 1 of CS224N is a bird's-eye view. Every concept introduced here gets its own deep treatment in later lectures. The history lesson isn't just backdrop — it motivates why modern NLP works the way it does. Each era's failure explains the next era's design.
| Lectures | Topic | Connects To |
|---|---|---|
| 2 | Word Vectors (Word2Vec, GloVe) | Ch 4: embedding breakthrough |
| 3 | Backprop & Neural Nets | Ch 4: the neural revolution |
| 4 | Language Models & RNNs | Ch 3: statistical models → Ch 4: neural models |
| 5 | Seq2Seq, Attention, Transformers | Ch 4: the Transformer architecture |
| 7 | Pretraining (BERT, GPT) | Ch 5: the self-supervised breakthrough |
| 8 | Post-training (RLHF, DPO) | Ch 8: aligning models with human values |
| 12–13 | Reasoning & Agents | Ch 6: understanding vs. pattern matching |
The four eras, and what each one automated versus left to humans:

| Era | Years | Approach | Automated | Still Manual |
|---|---|---|---|---|
| 1: Translation | 1950–1969 | Dictionary lookup + rules | Lookup | Everything else |
| 2: Hand-Built AI | 1970–1992 | Expert systems, grammars | Parsing | Rules, vocabulary, world model |
| 3: Statistical | 1993–2012 | Count patterns, learn weights | Weight learning | Feature design |
| 4: Neural | 2013–present | Learn representations end-to-end | Features + weights | Architecture choice, data curation |
"We may be getting our first glimpses of a more general form of artificial intelligence — but we are not done yet. The current models have remarkable abilities, but they lack careful logical reasoning, they confidently generate falsehoods, and they are opaque in their functioning."
— Christopher Manning, "Human Language Understanding & Reasoning," Daedalus (2022)