CS224N Lecture 1 — The History of Language AI

Chapter 0: Why Language?

You speak about 16,000 words per day. Each one carries meaning, context, ambiguity. Right now, reading this sentence, your brain is performing dozens of operations simultaneously: parsing syntax, resolving pronoun references, inferring tone, predicting what comes next. You do this effortlessly, without thinking about it, thousands of times before lunch.

Now imagine building a machine that handles all of that.

Language is humanity's superpower. Christopher Manning calls it the technology that lets us "network human brains together." A single sentence like "The bank refused the loan because it was too risky" requires you to know that "it" refers to the loan, not the bank — and that "bank" here means a financial institution, not a riverbank. This kind of reasoning, which children master by age five, has stumped AI researchers for over seventy years.

The fundamental challenge: Language is ambiguous at every level. Words have multiple meanings. Sentences have multiple parses. Paragraphs have multiple interpretations. And meaning depends on context that can stretch across entire documents — or require knowledge about the world that isn't in the text at all.

Consider the word "bank." In isolation, it could mean a financial institution, a river's edge, a pool shot, or the act of tilting an airplane. Humans disambiguate instantly from context. Machines? For decades, this single problem — word sense disambiguation — was an entire subfield of NLP research.

Or take the sentence "I saw her duck." Is she crouching, or do you see her pet waterfowl? The ambiguity isn't just about words. It's structural: the sentence has two completely valid grammatical parses, each producing a different meaning. This is called syntactic ambiguity, and it haunts every NLP system ever built.

This lesson traces the 75-year arc of NLP — from naive word-by-word translation in the 1950s, through hand-built rule systems in the 1970s, statistical methods in the 1990s, and the neural revolution that produced GPT and BERT. Along the way, we'll grapple with a question that Manning poses directly: do these models actually understand language, or are they just very sophisticated pattern matchers?

What this lesson covers: Four eras of NLP: rule-based translation (1950s), hand-built AI (1970s), statistical methods (1990s), and neural models (2013–present). The self-supervised breakthrough. The question of understanding vs. pattern matching. And the risks that come with systems that talk fluently but don't always think carefully.

Word Ambiguity Explorer

Click a sentence to see how context determines word meaning. The highlighted word is the same in every sentence — but context changes everything.

Why is the sentence "I saw her duck" ambiguous?

(a) The word "saw" has multiple meanings
(b) The sentence is grammatically incorrect
(c) It has two valid grammatical parses with different meanings (crouching vs. waterfowl)

Chapter 1: Era 1 — The Dream of Translation (1950–1969)

January 7, 1954. Georgetown University and IBM publicly demonstrate a machine that translates Russian into English. The system handles 60 sentences using a vocabulary of just 250 words and 6 grammar rules. The press goes wild. The New York Times reports that fully automatic translation is three to five years away.

It wasn't. It was decades away. But let's understand why that demo was so electrifying.

The Cold War is at its peak. The Soviet Union is producing scientific literature at a staggering rate — papers on physics, chemistry, rocketry — and almost none of it is being read in the West because there aren't enough Russian-English translators. Warren Weaver, a Rockefeller Foundation director, writes a famous memo in 1949 proposing that translation is fundamentally a problem of cryptography: Russian text is just English "written in a strange cipher." If we broke Enigma, surely we can break Russian.

Weaver's analogy: "When I look at a Russian article, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode." This framing — language as code — shaped the first two decades of NLP. It was a beautiful idea. It was also fundamentally wrong.

How Word-Level Translation Works (and Fails)

The Georgetown-IBM system was essentially a dictionary lookup with morphological rules. For each Russian word, find the English equivalent. Apply a few reordering rules because Russian and English put words in different orders. Output the result.

This approach is called direct translation or word-by-word translation. For some sentences, it works tolerably well. "The student reads the book" translates word-for-word between many language pairs. But language is not that cooperative.

Consider translating the German sentence "Er hat den Mann, der den Hund geschlagen hat, gesehen" word-by-word into English. You get something like: "He has the man who the dog beaten has seen." The actual meaning is: "He saw the man who beat the dog." The words are all there, but the structure is mangled. German puts verbs at the end of subordinate clauses. English doesn't. No amount of dictionary lookup can fix this — you need to understand sentence structure.
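
The failure above can be reproduced in a few lines. This is a minimal sketch of dictionary-lookup translation, not a reconstruction of the Georgetown-IBM system: the glossary entries and the function name `translate_word_by_word` are assumptions made up for this example, and the glossary even cheats by mapping "der" to "who", which a real dictionary could not do without looking at context.

```python
# Toy German-to-English glossary. Entries are illustrative assumptions,
# not the vocabulary of any historical system.
GLOSSARY = {
    "er": "he",
    "hat": "has",
    "den": "the",
    "mann": "man",
    "der": "who",  # only correct here because "der" is a relative pronoun
    "hund": "dog",
    "geschlagen": "beaten",
    "gesehen": "seen",
}


def translate_word_by_word(sentence: str) -> str:
    """Replace each source word with its glossary entry, keeping source order."""
    words = sentence.lower().replace(",", "").rstrip(".").split()
    # Unknown words are passed through unchanged rather than guessed.
    return " ".join(GLOSSARY.get(word, word) for word in words)


german = "Er hat den Mann, der den Hund geschlagen hat, gesehen"
literal = translate_word_by_word(german)
print(literal)  # he has the man who the dog beaten has seen
```

Every word is translated correctly, yet the output keeps German verb placement. Fixing it requires knowing where the clauses begin and end, which a lookup table has no way to represent.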

Or consider idioms. Translating the French "il pleut des cordes" word-by-word gives "it rains ropes." The meaning: "it's raining cats and dogs." Languages encode the same ideas using completely different metaphors. A dictionary can't bridge that gap.

The ALPAC Report: Funding Dies

By the early 1960s, machine translation research had consumed millions of dollars with underwhelming results. In 1964, the National Academy of Sciences convened a review panel, the Automatic Language Processing Advisory Committee (ALPAC). Its 1966 report concluded that machine translation was "not practical" and that funding should be redirected to basic linguistic research. Translation funding was slashed almost overnight. The field entered a decade-long winter.

The lesson of ALPAC: Ambitious AI promises, inadequate methods, and a funding crash. This pattern — hype, disappointment, winter — recurs throughout AI history. The 1966 crash was the first. It wouldn't be the last.

Naive Translator

See how word-by-word translation fails. Click an example to compare literal (word-level) translation with the actual meaning. Notice how word order, idioms, and structure destroy the naive approach.

What was the main conclusion of the 1966 ALPAC report?

(a) Machine translation needed bigger dictionaries
(b) Machine translation was not yet practical; funding should shift to basic research
(c) Word-by-word translation worked well for European languages

Chapter 2: Era 2 — Hand-Built Intelligence (1970–1992)

After the ALPAC crash, researchers pivoted. Instead of trying to translate entire languages, they asked a narrower question: can a computer understand language within a limited domain? Can it follow instructions, answer questions, carry out tasks — as long as we carefully define the world it operates in?

The answer was yes. And the most famous demonstration was Terry Winograd's SHRDLU (1972).

SHRDLU: A Robot That Understands Blocks

SHRDLU lived in a simulated world of colored blocks on a table. You could type natural English commands: "Pick up a big red block." "Put the blue pyramid on the block in the box." "Does the box contain anything?" And SHRDLU would do it, responding in English, tracking the state of its world, even answering questions about why it did something.

The system was breathtaking. It parsed complex, nested sentences. It resolved pronoun references ("put it on..." — which object is "it"?). It reasoned about spatial relationships. For a brief moment, it looked like the language understanding problem was nearly solved.

It wasn't. SHRDLU worked because its world had exactly one table, a handful of blocks, and about 50 vocabulary words. The linguistic knowledge was painstakingly hand-coded as rules: "If the user says 'pick up X,' find the object matching X, check if the gripper is free, move the gripper to X, close the gripper." Every possible sentence pattern required a rule. Every new object required new rules. Every new domain required starting from scratch.

The SHRDLU paradox: It was simultaneously the most impressive and the most limited NLP system of its era. Within its 50-word micro-world, it was near-perfect. Outside that world, it was helpless. This is the hallmark of hand-built systems: deep competence in narrow domains, zero transfer to anything else.

LUNAR: Moon Rocks and Databases

Around the same time, William Woods built LUNAR (first demonstrated in 1971), a system that answered natural language questions about the chemical composition of lunar rock samples brought back by Apollo missions. "What is the average concentration of aluminum in high-silica rocks?" LUNAR would parse the question, convert it to a database query, execute it, and return the answer in English.

LUNAR worked well within its domain — about 90% accuracy on the questions scientists actually asked about moon rocks. But like SHRDLU, it was hand-built for exactly one dataset. Ask it about Mars rocks and it would crash, not because the chemistry was different, but because nobody had written rules for Mars-related vocabulary.

The Architecture: Declarative Knowledge + Procedural Processing

These systems shared a design philosophy that Manning emphasizes: separate declarative linguistic knowledge (grammar rules, word meanings, world facts) from procedural processing (parsing algorithms, inference engines). The idea was that if you got the grammar right, you could swap in different grammars for different languages or domains.

Input Sentence

"Put the red block on the blue block"

↓

Syntactic Parser

Decompose into parts: verb(put), object(red block), destination(blue block)

↓

Semantic Interpreter

Map parse to meaning: MOVE(obj=red_block, dest=ON(blue_block))

↓

World Model

Check feasibility, update state, execute action

This pipeline — parse, interpret, act — was elegant. But it required human experts to write every rule, every word meaning, every grammar pattern. Manning notes that progress was real but "agonizingly slow." By the early 1990s, the biggest hand-built systems had thousands of rules and still couldn't handle open-domain text.
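
To make the three stages concrete, here is a minimal sketch of a parse, interpret, act pipeline for the example sentence. It is not SHRDLU's code: the single regular-expression rule, the `world` dictionary, and all function names are assumptions chosen for illustration.

```python
import re

# World state: maps each block to what it rests on. Names are hypothetical.
world = {"red_block": "table", "blue_block": "table"}


def parse(sentence: str) -> dict:
    """Syntactic step: match one hand-written pattern, 'put the X on the Y'."""
    match = re.fullmatch(r"put the (\w+ \w+) on the (\w+ \w+)", sentence.lower())
    if match is None:
        raise ValueError("no rule covers this sentence")
    return {"verb": "put", "object": match.group(1), "destination": match.group(2)}


def to_symbol(phrase: str) -> str:
    """Turn a phrase like 'red block' into the world-model name 'red_block'."""
    return phrase.replace(" ", "_")


def interpret(parsed: dict) -> tuple:
    """Semantic step: map the parse to a MOVE action over world symbols."""
    return ("MOVE", to_symbol(parsed["object"]), to_symbol(parsed["destination"]))


def act(action: tuple, state: dict) -> None:
    """World-model step: check feasibility, then update the state in place."""
    _, obj, dest = action
    if obj not in state or dest not in state:
        raise ValueError("unknown object")
    # A block that supports another block can be neither moved nor stacked on.
    if obj == dest or obj in state.values() or dest in state.values():
        raise ValueError("move is not feasible")
    state[obj] = dest


act(interpret(parse("Put the red block on the blue block")), world)
print(world)  # {'red_block': 'blue_block', 'blue_block': 'table'}
```

The sketch handles exactly one sentence pattern. "Pick up the red block" raises an error until someone writes another rule, which is the scaling problem in miniature.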

Why hand-built systems failed to scale: English has an estimated 170,000+ words in current use. Each word can participate in dozens of syntactic constructions. Each construction can have subtle meaning shifts depending on context. Writing rules for all of this by hand is like trying to build a road to every house in the world — individually, by hand, one brick at a time.

Mini SHRDLU

A tiny blocks world. Click a command to watch the system parse and execute it. Notice how even simple commands require parsing, reference resolution, and world-state tracking.

SHRDLU could understand complex English sentences within its domain. Why couldn't this approach scale to general language understanding?

(a) Every new word, construction, and domain required hand-written rules, an impossibly large effort
(b) Computers in the 1970s were too slow to run the algorithms
(c) English grammar is too irregular for any rule-based system

Chapter 3: Era 3 — Let the Data Speak (1993–2012)

In the early 1990s, something shifted. Linguists had spent decades hand-crafting rules. Computer scientists had spent decades building narrow expert systems. And then a generation of researchers said: what if we stop trying to encode human knowledge, and instead let the data tell us the patterns?

The raw material was becoming available. The Penn Treebank (1993), a corpus of Wall Street Journal articles in which every sentence had been manually annotated with its grammatical structure, gave researchers a shared benchmark. Over the following years, digitized books, news archives, and web crawls provided billions of words of raw text. For the first time, NLP had data at scale.

The Statistical Revolution

The core idea of statistical NLP: instead of writing rules, count patterns. How often does the word "the" appear before "cat"? How often does "cat" appear as a noun vs. a verb? If you've seen "the cat sat on the ___," what word is most likely to fill the blank?

These questions can be answered by counting. Given a large enough corpus, you can estimate the probability of any word following any other word. This is called a language model, specifically an n-gram language model, where n is the size of the word window: the word being predicted plus the n−1 words of context before it.

P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-1}) (bigram approximation)

A bigram model predicts the next word using only the previous word. Simple, but surprisingly powerful. "I ate ___" — a bigram model trained on English text will assign high probability to "lunch," "dinner," "breakfast," and low probability to "elephant" or "theorem." It has learned something about how English works, without a single hand-written rule.
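
A minimal sketch of the counting behind a bigram model follows. The four-sentence corpus is invented for illustration, and the estimate is the plain maximum-likelihood ratio with no smoothing, so unseen pairs get probability zero.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; real models count over millions of sentences.
corpus = [
    "i ate lunch",
    "i ate dinner",
    "i ate lunch today",
    "the cat ate dinner",
]

# bigram_counts[prev][nxt] = how often `nxt` directly follows `prev`.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1


def bigram_prob(prev: str, nxt: str) -> float:
    """Estimate P(nxt | prev) as count(prev, nxt) / count(prev, anything)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0


print(bigram_prob("ate", "lunch"))     # 0.5
print(bigram_prob("ate", "elephant"))  # 0.0
```

In this corpus "ate" is followed by "lunch" twice and "dinner" twice, so each gets probability 0.5, while "elephant" was never seen after "ate" and gets zero.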

The Tools: HMMs, CRFs, MaxEnt

Statistical NLP produced a toolkit of models that dominated the field for 20 years:

Model	Task	Key Idea
HMM (Hidden Markov Model)	Part-of-speech tagging	Words are observations; POS tags are hidden states
CRF (Conditional Random Field)	Named entity recognition	Globally optimal tag sequence, not greedy
MaxEnt (Maximum Entropy)	Text classification	Make no assumptions beyond observed features
PCFG (Probabilistic Context-Free Grammar)	Syntactic parsing	Grammar rules with probabilities

All of these models shared a paradigm: a human expert designs features (is this word capitalized? does it end in "-tion"? is the previous word a verb?), and the algorithm learns the weights of those features from data. The features are hand-crafted; the weights are learned. Manning calls this the shift from "hand-craft rules" to "hand-craft features."

The key shift: In Era 2, humans wrote rules: "If the word ends in -ed, it's past tense." In Era 3, humans wrote features: "Here's a feature: does the word end in -ed?" Then the algorithm learned from data: "This feature is strongly predictive of the VERB-PAST tag." The human still did creative work, but the learning was automated.
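
The division of labor can be sketched in code. Below, a human writes the feature functions and a perceptron learns one weight per feature. The perceptron is swapped in here because it is the simplest learner to show; it is not one of the models in the table above, and the six training words are made up for this example.

```python
FEATURE_NAMES = ["ends_in_ed", "ends_in_tion", "is_capitalized"]


def extract_features(word: str) -> dict:
    """Hand-designed binary features: a human chose these, not the algorithm."""
    return {
        "ends_in_ed": word.endswith("ed"),
        "ends_in_tion": word.endswith("tion"),
        "is_capitalized": word[:1].isupper(),
    }


def score(word: str, weights: dict, bias: float) -> float:
    """Sum the weights of the features that fire for this word."""
    active = [name for name, on in extract_features(word).items() if on]
    return bias + sum(weights[name] for name in active)


def train(data, epochs: int = 5):
    """Perceptron training: nudge weights whenever a prediction is wrong."""
    weights = {name: 0.0 for name in FEATURE_NAMES}
    bias = 0.0
    for _ in range(epochs):
        for word, is_past_tense in data:
            predicted = score(word, weights, bias) > 0
            if predicted != is_past_tense:
                step = 1.0 if is_past_tense else -1.0
                bias += step
                for name, on in extract_features(word).items():
                    if on:
                        weights[name] += step
    return weights, bias


# Made-up training set: (word, is it a past-tense verb?)
training_data = [
    ("walked", True), ("jumped", True), ("played", True),
    ("nation", False), ("London", False), ("table", False),
]

weights, bias = train(training_data)
print(weights)
print(score("talked", weights, bias) > 0)   # True
print(score("station", weights, bias) > 0)  # False
```

Nobody told the learner that "-ed" signals past tense. It ends up with a positive weight on `ends_in_ed` because that feature fires on the positive examples. What the learner cannot do is invent a feature the human did not think of.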

What Statistical Models Could and Couldn't Do

Statistical models dominated NLP benchmarks throughout the 2000s. Part-of-speech tagging accuracy reached 97%. Named entity recognition was commercially deployed (finding person names, company names, locations in news text). Machine translation improved steadily through phrase-based statistical MT, where the system learned to translate short phrases rather than individual words.

But statistical models had a fundamental limitation: they couldn't capture long-range dependencies. A bigram model doesn't know that "the cat that the dog that the rat bit chased ran away" is about a cat running. It can only see one word back. Even trigrams and 5-grams couldn't bridge the gap between "the cat" and "ran" when many words intervened.

The feature engineering bottleneck: For every new language, every new domain, every new task, somebody had to sit down and design features. Features for medical text were different from features for legal text. Features for Chinese were different from features for Arabic. This manual feature engineering was the new bottleneck — faster than hand-writing rules, but still fundamentally human-labor-bound.

Bigram Language Model

Watch a bigram model generate text word by word. Each word is chosen based on the probability P(next | current). Click "Generate" to produce text, and notice how it's locally coherent but globally nonsensical.

What did statistical NLP automate compared to Era 2, and what did it still require humans to do?

(a) It automated everything, with no human input needed
(b) It automated learning feature weights from data, but humans still had to design the features
(c) It automated grammar rules but still needed humans to label every sentence

Chapter 4: Era 4 — The Neural Revolution (2013–Present)

In 2013, a paper from Google changed NLP forever. Tomas Mikolov and colleagues published Word2Vec, a method that represented every word as a dense vector in a continuous space — and these vectors captured meaning in ways nobody expected.

The idea itself wasn't new. Yoshua Bengio had proposed neural language models in 2003. Collobert and Weston had shown that neural networks could learn useful word representations in 2008. But Word2Vec was fast enough to train on billions of words, and the resulting vectors had a remarkable property.

The Embedding Breakthrough

Word2Vec learned a 300-dimensional vector for each word by training on a simple task: predict a word from its context (or predict context from a word). The resulting vectors captured semantic relationships as geometric relationships. Words with similar meanings clustered together. And relationships between words could be expressed as vector arithmetic:

vec("king") − vec("man") + vec("woman") ≈ vec("queen")

This was astonishing. The model had never been told that kings and queens are related, or that gender is a dimension of meaning. It learned these relationships purely from patterns in text — from the fact that "king" and "queen" appear in similar contexts, modified by the same gender-related context shifts as "man" and "woman."
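
The arithmetic itself is easy to sketch. The vectors below are hand-assigned 2D toys (one axis loosely "royalty", one loosely "gender"), invented so the geometry is visible. They are not Word2Vec output, which is learned from text and has hundreds of dimensions.

```python
import math

# Hand-assigned toy vectors: (royalty, gender). Invented for illustration only.
vectors = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}


def cosine(a, b) -> float:
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a: str, b: str, c: str) -> str:
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates is the usual convention for analogy lookups. Without it, the nearest neighbor of the target is often one of the inputs.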

Why this matters: For the first time, a model learned its own representations — not features hand-designed by a human, not rules hand-written by a linguist, but dense vectors that captured meaning directly from data. This eliminated the feature engineering bottleneck of Era 3 entirely. The shift from Era 3 to Era 4 is: from "hand-craft features" to "learn representations."

From Features to End-to-End Learning

Once you have word vectors, you can feed them into neural networks. The field moved rapidly through several architectures:

Year	Architecture	Key Innovation
2013	Word2Vec	Dense word vectors from context prediction
2014	GloVe	Global co-occurrence statistics + embedding
2014	Seq2Seq	Encoder-decoder for variable-length sequences
2015	Attention	Focus on relevant parts of input dynamically
2017	Transformer	Self-attention replaces recurrence entirely

Recurrent Neural Networks (RNNs) could process sequences of any length by maintaining a hidden state that updated at each time step. But they struggled with long sequences — the signal from early words faded as it passed through many time steps (the vanishing gradient problem). LSTMs (Long Short-Term Memory networks) partially solved this with gating mechanisms, but were slow to train because each step depended on the previous one.

The Transformer (Vaswani et al., 2017) eliminated recurrence entirely. Instead of processing words one by one, it processed all words in parallel, using self-attention to let every word look at every other word directly. This was faster to train (parallelizable on GPUs) and captured long-range dependencies naturally. The Transformer became the foundation for everything that followed.
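
A stripped-down sketch of self-attention shows what "every word looks at every other word" means numerically. It assumes a major simplification: queries, keys, and values are all the raw input vectors, whereas a real Transformer first passes the inputs through learned projection matrices and uses several attention heads. The input `X` is three made-up 2D "word" vectors.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned projections)."""
    d = len(X[0])
    outputs, all_weights = [], []
    for query in X:
        # Score this position against every position, including itself.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in X]
        weights = softmax(scores)
        # The output is a weighted average of all value vectors.
        output = [sum(w * value[j] for w, value in zip(weights, X)) for j in range(d)]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy "word" vectors
outputs, attention = self_attention(X)
for row in attention:
    print([round(w, 3) for w in row])
```

Each row of `attention` is computed independently of the others, which is why the whole operation parallelizes, and the first position can draw on the last one directly instead of through a chain of hidden states.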

The progression of automation: Era 1: hand-write dictionaries. Era 2: hand-write rules. Era 3: hand-craft features, learn weights. Era 4: learn features AND weights from raw data. Each era automated one more step of the pipeline. The Transformer automated the last piece: the architecture itself was general enough to handle any sequence task.

Word Embedding Space

A simplified 2D view of word embeddings. Words with similar meanings cluster together. Click an analogy to see how vector arithmetic captures relationships between words.

What did Word2Vec demonstrate that surprised researchers?

(a) Semantic relationships like gender and royalty emerged as geometric relationships in vector space, without explicit supervision
(b) Neural networks could translate between languages perfectly
(c) Word vectors made statistical models like HMMs faster

Chapter 5: The Self-Supervised Breakthrough

2018 was a watershed year for NLP. Two models arrived that changed the field's trajectory permanently: BERT (from Google) and GPT (from OpenAI). Both were built on Transformers. Both were trained on massive amounts of text. But their key innovation wasn't architectural — it was the training paradigm.

The Old Problem: Labeled Data

Statistical and early neural NLP systems needed labeled data — sentences manually annotated with parts of speech, named entities, sentiment, or translations. Creating this data was expensive. The Penn Treebank, which took years to build, contained about 1 million words. Medical NER datasets might have only 10,000 labeled sentences. Each new task required a new labeled dataset.

The breakthrough insight: the text itself is the supervision.

BERT: Mask and Predict

BERT's training objective is beautifully simple. Take a sentence. Randomly select about 15% of the tokens and hide them, in most cases by replacing them with a special [MASK] token. Train the model to predict the original words. No human annotation is needed, because the original text provides the correct answers.

Original

"The cat sat on the mat"

↓ mask 15%

Input

"The [MASK] sat on the mat"

↓ predict

Output

P(cat) = 0.72, P(dog) = 0.11, P(bird) = 0.04, ...

To predict the masked word correctly, BERT must learn grammar ("sat" requires a noun before it), semantics ("the ___ sat on the mat" suggests an animal), and world knowledge ("cats sit on mats more often than elephants do"). All from predicting masked words.
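
Building the training example is mechanical, as this minimal sketch shows. It simplifies in two ways that should be read as assumptions: it splits on whitespace instead of using subword tokens, and it always substitutes [MASK], whereas the BERT paper sometimes swaps in a random token or leaves the chosen token unchanged. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_input, labels).

    labels[i] holds the original token at masked positions and None elsewhere,
    so the loss is only computed where the model has to guess.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))  # always mask at least one
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


tokens = "The cat sat on the mat".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

The labels come straight from the input text. That is the whole trick: any sentence ever written becomes a training example with no annotator involved.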

GPT: Predict the Next Word

GPT takes an even simpler approach: given all previous words, predict the next one. This is exactly the language modeling objective from Era 3 — but with a Transformer instead of an n-gram model, and trained on orders of magnitude more data.

L = − ∑_i log P(w_i | w_1, ..., w_{i-1}; θ)

The key difference from bigram models: the Transformer can attend to the entire preceding context, not just the last one or two words. It can learn that "The president of France, who recently visited Germany and spoke with the chancellor about trade policy, ___" should be continued with a verb whose subject is "president." That's a dependency spanning about 15 words: out of reach for practical n-gram models, natural for Transformers.
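
The loss above is simple to compute once the model has assigned a probability to each word that actually came next. The sketch below takes those probabilities as given. The numbers are made up, and a real system would get them from the Transformer's output layer.

```python
import math


def negative_log_likelihood(next_word_probs) -> float:
    """L = -sum_i log P(w_i | w_1, ..., w_{i-1}).

    next_word_probs[i] is the probability the model gave to the word that
    actually appeared at position i.
    """
    return -sum(math.log(p) for p in next_word_probs)


confident_model = [0.5, 0.5, 0.5]  # made-up probabilities for a 3-word text
unsure_model = [0.1, 0.1, 0.1]

print(negative_log_likelihood(confident_model))
print(negative_log_likelihood(unsure_model))
```

A model that puts more probability on the words that really occur gets a lower loss, and a model that predicts every word with probability 1 would reach a loss of zero. Training adjusts θ to push the loss down across the whole corpus.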

Self-supervised = unlimited data: Supervised learning needs humans to label data. Self-supervised learning creates its own labels from raw text. The internet has trillions of words. This means self-supervised models can train on essentially unlimited data. BERT was trained on 3.3 billion words (Wikipedia + BooksCorpus). GPT-3 was trained on 300 billion tokens. The scale is the magic.

Foundation Models: One Model, Many Tasks

Before BERT, each NLP task had its own model. Sentiment analysis? Train a classifier. Named entity recognition? Train a sequence labeler. Question answering? Train an entirely different model. Each from scratch.

BERT introduced the pretrain-finetune paradigm: pretrain one large model on self-supervised objectives, then fine-tune it on each specific task with a small amount of labeled data. The pretrained model already "knows" English; fine-tuning just teaches it the specific format of the task. This is the birth of the foundation model — one model that serves as the foundation for many downstream tasks.

The impact: BERT immediately broke records on 11 NLP benchmarks when it was released in October 2018. Sentiment analysis, question answering, textual entailment — across the board. A single architecture, pretrained once, then fine-tuned. The era of task-specific models was over.

Masked Language Model

Click a word to mask it. The model shows its top-5 predictions for the masked position. Notice how context from both sides informs the prediction.

Why is self-supervised learning a breakthrough for NLP?

(a) It makes models smaller and faster to train
(b) It replaces Transformers with more efficient architectures
(c) It eliminates the need for labeled data: the text itself provides supervision, enabling training on unlimited data

Chapter 6: Understanding vs. Pattern Matching

Here is the question that keeps AI researchers up at night: when GPT-4 writes a coherent essay about quantum mechanics, does it understand quantum mechanics? Or is it doing something more like a very sophisticated parrot — reproducing patterns from its training data without any genuine comprehension?

Manning addresses this directly in his Daedalus essay, and his answer is more nuanced than either side of the debate usually admits.

What "Meaning" Means

Manning argues that meaning is not a binary — either you understand or you don't. Meaning is "the network of connections between a linguistic form and other things." A word's meaning is its connections to other words, to facts about the world, to sensory experiences, to actions you can perform.

Consider the word "shehnai" (a type of oboe played in South Asian music). If you've never heard this word before, you have zero connections. After reading this parenthetical definition, you have a few: it's an instrument, it's like an oboe, it's associated with South Asian music. After hearing one played, you have more: its sound, its appearance, the feeling of the music. After learning to play one, you have even more: the feel of the reed, the fingering patterns, the breath control.

Manning's meaning spectrum: Understanding isn't binary. It's a spectrum defined by the density of your connection network. A toddler "understands" the word "dog" — they can point at one. A veterinarian understands it differently — they know breeds, diseases, anatomy. Both have understanding; the vet has a richer connection network. The question for LLMs is: how rich is their connection network?

What LLMs Have and What They Lack

LLMs have an extraordinarily rich linguistic connection network. They've seen billions of sentences using every word in thousands of contexts. They can correctly complete "The shehnai is a type of ___" (musical instrument). They can tell you it's associated with weddings. They can distinguish it from a sitar or a tabla.

What they lack is grounding — connection to the physical world. An LLM has never heard a shehnai. It has never felt wind or tasted food or seen a sunset. Its entire connection network is text-to-text. Manning calls this "partial meaning" — real understanding of linguistic relationships, but no embodied experience.

This distinction matters practically. LLMs excel at tasks where linguistic knowledge is sufficient: translation, summarization, question answering about text. They struggle at tasks requiring physical intuition: "If I put a ball on a slanted table, which direction does it roll?" An LLM might get this right from having read physics texts, but it doesn't have the intuitive understanding that a toddler gets from playing with balls.

The Winograd Schema Challenge: "The trophy doesn't fit in the brown suitcase because it is too big." Does "it" refer to the trophy or the suitcase? Humans instantly know it's the trophy. This requires understanding physical size relationships — exactly the kind of grounded reasoning that tests whether models truly understand. Modern LLMs get most Winograd schemas right, but through text patterns, not physical intuition.

The Chinese Room vs. The Connection Network

John Searle's famous Chinese Room argument (1980) claims that a system following rules to manipulate Chinese characters doesn't "understand" Chinese, even if its outputs are perfect. Manning's counter: understanding IS the network of connections. If the system has rich enough connections between linguistic forms and other knowledge, that constitutes a form of understanding — regardless of the substrate (neurons vs. silicon).

This is a live debate. It's not settled. But Manning's framework gives us a useful way to think about it: don't ask "does it understand?" Ask "how rich is its connection network, and what domains of connection does it have?"

Meaning Network

Click a word to see its connection network — how it links to other concepts, facts, and associations. Richer networks = deeper understanding. Compare what an LLM "knows" (text connections) vs. what a human knows (text + sensory + motor).

According to Manning, what is "meaning"?

(a) A fixed definition stored in a dictionary
(b) The network of connections between a linguistic form and other things (words, facts, experiences)
(c) The ability to pass the Turing test

Chapter 7: The Capability Explosion

We've traced four eras of NLP. Now let's see the full picture — every major milestone from 1950 to 2025, color-coded by era, interactive, and detailed. This is the payoff: the entire history of language AI in one scrollable, clickable timeline.

The timeline below includes 25 milestones spanning 75 years. Click any milestone to see details: what it was, why it mattered, and how it connects to the era's themes. Drag or use the controls to scroll through time. Watch how progress accelerates exponentially — more happened between 2017 and 2023 than in the entire preceding half-century.

What modern NLP can do: Machine translation that rivals human translators (for common language pairs). Question answering across domains. Code generation. Summarization. Few-shot learning: give GPT-3 a few examples of a task it has never seen, and it performs it. Zero-shot: describe a task in words, and it just does it. The gap between "pattern matching" and "understanding" gets blurrier every year.

NLP Milestones: 1950–2025

Click any milestone to see details. Use the era buttons to filter. Drag the timeline or use arrow buttons to scroll.

Between the Transformer paper (2017) and ChatGPT (2022), five years passed. Between the Georgetown-IBM experiment (1954) and SHRDLU (1972), eighteen years passed. What does this acceleration tell us?

(a) Earlier researchers were less intelligent
(b) Once the right paradigm (neural, self-supervised) was found, progress compounds: each advance enables the next more quickly
(c) The later breakthroughs were easier problems

Chapter 8: Risks & Limitations

A model that generates fluent, confident, well-structured text is dangerous precisely because it is fluent. If it's wrong, it's wrong persuasively. Manning identifies several critical limitations and risks that the NLP community is actively grappling with.

1. Hallucination

LLMs generate text by predicting the most likely next token. They are not retrieving facts from a database. They are not reasoning from first principles. They are completing patterns. This means they will confidently state things that are entirely false — a phenomenon called hallucination.

Ask GPT-3 to write a biography of a minor historical figure, and it may invent publications, misattribute quotes, or fabricate entire events — all in perfectly grammatical, authoritative prose. The fluency makes it harder to detect errors, not easier.

2. Lack of Careful Reasoning

LLMs can do something that looks like logical reasoning: "All men are mortal. Socrates is a man. Therefore, Socrates is mortal." But Manning argues this is pattern completion, not logical inference. When the reasoning chain is unfamiliar or requires novel steps, LLMs falter. Multi-step math, complex planning, and counterfactual reasoning remain weak points.

The reasoning gap: If you ask an LLM "If I have 3 boxes, and each box contains 2 red balls and 3 blue balls, how many balls do I have total?" it will usually get this right — it's a common pattern. But if you add constraints, exceptions, and multi-step dependencies, performance degrades sharply. Fluency creates an illusion of competence.

3. Bias

Models trained on human text learn human biases. In early word embeddings, "doctor" was closer to "man" than to "woman" in vector space, while "nurse" was closer to "woman." These biases weren't inserted deliberately — they reflect statistical patterns in the training data, which reflects historical and societal inequalities.

This matters because NLP models are increasingly used for hiring, content moderation, medical triage, and legal analysis. A biased model applied at scale amplifies the biases in its training data — potentially affecting millions of people.

4. Concentrated Power

Training a frontier LLM costs tens of millions of dollars or more in compute. As a result, only a handful of organizations (OpenAI, Google, Meta, Anthropic, and a few others) can build foundation models. Manning raises the concern that this concentrates enormous power: whoever controls the foundation model controls the applications built on top of it.

Openly released models (LLaMA, Mistral, OLMo) are a partial answer: their weights can be downloaded and fine-tuned, but the compute needed to train such models from scratch remains a barrier. The gap between "can fine-tune" and "can pretrain from scratch" is enormous.

The fundamental tension: The same property that makes foundation models powerful — learning from massive, diverse data — is what makes them risky. The data contains biases, falsehoods, and toxic content. The scale makes these problems hard to audit. The fluency makes errors hard to detect. And the concentration of training capability means fewer eyes on the problem.

5. The Gap Between Fluency and Understanding

Perhaps the deepest risk: we have built systems that sound like they understand, and this creates a mismatch of expectations. Users attribute reasoning, knowledge, and intent to models that are doing pattern completion. This leads to over-reliance on model outputs, under-scrutiny of model errors, and inappropriate trust in model "judgments."

Bias in Word Embeddings

[Interactive figure: how word embeddings encode societal biases. The distances shown reflect real patterns found in early embedding models, comparing 2013-era (biased) embeddings with modern (debiased) ones.]

Why is hallucination particularly dangerous in LLMs?

It makes the model run slower
It only happens with small models
The model states falsehoods in fluent, authoritative prose, making errors hard to detect

Chapter 9: Connections — Where to Go From Here

Lecture 1 of CS224N is a bird's-eye view. Every concept introduced here gets its own deep treatment in later lectures. The history lesson isn't just backdrop — it motivates why modern NLP works the way it does. Each era's failure explains the next era's design.

CS224N Roadmap

Lectures	Topic	Connects To
2	Word Vectors (Word2Vec, GloVe)	Ch 4: embedding breakthrough
3	Backprop & Neural Nets	Ch 4: the neural revolution
4	Language Models & RNNs	Ch 3: statistical models → Ch 4: neural models
5	Seq2Seq, Attention, Transformers	Ch 4: the Transformer architecture
7	Pretraining (BERT, GPT)	Ch 5: the self-supervised breakthrough
8	Post-training (RLHF, DPO)	Ch 8: aligning models with human values
12–13	Reasoning & Agents	Ch 6: understanding vs. pattern matching

The Four Eras at a Glance

Era	Years	Approach	Automated	Still Manual
1: Translation	1950–1969	Dictionary lookup + rules	Lookup	Everything else
2: Hand-Built AI	1970–1992	Expert systems, grammars	Parsing	Rules, vocabulary, world model
3: Statistical	1993–2012	Count patterns, learn weights	Weight learning	Feature design
4: Neural	2013–present	Learn representations end-to-end	Features + weights	Architecture choice, data curation

Key Papers from This Lecture

Manning, C. "Human Language Understanding & Reasoning." Daedalus, 2022.
Mikolov et al. "Efficient Estimation of Word Representations in Vector Space." 2013. (Word2Vec)
Vaswani et al. "Attention Is All You Need." NeurIPS 2017. (Transformer)
Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019.
Brown et al. "Language Models are Few-Shot Learners." NeurIPS 2020. (GPT-3)
Marcus & Mitchell. "GPT and the Art of BS." 2023. (The understanding debate)

Related Lessons

Transformer — From Absolute Zero — deep dive into self-attention, positional encoding, the full architecture
GPT — From Absolute Zero — autoregressive language modeling, scaling laws, emergent abilities
Reward & Alignment — RLHF, DPO, and the post-training pipeline

What You Can Now Do

Explain why word-by-word translation fails and what ALPAC concluded
Describe the SHRDLU system and why hand-built NLP couldn't scale
Distinguish n-gram models from neural language models
Explain what word embeddings are and why king − man + woman ≈ queen works
Describe BERT's masked language model and GPT's autoregressive objective
Articulate Manning's view that meaning is a network of connections
Identify key risks: hallucination, bias, concentrated power, reasoning gaps

"We may be getting our first glimpses of a more general form of artificial intelligence — but we are not done yet. The current models have remarkable abilities, but they lack careful logical reasoning, they confidently generate falsehoods, and they are opaque in their functioning."
— Christopher Manning, "Human Language Understanding & Reasoning," Daedalus (2022)