CS224N Lecture 15 — Interpreting Neural Networks

Chapter 0: Why Interpretability?

A language model generates a toxic response. Why? A model refuses a harmless request. Why? A model aces the bar exam but can't count to ten reliably. Why? These aren't rhetorical questions — they're the central challenge of interpretability: understanding what's actually happening inside neural networks.

Consider GPT-4. OpenAI has not disclosed its size, but unofficial estimates put it at roughly 1.8 trillion parameters spread across many layers. When it writes a poem, which parameters are doing the "creativity"? When it solves a math problem, which circuits implement the arithmetic? When it refuses a prompt, is that safety training or a bug? We genuinely don't know.

This is unlike any technology humanity has built before. We can open a car engine and trace every part. We can read the source code of a program. But a neural network? It's a massive matrix of floating-point numbers. The "source code" is the weights, and they're not human-readable.

The Black Box Problem

A prompt goes in, a response comes out. What happens in between? Even if you open the black box and look at the layers, the numbers tell you almost nothing.

This captures the frustration of interpretability research: even when you look inside the model, you see millions of numbers that don't obviously correspond to concepts. The goal of interpretability is to build tools that translate those numbers into human-understandable explanations.

Why It Matters Now

Interpretability isn't just academic curiosity. It has urgent practical implications:

Safety. If we can't understand why a model produces harmful outputs, we can't reliably prevent them. Current safety techniques (RLHF, red-teaming) are behavioral — they shape outputs without understanding internals. Interpretability could enable mechanistic safety: understanding and modifying the actual circuits that produce harmful behavior.

Trust. Medicine, law, and finance need to know why a model made a decision, not just what the decision was. A doctor won't trust a diagnosis model that says "cancer" without explaining its reasoning in terms that map to medical knowledge.

Debugging. When a model fails, interpretability helps find the cause. Is it a training data problem? A representation problem? A specific circuit that computes the wrong thing? Without interpretability, debugging is trial and error.

We are deploying systems we don't understand at unprecedented scale. Hundreds of millions of people interact with language models regularly. Interpretability research aims to close the gap between the power of these systems and our understanding of how they work. This lecture surveys the key tools: probing, attention visualization, sparse autoencoders, and the emerging use of AI agents to interpret AI systems.

A language model consistently refuses to discuss a particular historical event, even when the request is purely educational. Without interpretability tools, what can researchers do to diagnose the cause?

(a) Read the model's source code to find the refusal logic
(b) Inspect the model's weights directly to find the responsible neurons
(c) Very little beyond behavioral testing (prompting variations): the model's internal decision process is opaque, which is exactly why interpretability research exists

Chapter 1: Probing & Linear Probes

The simplest interpretability technique: take a model's internal representations and ask "what information is encoded here?" This is probing, and it works by training a small classifier (the probe) on top of frozen model representations.

Here's the setup. You have a pretrained language model. You feed in a sentence like "The cat sat on the mat." At each layer, the model produces a hidden representation for each token — a vector of dimension d (e.g., 768 for BERT-base). The question is: what does that vector encode?

To find out, you train a linear probe: a simple linear classifier (one weight matrix + bias) that takes the hidden representation as input and predicts some linguistic property. For example:

Property	Task	Labels
Part of speech	Classify each token's POS tag	noun, verb, adjective, ...
Dependency relation	Predict syntactic role	subject, object, modifier, ...
Named entity	Classify entity type	person, location, org, ...
Semantic role	Who did what to whom	agent, patient, instrument, ...
Coreference	Do two tokens refer to the same entity?	yes / no

If the linear probe achieves high accuracy, the information must be linearly accessible in the representation — meaning the model has organized its internal space so that this property is easy to read out with a simple linear transformation.

probe(h) = softmax(W · h + b)

Where h is the hidden representation (shape [d_model]), W is the probe's weight matrix (shape [num_classes, d_model]), and b is the bias (shape [num_classes]). The probe is trained with cross-entropy loss, with model weights frozen.
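To make the probe equation concrete, here is a minimal NumPy sketch of fitting softmax(W · h + b) with cross-entropy by plain gradient descent. The hidden states are synthetic stand-ins (random vectors with a class-dependent shift), not real model activations, and the function names, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

def train_linear_probe(H, y, num_classes, lr=0.5, steps=300):
    """Fit probe(h) = softmax(W h + b) with cross-entropy loss.

    H: [n, d_model] frozen hidden states, y: [n] integer labels.
    Only W and b are updated; H is treated as fixed input.
    """
    n, d_model = H.shape
    W = np.zeros((num_classes, d_model))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]                        # one-hot targets [n, C]
    for _ in range(steps):
        logits = H @ W.T + b                          # [n, C]
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)             # softmax probabilities
        G = (P - Y) / n                               # gradient of mean CE w.r.t. logits
        W -= lr * (G.T @ H)
        b -= lr * G.sum(axis=0)
    return W, b

def probe_accuracy(W, b, H, y):
    return float(((H @ W.T + b).argmax(axis=1) == y).mean())

# Synthetic "hidden states": each class is shifted along its own direction.
rng = np.random.default_rng(0)
num_classes, d_model, n = 3, 32, 600
centers = rng.normal(size=(num_classes, d_model))
centers *= 5.0 / np.linalg.norm(centers, axis=1, keepdims=True)
y = rng.integers(0, num_classes, size=n)
H = rng.normal(size=(n, d_model)) + centers[y]

W, b = train_linear_probe(H[:400], y[:400], num_classes)
acc = probe_accuracy(W, b, H[400:], y[400:])          # held-out probe accuracy
```

With real activations, H would be the hidden states collected from one layer of the frozen model and y the gold labels for the property being probed.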

Why Linear?

Why restrict the probe to a linear classifier? Because a sufficiently powerful probe (e.g., a deep neural network) could compute the property itself, rather than simply reading it from the representation. A 5-layer MLP probe with 95% accuracy doesn't tell you the representation encodes POS tags; the MLP may have learned the tagging task on its own from lower-level information, such as word identity, that happens to survive in the representation.

A linear probe is deliberately weak. If it achieves high accuracy, the information must already be organized linearly in the representation. The probe isn't computing anything complex — it's drawing a hyperplane through the representation space that separates the classes. The model did the hard work of arranging representations so that this hyperplane exists.

A probe tells you what information is accessible, not what information is used. Even if a linear probe finds that POS tags are encoded in layer 6, that doesn't mean the model actually uses POS information for its predictions. The information could be a byproduct of learning other things. This is the fundamental limitation of probing — correlation between representation structure and linguistic properties doesn't prove the model "understands" syntax.

Probing Across Layers

The most revealing experiment: train separate probes at every layer of the model. This reveals where different types of information emerge.

Probe Accuracy Across Layers

Accuracy of linear probes trained at each layer of a 12-layer transformer. Surface features (POS) peak early, syntactic features peak in middle layers, semantic features peak in later layers.

A consistent pattern emerges across many studies (Tenney et al. 2019, Hewitt & Manning 2019):

Layers 1-3 (early): Surface-level features peak. Part-of-speech tags, word identity, and morphological features are most accessible here. The model first represents what each token is.

Layers 4-8 (middle): Syntactic features peak. Dependency relations, constituent structure, and agreement patterns become most accessible. The model has figured out how tokens relate to each other structurally.

Layers 9-12 (late): Semantic features peak. Sentiment, coreference, semantic roles, and world knowledge are most accessible. The model has built an understanding of what the text means.

This creates a picture of transformers as information processing pipelines: raw text → morphology → syntax → semantics. Each layer refines the representation from surface-level toward meaning-level, much like the classical NLP pipeline but learned end-to-end.

python
import torch
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Get hidden states at every layer
inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states is a tuple of 13 tensors
# Shape: [1, seq_len, 768] for each layer (0 = embedding, 1-12 = layers)
layer_6 = outputs.hidden_states[6]  # [1, 8, 768]

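# Train a linear probe on layer 6 for POS tagging:
num_pos_tags = 17  # size of the Universal Dependencies UPOS tag set
probe = torch.nn.Linear(768, num_pos_tags)  # simple linear layer
# Freeze BERT and train only the probe weights with cross-entropy.
# High held-out accuracy (e.g., above 90%) would indicate POS info is linearly encoded in layer 6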

Structural Probing: Beyond Classification

Hewitt & Manning (2019) went beyond classifying individual tokens to probing for tree structure. They trained a probe to predict the distance between any two words in the dependency parse tree, directly from the model's representations.

The key insight: if there exists a linear transformation B such that the squared distance between transformed representations approximates the tree distance, then the parse tree is linearly encoded in the representation space.

d_tree(w_i, w_j) ≈ || B · h_i - B · h_j ||²

Where h_i and h_j are the hidden representations of words i and j, and B is a learned matrix (the structural probe). If this approximation is accurate, it means the model has organized its representation space so that syntactically related words are close together and syntactically distant words are far apart — the tree is literally embedded as a geometric structure in the representation.
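The target quantity d_tree is simply the number of edges on the path between two words in the parse tree. The sketch below computes it with breadth-first search; the parse of the example sentence is written by hand for illustration (an assumption, not parser output), and `tree_distances` is a hypothetical helper name.

```python
from collections import deque

def tree_distances(heads):
    """All-pairs path lengths in a dependency tree.

    heads[i] is the index of word i's head, or -1 for the root.
    """
    n = len(heads)
    neighbors = [[] for _ in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            neighbors[child].append(head)
            neighbors[head].append(child)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start}
        queue = deque([(start, 0)])
        while queue:                      # breadth-first search from `start`
            node, d = queue.popleft()
            dist[start][node] = d
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
    return dist

# "The cat sat on the mat" with a hand-written dependency parse:
# The -> cat, cat -> sat, sat = root, on -> mat, the -> mat, mat -> sat
words = ["The", "cat", "sat", "on", "the", "mat"]
heads = [1, 2, -1, 5, 5, 2]
dist = tree_distances(heads)
```

These integer distances are the regression targets: the structural probe is trained so that the squared distance between B · h_i and B · h_j matches dist[i][j].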

The result was striking: BERT's representations encode parse trees with remarkable fidelity. Trees reconstructed from the probe's predicted distances recovered roughly 80% of the gold dependency edges (ignoring direction and labels), using only the hidden representations and no explicit parser. The model has learned a geometric encoding of syntax as a byproduct of masked language modeling.

python
import torch
import torch.nn as nn

class StructuralProbe(nn.Module):
    """Learns a linear transform B such that
       ||B*h_i - B*h_j||^2 ≈ tree_distance(i,j)"""
    def __init__(self, d_model=768, rank=64):
        super().__init__()
        self.B = nn.Parameter(torch.randn(rank, d_model))

    def forward(self, h_i, h_j):
        # h_i, h_j: [batch, d_model]
        diff = self.B @ (h_i - h_j).T  # [rank, batch]
        return (diff ** 2).sum(dim=0)   # [batch] = predicted tree distance

BERT encodes syntactic parse trees as geometric distances in representation space. The structural probe shows that syntactically close words (subject-verb) have representations that are close in a linearly-transformed space, while syntactically distant words are far apart. This is not just "information is present" — it's "information is organized geometrically," which is a much stronger finding.

A linear probe trained on layer 3 of BERT achieves 95% accuracy on POS tagging, but the same probe on layer 11 achieves only 88%. What does this suggest?

(a) Layer 11 is broken
(b) POS information is most linearly accessible in early layers, and later layers have reorganized the representation to prioritize semantic information, partially overwriting the linear POS structure
(c) The probe needs more training data for layer 11

Chapter 2: Attention Visualization

Attention weights are the most intuitive window into a transformer. Each attention head produces a matrix of weights: how much each token "attends to" every other token. Visualizing these weights creates attention heatmaps that sometimes reveal interpretable patterns.

In a sentence like "The cat sat on the mat because it was tired", which tokens does "it" attend to? If the model has learned coreference, "it" should attend strongly to "cat" (its referent). If we see this in the attention weights, it suggests the model is performing something like coreference resolution internally.
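For reference, the weights being visualized are the rows of softmax(Q · Kᵀ / √d_head). The sketch below computes them in NumPy with random queries and keys as placeholders, so the resulting pattern is meaningless; only the computation and the shape of the result are the point. In a real analysis Q and K would come from one head of a trained model.

```python
import numpy as np

def attention_weights(Q, K):
    """Scaled dot-product attention weights: row-wise softmax(Q K^T / sqrt(d_head))."""
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)               # [seq_len, seq_len]
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

tokens = ["The", "cat", "sat", "because", "it", "was", "tired"]
rng = np.random.default_rng(0)
Q = rng.normal(size=(len(tokens), 64))   # placeholder queries, one per token
K = rng.normal(size=(len(tokens), 64))   # placeholder keys, one per token

A = attention_weights(Q, K)              # the heatmap: one row per attending token
it_row = A[tokens.index("it")]           # how much "it" attends to each token
```

Each row is a probability distribution over the tokens being attended to, which is why heatmap rows always sum to one.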

Interactive Attention Heatmap

Brighter cells = stronger attention. Different heads learn different patterns: positional, syntactic, and semantic.

Common Attention Patterns

Clark et al. (2019) analyzed BERT's 144 attention heads (12 layers × 12 heads) and found recurring patterns:

Positional heads: Some heads attend primarily to the previous token, the next token, or the first token in the sequence. These implement a kind of local context window, similar to a convolution.

Syntactic heads: Certain heads align with dependency relations. In "The cat sat", a syntactic head on "sat" attends to "cat" (its subject). These heads appear to implement aspects of syntactic parsing.

Separator heads: Some heads attend primarily to [SEP] or [CLS] tokens. Clark et al. argued these act as "no-op" heads: when a token doesn't need to attend to anything meaningful, it dumps attention on a special token.

Coreference heads: Rare but fascinating: heads where pronouns attend to their antecedents. "It" attends to "cat" when "it" refers to the cat. These implement a form of coreference resolution.

The Attention-as-Explanation Problem

Attention visualization is seductive but dangerous. The temptation is to interpret attention weights as "the model is looking at X because of Y." But this reasoning has serious flaws:

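1. Attention is not explanation. Jain & Wallace (2019) showed, across several text classification models (mostly recurrent encoders with an attention layer), that attention weights can often be randomly permuted with little change in model output, and that they correlate only weakly with gradient-based importance measures. If the model produces the same answer with shuffled attention, the original attention pattern can't be "the reason" for the output.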

2. Multiple heads. A transformer has many heads per layer and many layers. Visualizing one head shows one tiny slice of the model's computation. The actual computation combines all heads through residual connections. Cherry-picking one head that shows an interpretable pattern ignores the 143 other heads.

3. Residual stream. In modern interpretability frameworks, the important object is not individual attention weights but the residual stream: the cumulative sum of all layer outputs. Each layer reads from and writes to this shared stream. Attention weights are just part of how one layer contributes.
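The permutation argument in point 1 can be sketched on a toy model. The classifier below (a sigmoid over an attention-weighted sum of value vectors) and all of its numbers are assumptions made for illustration; this is in the spirit of a permutation test, not a reproduction of any published experiment.

```python
import numpy as np

def pooled_prediction(attn, values, w):
    """Toy attention classifier: sigmoid of w . (attention-weighted sum of values)."""
    context = attn @ values
    return 1.0 / (1.0 + np.exp(-(context @ w)))

def permutation_sensitivity(attn, values, w, n_perm=200, seed=0):
    """Mean absolute change in the prediction when attention weights are shuffled."""
    rng = np.random.default_rng(seed)
    base = pooled_prediction(attn, values, w)
    deltas = [abs(pooled_prediction(rng.permutation(attn), values, w) - base)
              for _ in range(n_perm)]
    return float(np.mean(deltas))

attn = np.array([0.9, 0.025, 0.025, 0.025, 0.025])   # sharply peaked on token 0
w = np.array([3.0, -3.0, 3.0, -3.0, 0.0])

# Case 1: every token carries the same value vector, so attention cannot matter.
similar_values = np.tile([1.0, 0.0, 0.5, 0.0, 0.0], (5, 1))
# Case 2: every token carries a different value vector, so attention matters a lot.
distinct_values = 4.0 * np.eye(5)

low = permutation_sensitivity(attn, similar_values, w)
high = permutation_sensitivity(attn, distinct_values, w)
```

The same peaked attention pattern is irrelevant in the first case and decisive in the second, which is why a heatmap alone cannot tell you whether the pattern matters to the output.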

Attention weights are correlational, not causal. Seeing that "it" attends to "cat" doesn't prove the model uses this attention to resolve the pronoun. It could be an epiphenomenon — a side effect of the actual computation happening in the value vectors and subsequent layers. Use attention visualization for hypothesis generation, not for claiming understanding.

Despite these limitations, attention visualization remains valuable as a hypothesis generation tool. When you see a suggestive pattern, you can test it with probing, ablation, or causal interventions. Attention shows you where to look; other methods tell you what you're looking at.

A researcher finds that an attention head makes "it" attend to "dog" in "The dog ran because it was excited." They conclude the model understands coreference. Is this conclusion valid?

(a) Yes: the attention pattern perfectly matches the coreference
(b) No: the model can't possibly understand coreference
(c) Not from this evidence alone: attention is correlational, not causal. The pattern could be an epiphenomenon. They should test with causal interventions (e.g., ablating that head and checking if coreference accuracy drops)

Chapter 3: Sparse Autoencoders

Probing asks "is property X encoded here?" but you have to know what to look for. What if you want to discover what the model has learned without presupposing the answer? This is the promise of Sparse Autoencoders (SAEs) — the most exciting recent development in interpretability.

The core problem: a single neuron in a language model typically doesn't correspond to a single concept. A neuron might activate for "dogs," "loyalty," "the number 7," and "sentences ending in periods," all at once. Such neurons are called polysemantic. The leading explanation is superposition: the model stores more concepts than it has neurons by encoding concepts as directions in activation space, not as individual neurons.

Think of it this way. Your model has 768 dimensions per token. But it might encode 10,000 distinct concepts. How? By using directions in the 768-dimensional space, not individual axes. The "dog" concept might be a direction at 45° between dimensions 17 and 342. The "loyalty" concept might be a different direction. Because the space is high-dimensional, thousands of nearly-orthogonal directions can coexist.

How SAEs Work

A Sparse Autoencoder learns to decompose a model's activations into a sum of interpretable features. It has two components:

Encoder: Maps the model's activation (dimension d) into a much larger feature space (dimension D >> d), then applies a sparsity constraint so that only a few features are active at any time.

Decoder: Maps from the sparse feature space back to the original dimension, reconstructing the activation.

h = encoder(x) = ReLU(W_enc · x + b_enc)
x̂ = decoder(h) = W_dec · h + b_dec

The key constraint: sparsity. The L1 penalty on the hidden layer ensures that only ~10-50 features activate for any given input, out of potentially 100,000+ total features. This forces each feature to be specific: instead of a neuron that fires for "dogs + loyalty + 7 + periods," the SAE learns separate features for each concept.

python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, n_features=65536, l1_coeff=0.01):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        # Decoder keeps its bias so that x_hat = W_dec · h + b_dec, as in the equation above
        self.decoder = nn.Linear(n_features, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, x):
        # Encode: project to the larger feature space; ReLU keeps activations non-negative
        h = torch.relu(self.encoder(x))  # [batch, n_features]; mostly zeros once trained

        # Decode: reconstruct the original activation
        x_hat = self.decoder(h)  # [batch, d_model]

        # Loss: reconstruction error plus an L1 penalty that pushes h toward sparsity
        recon_loss = (x - x_hat).pow(2).mean()
        sparsity_loss = h.abs().mean()

        return x_hat, recon_loss + self.l1_coeff * sparsity_loss

What SAE Features Look Like

When you train an SAE on a language model and examine what activates each feature, you find remarkably specific concepts:

Feature	Activates On	Interpretation
#4721	"Paris", "London", "Tokyo", "Berlin"	Capital cities
#892	"if", "unless", "provided that", "assuming"	Conditional expressions
#15003	"The Golden Gate Bridge", "bay", "fog", "San Francisco"	SF/Golden Gate concept
#33201	Code indentation, "def", "class", "return"	Python code structure
#7890	"dangerous", "risky", "harmful", "lethal"	Danger/harm concept

The famous "Golden Gate Bridge feature" (Anthropic, 2024) became a public demonstration: artificially amplifying this single SAE feature caused Claude to bring up the Golden Gate Bridge in nearly every response, regardless of the topic. This is strong evidence that the feature is not just correlated with the concept but causally influences the model's behavior.
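Amplifying a feature can be pictured as editing the activation along that feature's decoder direction. The sketch below shows one plausible way to do it with random placeholder weights: encode, overwrite one feature's activation, decode, and add back whatever the SAE failed to reconstruct. The function name `steer` and the choice to preserve the reconstruction error are assumptions for illustration, not a description of Anthropic's implementation.

```python
import numpy as np

def steer(x, W_enc, b_enc, W_dec, b_dec, feature, value):
    """Clamp one SAE feature to `value` and return the edited activation."""
    h = np.maximum(W_enc @ x + b_enc, 0.0)      # SAE feature activations (ReLU encoder)
    error = x - (W_dec @ h + b_dec)             # part of x the SAE does not reconstruct
    h_steered = h.copy()
    h_steered[feature] = value                  # overwrite the chosen feature
    return W_dec @ h_steered + b_dec + error    # decode, keeping the error term unchanged

# Random placeholder SAE weights and activation (not from a trained model).
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
W_enc = rng.normal(size=(n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(d_model, n_features))
b_dec = np.zeros(d_model)
x = rng.normal(size=d_model)

k, value = 7, 10.0
x_steered = steer(x, W_enc, b_enc, W_dec, b_dec, feature=k, value=value)
```

Because the error term is preserved, the edit reduces to adding a multiple of one decoder column to x, so every other feature's contribution is left untouched.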

SAE Feature Decomposition

A model activation (768 dims) is decomposed into ~5 active SAE features (out of 65K). Each feature represents a specific concept and contributes its own piece of the reconstructed activation.

SAEs decompose superposition. A single neuron encodes many concepts simultaneously (superposition). An SAE separates them into distinct features — each with a clear interpretation. This is the most powerful discovery tool in modern interpretability: you don't need to know what to look for. The SAE finds the concepts the model has learned.

The Superposition Hypothesis in Detail

Why do models use superposition at all? Elhage et al. (2022) showed that superposition is computationally efficient. If a model needs to represent 10,000 concepts but has only 768 dimensions, it has two options:

Option A: Dedicated neurons. Assign each concept to one neuron. Only 768 concepts can be represented. This is clean but severely capacity-limited.

Option B: Superposition. Represent each concept as a direction in 768-dimensional space. In high dimensions, random directions are nearly orthogonal: the cosine similarity between two random 768-dimensional vectors averages zero, with a typical magnitude of about 1/√768 ≈ 0.036. So you can pack ~10,000 nearly-orthogonal directions into 768 dimensions, with only small interference between them.

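The math: as a rough rule of thumb, d dimensions can hold on the order of d² directions whose pairwise cosine similarity stays around 1/√d. For d=768, that's ~590,000 directions. This is a heuristic rather than a hard ceiling: if larger interference is tolerable, the number of nearly-orthogonal directions grows exponentially with d. Models have far more capacity for concepts than their raw dimensionality suggests.

concepts at ~1/√d interference ≈ d² = 768² = 589,824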
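The near-orthogonality of random directions is easy to check numerically. This sketch samples random unit vectors in 768 dimensions and measures their pairwise cosine similarities; the sample size of 1,000 is an arbitrary choice to keep the computation light.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 768, 1000

V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)      # n random unit vectors

cos = V @ V.T                                      # all pairwise cosine similarities
off_diag = cos[~np.eye(n, dtype=bool)]             # drop each vector's similarity with itself

mean_cos = off_diag.mean()                         # average similarity, close to zero
typical = np.sqrt((off_diag ** 2).mean())          # root-mean-square similarity
predicted = 1.0 / np.sqrt(d)                       # theoretical RMS value, about 0.036
```

The root-mean-square similarity lands close to 1/√d, which is the size of the interference any one concept direction picks up from another.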

But there's a cost: interference. When concept A is active, it slightly activates the neurons for concepts B, C, D that happen to have partially overlapping directions. The SAE separates these by projecting into a much higher-dimensional space (65,536 dimensions) where each concept gets its own dedicated feature with near-zero interference.

Scaling SAEs: From Toy Models to Frontier Models

The SAE approach has been validated at increasing scales:

Study	Model	SAE Size	Key Finding
Cunningham et al. 2023	Pythia-70M	32K features	Found interpretable features for syntax, semantics, formatting
Bricken et al. 2023	1-layer transformer	4K features	Recovered monosemantic features (e.g., Arabic script, DNA sequences, base64) from a small model
Templeton et al. 2024	Claude 3 Sonnet	34M features	Scaled to production model; found the Golden Gate Bridge feature and safety-relevant features
OpenAI 2024	GPT-4	16M features	Scaling laws for SAE size and sparsity, using TopK autoencoders

The trend is clear: SAEs scale to frontier models and find features that are both interpretable and causally relevant. Manipulating a single SAE feature can change model behavior in predictable ways, confirming that these features aren't just correlational artifacts.

Limitations of SAEs

SAEs aren't perfect. Key limitations include:

Reconstruction fidelity. SAEs don't perfectly reconstruct the original activations. The gap (typically 2-5% reconstruction error) means some information is lost in the decomposition. The features we find are a good approximation of the model's internal representation, not an exact description.

Feature splitting. Sometimes one concept splits across multiple SAE features. "Dogs" might activate features for "canine," "pet," and "furry animal." It's not always clear whether these are genuinely distinct concepts or artifacts of the SAE training.

Scale. Training SAEs on the largest models requires enormous compute. An SAE for GPT-4 with 100K features per layer, across all layers, would have billions of parameters itself.

Polysemy. Some features are polysemous: a single SAE feature fires on multiple unrelated concepts. This could mean the SAE hasn't fully decomposed superposition, or it could mean the model genuinely treats these concepts as related (even if humans wouldn't).
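The scale limitation is easy to quantify with a parameter count. The sketch below counts weights and biases for one SAE and multiplies by the number of layers; the BERT-base-sized numbers (d_model = 768, 12 layers, 65,536 features) are used only because they appear elsewhere in these notes, and `sae_param_count` is a hypothetical helper.

```python
def sae_param_count(d_model, n_features, decoder_bias=True):
    """Parameters in one SAE: encoder weights + bias, decoder weights (+ optional bias)."""
    encoder = d_model * n_features + n_features
    decoder = n_features * d_model + (d_model if decoder_bias else 0)
    return encoder + decoder

per_layer = sae_param_count(d_model=768, n_features=65536)   # 100,729,600
all_layers = 12 * per_layer                                   # 1,208,755,200
```

Even at this small scale, one SAE per layer adds up to about 1.2 billion parameters, roughly ten times the size of the 110M-parameter model being interpreted. Frontier models have much wider activations and many more layers, so the cost grows accordingly.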

A single neuron in a language model activates for "dogs," "loyalty," and "the number 7." This polysemantic behavior is attributed to superposition. What does a Sparse Autoencoder do to address this?

(a) It decomposes the neuron's activation into separate sparse features (one for "dogs," one for "loyalty," one for "7") by projecting into a much higher-dimensional space with a sparsity constraint so each feature is specific
(b) It removes the neuron from the model
(c) It retrains the model so each neuron represents exactly one concept

Chapter 4: Agentic Interpretability

SAEs produce thousands of features. But who labels them? The traditional approach is manual: a researcher examines the top-activating examples for each feature and writes a description. "Feature #4721 activates on capital cities." This doesn't scale. A single SAE might have 65,000+ features. Manually labeling all of them would take years.

The solution is recursive: use AI to interpret AI. This is agentic interpretability, building on automated neuron explanation at OpenAI (Bills et al. 2023) and automated feature labeling at Anthropic (Bricken et al. 2023). An AI agent examines each SAE feature, proposes a label, tests its hypothesis against held-out examples, and refines the description iteratively.

The Agent Loop

1. Observe

Agent examines top-20 activating examples for a feature. "This feature fires on: Paris, London, Tokyo, capital, government, Washington D.C."

↓

2. Hypothesize

Agent proposes a label: "This feature detects capital cities of countries."

↓

3. Test

Agent crafts test inputs: "Berlin" (should activate), "Munich" (shouldn't), "Springfield" (ambiguous). Checks against actual feature activations.

↓

4. Refine

If tests reveal exceptions (feature also fires on "seat of government"), agent revises: "Seats of government, including national capitals and state capitals."

↻ repeat until confident
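The loop above can be sketched in a few lines. Everything here is a hypothetical stand-in: `feature_activation` replaces a lookup of a real SAE feature with a toy table, and the list of candidate hypotheses replaces labels that a language model would propose and revise. Only the control flow (test each hypothesis, record the counterexamples, move on to a revised one) is meant to carry over.

```python
NATIONAL = {"Paris", "London", "Tokyo", "Berlin", "Washington D.C."}
STATE = {"Sacramento", "Austin"}

def feature_activation(text):
    """Toy stand-in for reading a real SAE feature: fires on seats of government."""
    return 1.0 if text in NATIONAL | STATE else 0.0

# Candidate labels from narrow to broad. In practice an LLM would write and revise these.
HYPOTHESES = [
    ("national capitals", lambda t: t in NATIONAL),
    ("seats of government, national or state", lambda t: t in NATIONAL | STATE),
]

def interpret(test_inputs, threshold=0.5):
    """Return the first label whose predictions match the feature on every test input."""
    history = []
    for label, predicts_active in HYPOTHESES:                  # hypothesize / refine
        errors = [t for t in test_inputs                       # test
                  if predicts_active(t) != (feature_activation(t) > threshold)]
        history.append((label, errors))                        # counterexamples drive refinement
        if not errors:
            return label, history
    return None, history

label, history = interpret(["Berlin", "Munich", "Sacramento", "pizza"])
```

Here the first hypothesis fails on "Sacramento", which is exactly the kind of counterexample that pushes the agent toward the broader label.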

Why Agents Are Better Than Humans

AI agents for interpretability have several advantages over human labelers:

Scale. An agent can label 65,000 features in hours. A human team would need months.

Consistency. The agent applies the same labeling criteria to every feature. Humans get tired, distracted, and inconsistent.

Active testing. The agent can generate arbitrary test inputs to probe the feature boundary. A human is limited to the examples in the dataset.

Iteration. The agent can refine its hypothesis through multiple rounds of testing. Each round sharpens the description. This is the scientific method automated: observe, hypothesize, test, refine.

Scoring Agent Labels

How do you evaluate whether the agent's labels are good? Two approaches:

Simulation accuracy. Given the agent's description and a new input, can the agent predict whether the feature will activate? If the description is "capital cities" and the input is "Paris," the agent should predict high activation. If the input is "pizza," low activation. The fraction of correct predictions is the simulation accuracy.

Human agreement. Show human raters the agent's labels alongside the top-activating examples. Do humans agree the label is accurate? Studies show 80-90% agreement rates for well-trained agents.
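Simulation accuracy as described above is just the fraction of inputs where the predicted on/off state matches the feature's real behavior. The inputs, predictions, and activation values in this sketch are made up for illustration, and the 0.5 threshold for "active" is an assumption.

```python
def simulation_accuracy(predicted_active, actual_activations, threshold=0.5):
    """Fraction of inputs where the predicted on/off state matches the real feature."""
    actual_active = [a > threshold for a in actual_activations]
    correct = sum(p == a for p, a in zip(predicted_active, actual_active))
    return correct / len(actual_active)

inputs = ["Paris", "pizza", "Tokyo", "Munich"]
predicted = [True, False, True, True]    # guesses made from the label "capital cities"
actual = [0.9, 0.0, 0.8, 0.1]            # hypothetical measured feature activations

score = simulation_accuracy(predicted, actual)   # 3 of 4 correct: 0.75
```

The one miss ("Munich" predicted active, but the feature stays off) is informative in itself: it suggests the simulator read the label as "major cities" rather than "capitals".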

Agentic interpretability scales interpretability research from "a few features per paper" to "all features at once." By using language models to interpret language model features, we can label hundreds of thousands of SAE features automatically. The agent applies the scientific method: observe examples, form a hypothesis, test it against new data, and refine. This makes comprehensive model auditing feasible for the first time.

An AI agent labels SAE feature #892 as "conditional expressions." It tests this by checking activation on "if," "unless," and "provided that" (all activate) and "apple," "running," "blue" (none activate). Why does the agent also test "in case" and "whether"?

(a) To increase the dataset size
(b) To test the boundaries of its hypothesis: "in case" and "whether" are edge cases that might or might not count as conditional expressions, and their activation pattern reveals whether the feature captures a narrow or broad notion of conditionality
(c) To find bugs in the SAE

Chapter 5: Concept Discovery

The showcase of this lesson: an interactive explorer that simulates how interpretability tools discover concepts inside a neural network. You'll see how probes, attention, and SAE features combine to build a picture of what a model has learned.

McGrath et al. (2022) probed AlphaZero (the chess/Go engine) for human chess concepts and found that the network had independently learned many of them: material balance, king safety, pawn structure, piece activity. Nobody programmed these concepts; AlphaZero learned them from self-play alone. The probes revealed that the network represents concepts remarkably similar to those grandmasters use to describe their thinking.

Concept Discovery Explorer

A simulated model's internal features, viewed one layer at a time. An activation threshold determines which features count as responding to a given input.

What the Explorer Shows

The explorer simulates three layers of interpretability analysis:

Feature activations (top): Each bar represents an SAE feature. Taller bars = stronger activation for the current input. Most features are near zero (sparsity); the few active ones correspond to concepts relevant to the input.

Concept clusters (middle): Features that co-activate on similar inputs form clusters. "Paris is beautiful" activates the geography cluster, the aesthetics cluster, and the European culture cluster. "def main():" activates the programming cluster and the Python-specific cluster.

Layer progression (bottom): As you change layers, the active features shift from surface-level (character patterns, token identity) to semantic-level (meaning, intent, safety). This mirrors what probing studies find: early layers = surface features, middle layers = syntax, late layers = semantics.

From Features to Circuits

Individual features are interesting, but the real prize is circuits: connected pathways of features across layers that implement specific behaviors. A circuit might look like:

Layer 2, Feature #301

"Detects 'not' and negation words"

↓ feeds into

Layer 5, Feature #4412

"Detects negated sentiment (not happy, not good)"

↓ feeds into

Layer 9, Feature #12001

"Classifies overall text sentiment as negative"

Finding these circuits is the frontier of interpretability research. If we can map all the circuits in a language model, we would have a complete understanding of how it processes language — a full reverse-engineering of the neural network.

Currently, circuit discovery is painstaking. Researchers trace connections manually, looking at which features in layer L+1 are influenced by features in layer L through the attention and MLP pathways. This is where agentic methods become essential: AI agents can explore the circuit graph far faster than humans, proposing and testing hypotheses about inter-feature connections.
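Testing whether a feature belongs to a circuit usually comes down to an intervention: zero the feature out and see whether the downstream behavior changes. The toy pathway below mirrors the three-stage negation example with hand-written arithmetic; the function, the feature names, and the weights are all invented for illustration and do not come from a trained model.

```python
def run_circuit(has_negation, has_positive_word, ablate=None):
    """Toy three-feature pathway; `ablate` names a feature to zero out."""
    negation = 1.0 if has_negation else 0.0            # early feature: detects "not"
    if ablate == "negation":
        negation = 0.0
    positive = 1.0 if has_positive_word else 0.0

    negated_sentiment = negation * positive            # middle feature: "not" + positive word
    if ablate == "negated_sentiment":
        negated_sentiment = 0.0

    # Late feature: overall sentiment score (> 0 positive, < 0 negative).
    return positive - 2.0 * negated_sentiment

baseline = run_circuit(True, True)                             # "not good": negative
no_negation = run_circuit(True, True, ablate="negation")       # early feature removed
no_middle = run_circuit(True, True, ablate="negated_sentiment")
```

Ablating either upstream feature flips the output from negative to positive, which is the signature of a feature that is causally part of the pathway rather than merely correlated with it.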

The ultimate goal of mechanistic interpretability: a complete circuit diagram of a language model. Just as we can trace the logic of a computer program through its source code, interpretability aims to trace the "logic" of a neural network through its features and circuits. We're not there yet — but SAEs and agentic interpretability are making progress faster than anyone expected.

Known Circuits

Despite the difficulty, several circuits have been mapped in detail in smaller models:

Circuit	Model	What It Does	Discovered By
Induction heads	Small attention-only transformers	Copy patterns: if "A B...A" appears, predict "B" next	Olsson et al. 2022
Indirect Object ID	GPT-2	In "When Mary and John went to the store, John gave a drink to", predict "Mary"	Wang et al. 2022
Greater-than	GPT-2	In "The war lasted from the year 1732 to the year 17", favor completions greater than 32	Hanna et al. 2023
Negation	Small transformers	Flip sentiment from positive to negative when "not" appears	Various 2023-2024

The induction head circuit is particularly important because Olsson et al. argue it underlies a fundamental capability: in-context learning. When the model sees "Harry Potter...Harry" and predicts "Potter," it's using a two-step circuit. In layer 1, a "previous token head" writes into each position information about the token that came just before it, so the position holding "Potter" is tagged as "preceded by Harry." In layer 2, an "induction head" at the second "Harry" searches for a position tagged as "preceded by Harry," finds "Potter," and copies it forward as the prediction. This two-layer circuit is one of the simplest building blocks of in-context learning.
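The behavior this circuit implements can be written as a plain lookup rule, which is a useful reference when checking whether a model's predictions match it. The function below describes the input-output behavior only, not how attention heads compute it, and its name is a hypothetical choice.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent earlier
    occurrence of the current (last) token. Returns None if it never appeared before."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan earlier positions, newest first
        if tokens[i] == current:
            return tokens[i + 1]               # copy the token that came next
    return None

context = ["Harry", "Potter", "went", "home", ".", "Harry"]
prediction = induction_predict(context)        # "Potter"
```

The rule needs no knowledge of what the tokens mean, which is why the same circuit works on arbitrary repeated sequences.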

Finding these circuits required hundreds of hours of manual analysis. The agentic approach (Ch 4) promises to accelerate this dramatically by having AI agents propose and test circuit hypotheses automatically.

Chapter 6: New Vocabulary for AI

Interpretability research is revealing something profound: language models develop internal representations that don't map neatly onto existing human vocabulary. We may need new words to describe what AI systems are doing.

Consider this: when a language model processes "The trophy doesn't fit in the suitcase because it is too big," it must determine that "it" refers to "trophy." Humans call this "coreference resolution." But what the model actually does may be fundamentally different from what humans do. The model doesn't have a concept of physical objects or spatial reasoning in the way we do. It has learned statistical patterns about pronoun reference from text alone.

Should we say the model "understands" coreference? "Computes" coreference? "Simulates" coreference? Each word carries different implications about what's happening inside. The choice of vocabulary shapes our thinking about AI capabilities and limitations.

The Analogy Problem

When we describe AI systems, we inevitably use analogies from human cognition: the model "knows," "thinks," "reasons," "understands." But these analogies can be misleading. A model doesn't "know" Paris is in France the way you know it — it has a statistical association between "Paris" and "France" that produces the right answer in many contexts but can fail in unexpected ways.

Analogy Mapper

Human concepts vs. what the model might actually be doing. Click terms to see the gap between the analogy and the mechanism.

Proposed New Terms

Several researchers have proposed vocabulary specifically for AI phenomena that don't have good human analogues:

New Term	What It Describes	Why Existing Terms Fail
Superposition	Encoding more features than dimensions	"Memory" implies retrieval; superposition is simultaneous encoding
Grokking	Sudden generalization long after memorization	"Learning" implies gradual; grokking is abrupt
In-context learning	Adapting behavior from the prompt alone	"Learning" usually implies weight updates; ICL has none
Sycophancy	Agreeing with the user regardless of truth	"Lying" implies intent; sycophancy is a learned pattern
Confabulation	Generating plausible but false information	"Hallucination" implies perception; models don't perceive

Even "hallucination," the most common term for model fabrications, is an analogy from human psychology that may be misleading. Humans hallucinate when their perceptual system produces experiences with no external stimulus. Models don't have perception. They generate text that is locally coherent but factually wrong. "Confabulation" (from neuropsychology: filling in gaps in memory with plausible fabrications) might be more accurate.

The words we use to describe AI shape how we think about AI. Calling model outputs "understanding" implies human-like cognition. Calling them "statistical pattern matching" implies simplistic computation. The truth is probably somewhere in between, and we may need entirely new vocabulary to capture it accurately. Interpretability research is building the empirical foundation for that vocabulary by revealing what models actually do internally.

The Spectrum of Understanding

Rather than a binary "understands vs doesn't understand," interpretability research suggests a spectrum of computational competence:

Level	What the Model Does	Human Analogy	Evidence
0. Pattern matching	Copies surface statistics from training data	Parrot repeating phrases	Fails on novel phrasing of known facts
1. Soft rules	Has learned approximate regularities	Child who says "goed" instead of "went"	Overgeneralizes but captures the pattern
2. Compositional	Combines known pieces to handle novel inputs	Student applying formula to new problem	Succeeds on novel combinations of known components
3. Abstract	Has learned the underlying structure	Expert who can explain the "why"	Robust to distribution shift; transfers to new domains

Most current models operate somewhere between levels 1 and 2 for most tasks. They've learned soft rules that work most of the time but break in predictable ways. For a few narrow tasks (like syntactic agreement), they may reach level 3. For others (like causal reasoning), they rarely exceed level 1.

Interpretability tools give us the evidence to make these distinctions. Instead of arguing about whether models "understand," we can ask the precise question: "Does the model have an internal representation that corresponds to concept X, and does it use that representation causally in producing output Y?" This is empirically testable with SAEs and causal interventions.
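A minimal sketch of the "used causally" half of that question, under toy assumptions: random vectors stand in for hidden states, a fixed linear `readout` stands in for the rest of the network, and the candidate directions are chosen by hand rather than taken from a trained SAE. The helper names are illustrative only.

```python
import numpy as np


def ablate_direction(h: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden state (row of h) along direction v."""
    v = v / np.linalg.norm(v)
    return h - np.outer(h @ v, v)


def causal_effect(h: np.ndarray, v: np.ndarray, readout) -> float:
    """Mean absolute change in the output when direction v is ablated."""
    return float(np.mean(np.abs(readout(h) - readout(ablate_direction(h, v)))))


rng = np.random.default_rng(0)
h = rng.normal(size=(200, 4))           # toy "hidden states"
w = np.array([1.0, 0.0, 0.0, 0.0])      # toy downstream computation reads only dim 0


def readout(x: np.ndarray) -> np.ndarray:
    return x @ w


used = np.array([1.0, 0.0, 0.0, 0.0])    # direction the readout depends on
unused = np.array([0.0, 1.0, 0.0, 0.0])  # direction present in h but ignored downstream

effect_used = causal_effect(h, used, readout)
effect_unused = causal_effect(h, unused, readout)
print(f"ablate used direction:   {effect_used:.3f}")
print(f"ablate unused direction: {effect_unused:.3f}")
```

Both directions carry information that a probe could decode from `h`, but only ablating the first changes the output. That gap between "represented" and "used" is what the causal intervention measures.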

Why New Vocabulary Matters for Safety

The vocabulary problem isn't just philosophical — it has practical implications for AI safety. If we say a model "knows" something is dangerous, we might assume it will reliably avoid that thing. But if the model merely has a statistical association between "dangerous" and "avoid," that association can be overridden by conflicting patterns. The word "knows" creates false confidence in the model's reliability.

More precise vocabulary leads to more precise safety claims. Instead of "the model knows not to generate harmful content," we should say "the model has a feature (SAE #7890) that activates on danger-related inputs and, through circuit [X→Y→Z], suppresses generation of certain token sequences. This suppression can be bypassed by inputs that don't trigger feature #7890." This is ugly but accurate — and accuracy matters when deploying systems that affect millions of people.

Precise vocabulary enables precise safety claims. "The model understands safety" is unfalsifiable. "SAE feature #7890 detects danger concepts and suppresses certain outputs through circuit X→Y→Z" is testable, auditable, and improvable. Interpretability transforms vague claims about model behavior into specific, verifiable statements about internal mechanisms.

A researcher says "GPT-4 understands French grammar." Based on this lecture, what is the most precise criticism of this statement?

GPT-4 doesn't process French text at all
"Understands" is an analogy from human cognition that may be misleading: GPT-4 has learned statistical patterns that produce grammatically correct French, but whether this constitutes "understanding" depends on your definition, and interpretability research hasn't yet determined if the internal mechanism resembles human grammatical knowledge
The statement is perfectly fine and precise

Chapter 7: Connections

Interpretability is evolving from "academic curiosity" to "essential infrastructure." As models become more capable and more widely deployed, understanding their internals becomes both more important and more feasible.

Key Papers

Paper	Contribution	Connection
Agentic Interpretability (Anthropic 2025)	Using AI agents to automatically interpret SAE features	Core of Ch 4 — scalable feature labeling
Acquisition of Chess Knowledge in AlphaZero (McGrath 2022)	Probing and matrix factorization recover human chess concepts in a self-play agent	Concept discovery beyond language (Ch 5)
New Vocabulary for AI (Butlin 2024)	Argues we need new terms for AI phenomena	Vocabulary problem (Ch 6)
Neologism Learning in LLMs (Kim 2025)	How models acquire new words from context	Models create internal "neologisms" too (Ch 6)

Lecture Connections

Lecture	Relationship
L05: Transformers	Attention visualization (Ch 2) only makes sense if you understand how attention works. Probing (Ch 1) examines the representations that transformers produce at each layer.
L13: Reasoning Part 2	Process Reward Models verify reasoning steps. SAEs could reveal what the PRM has learned about "correct" vs "incorrect" reasoning — connecting interpretability to reliability.
L14: Tokenization	SAE features often correspond to token-level patterns (specific subword tokens, token positions). Understanding tokenization helps interpret what features are detecting at the input level.

Historical Arc

Era	Method	Insight
2018-2019	Linear probing (Tenney, Hewitt)	Transformers encode linguistic hierarchy across layers
2019-2020	Attention analysis (Clark, Vig)	Some heads implement specific syntactic operations
2021-2022	Circuit discovery (Elhage, Olsson)	Induction heads implement in-context learning
2022-2023	Superposition theory (Elhage et al.)	Models pack more features than dimensions
2023-2024	Sparse autoencoders at scale (Bricken, Templeton)	Individual features can be isolated and manipulated
2024-2025	Agentic interpretability (Anthropic)	AI can interpret AI at scale; 34M features labeled

The Big Picture

Interpretability Toolkit

How the interpretability methods we covered relate to each other, from simple (probing) to comprehensive (agentic interpretability).

What We Covered

Method	What It Reveals	Limitation
Linear Probing	What information is linearly encoded at each layer	Correlational: accessible ≠ used
Attention Viz	Token-level attention patterns per head	Correlational: attention ≠ explanation
SAEs	Sparse, interpretable features the model has learned	Reconstruction isn't perfect; features may split
Agentic Interp.	Automated labels for thousands of features	Agent accuracy depends on the interpreting model
Circuit Discovery	Connected pathways implementing specific behaviors	Extremely labor-intensive; early stages

Looking Ahead

Interpretability is one of the fastest-growing areas of AI research. Several trends will define the next few years:

1. Automated interpretability at scale. Agentic methods will enable comprehensive auditing of frontier models. Instead of studying a few features by hand, we'll have automated pipelines that catalog every feature and circuit, producing "interpretability reports" for each model release.

2. Interpretability-informed training. If we can identify the features responsible for harmful behavior, we can penalize those features during training, creating models that are safe by construction rather than by post-hoc patching.

3. Runtime monitoring. SAE features could be monitored during inference, flagging when danger-related or deception-related features activate. This would create an interpretability-based safety layer that complements existing guardrails.

4. Cross-model comparison. As SAE methods mature, we'll be able to compare the internal representations of different models: does Claude have the same "capital cities" feature as GPT-4? Do they implement arithmetic through the same circuit? This could reveal universal computational motifs in neural networks.

5. Connecting to neuroscience. The methods developed for AI interpretability (probing, feature decomposition, circuit tracing) are increasingly being applied to brain imaging data. Conversely, neuroscience concepts (superposition, sparse coding, hierarchical processing) inspire interpretability research. The cross-pollination is accelerating.

6. Interpretability-based model editing. If we can identify the circuit responsible for a specific behavior, we can surgically edit it, changing the model's behavior on one task with minimal effect on others. This is potentially far more targeted than retraining or fine-tuning. Early work that uses "activation patching" and "causal tracing" to localize where a behavior is computed, followed by targeted edits at those locations, suggests this is feasible for simple behaviors. Extending it to complex behaviors at scale is the challenge.

7. Standardized interpretability benchmarks. The field currently lacks standardized benchmarks. How do you measure whether an interpretability method "works"? Proposals include simulation accuracy (can the method predict model behavior?), causal faithfulness (do the identified features actually drive behavior?), and human agreement (do humans find the explanations meaningful?). Establishing these benchmarks will be crucial for comparing methods rigorously.
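As a rough illustration of the runtime-monitoring idea in item 3, the sketch below encodes one activation vector with a ReLU SAE encoder and flags watched features that cross a threshold. Everything concrete here is an assumption made for the example: the tiny hand-written weights, the `watchlist` of feature ids, the threshold value, and the function names. A real monitor would load a trained SAE and pick thresholds empirically.

```python
import numpy as np


def sae_encode(x: np.ndarray, W_enc: np.ndarray, b_enc: np.ndarray) -> np.ndarray:
    """ReLU encoder: feature activations for one activation vector x."""
    return np.maximum(0.0, W_enc @ x + b_enc)


def flag_features(acts: np.ndarray, watchlist: list[int], threshold: float) -> list[int]:
    """Return the watched feature ids whose activation exceeds the threshold."""
    return [i for i in watchlist if acts[i] > threshold]


# Hypothetical 3-feature SAE over a 2-dimensional activation.
W_enc = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [-1.0, -1.0]])
b_enc = np.zeros(3)

watchlist = [1]    # pretend feature 1 has been labeled "danger-related"
threshold = 0.5    # illustrative value, not a recommendation

benign = np.array([2.0, 0.1])
risky = np.array([0.2, 1.5])

print(flag_features(sae_encode(benign, W_enc, b_enc), watchlist, threshold))  # []
print(flag_features(sae_encode(risky, W_enc, b_enc), watchlist, threshold))   # [1]
```

The monitor is only as good as the feature labels and the SAE's reconstruction quality, which is why it would complement existing guardrails rather than replace them.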

"What I cannot create, I do not understand." — Richard Feynman. Interpretability inverts this: what I cannot explain, I do not control. As AI systems become more powerful, our ability to understand them determines whether we can deploy them safely. The tools in this lecture — probes, SAEs, agentic labeling — are building the vocabulary and methodology for that understanding.