Peering inside the black box — probes, attention maps, sparse autoencoders, and the search for concepts in billions of parameters.
A language model generates a toxic response. Why? A model refuses a harmless request. Why? A model aces the bar exam but can't count to ten reliably. Why? These aren't rhetorical questions — they're the central challenge of interpretability: understanding what's actually happening inside neural networks.
Consider GPT-4. OpenAI has not published its size; unofficial estimates put it at roughly 1.8 trillion parameters spread across many layers. When it writes a poem, which parameters are doing the "creativity"? When it solves a math problem, which circuits implement the arithmetic? When it refuses a prompt, is that safety training or a bug? We genuinely don't know.
This is unlike any technology humanity has built before. We can open a car engine and trace every part. We can read the source code of a program. But a neural network? It's a massive matrix of floating-point numbers. The "source code" is the weights, and they're not human-readable.
Interactive demo: a prompt goes in, a response comes out. Opening the black box shows the layers, but even looking at the numbers tells you almost nothing.
The simulation above captures the frustration of interpretability research: even when you look inside the model, you see millions of numbers that don't obviously correspond to concepts. The goal of interpretability is to build tools that translate those numbers into human-understandable explanations.
Interpretability isn't just academic curiosity. It has urgent practical implications:
Safety. If we can't understand why a model produces harmful outputs, we can't reliably prevent them. Current safety techniques (RLHF, red-teaming) are behavioral — they shape outputs without understanding internals. Interpretability could enable mechanistic safety: understanding and modifying the actual circuits that produce harmful behavior.
Trust. Medicine, law, and finance need to know why a model made a decision, not just what the decision was. A doctor won't trust a diagnosis model that says "cancer" without explaining its reasoning in terms that map to medical knowledge.
Debugging. When a model fails, interpretability helps find the cause. Is it a training data problem? A representation problem? A specific circuit that computes the wrong thing? Without interpretability, debugging is trial and error.
The simplest interpretability technique: take a model's internal representations and ask "what information is encoded here?" This is probing, and it works by training a small classifier (the probe) on top of frozen model representations.
Here's the setup. You have a pretrained language model. You feed in a sentence like "The cat sat on the mat." At each layer, the model produces a hidden representation for each token — a vector of dimension d (e.g., 768 for BERT-base). The question is: what does that vector encode?
To find out, you train a linear probe: a simple linear classifier (one weight matrix + bias) that takes the hidden representation as input and predicts some linguistic property. For example:
| Property | Task | Labels |
|---|---|---|
| Part of speech | Classify each token's POS tag | noun, verb, adjective, ... |
| Dependency relation | Predict syntactic role | subject, object, modifier, ... |
| Named entity | Classify entity type | person, location, org, ... |
| Semantic role | Who did what to whom | agent, patient, instrument, ... |
| Coreference | Do two tokens refer to the same entity? | yes / no |
If the linear probe achieves high accuracy, the information must be linearly accessible in the representation — meaning the model has organized its internal space so that this property is easy to read out with a simple linear transformation.
$$\hat{y} = \mathrm{softmax}(W h + b)$$
where h is the hidden representation (shape [d_model]), W is the probe's weight matrix (shape [num_classes, d_model]), and b is the bias (shape [num_classes]). The probe is trained with cross-entropy loss, with model weights frozen.
Why restrict the probe to a linear classifier? Because a sufficiently powerful probe (e.g., a deep neural network) could compute the property itself rather than simply reading it from the representation. A 5-layer MLP probe with 95% accuracy doesn't tell you the representation encodes POS tags; it might only tell you the MLP learned to tag parts of speech from whatever token-identity information the representation preserves.
A linear probe is deliberately weak. If it achieves high accuracy, the information must already be organized linearly in the representation. The probe isn't computing anything complex — it's drawing a hyperplane through the representation space that separates the classes. The model did the hard work of arranging representations so that this hyperplane exists.
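The idea can be sketched end to end on synthetic data. The snippet below is a minimal illustration, not a real probing setup: the "hidden representations" are random vectors in which a binary property is planted along one hypothetical direction, and the probe is a plain logistic regression trained with SGD. All names and sizes (`D_MODEL`, `direction`, the noise level) are assumptions made for the example.

```python
import math
import random

random.seed(0)
D_MODEL = 8  # toy hidden size (real models use hundreds or thousands)

# Hypothetical "concept direction": the property is encoded along this unit vector.
direction = [random.gauss(0, 1) for _ in range(D_MODEL)]
norm = math.sqrt(sum(v * v for v in direction))
direction = [v / norm for v in direction]

def make_example():
    """Return (hidden_vector, label) with the label linearly encoded plus noise."""
    label = random.randint(0, 1)
    sign = 1.0 if label == 1 else -1.0
    h = [random.gauss(0, 1) + sign * 3.0 * u for u in direction]
    return h, label

train = [make_example() for _ in range(300)]
held_out = [make_example() for _ in range(200)]

# The probe: one weight vector plus a bias (binary logistic regression).
w = [0.0] * D_MODEL
b = 0.0

def predict_proba(h):
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

# Plain SGD on cross-entropy loss; the vectors h play the role of frozen activations.
for epoch in range(10):
    for h, y in train:
        err = predict_proba(h) - y  # gradient of the loss w.r.t. the logit
        for i in range(D_MODEL):
            w[i] -= 0.05 * err * h[i]
        b -= 0.05 * err

probe_accuracy = sum(
    (predict_proba(h) > 0.5) == (y == 1) for h, y in held_out
) / len(held_out)
print(f"probe accuracy: {probe_accuracy:.2f}")
```

Because the property was planted along a single direction, a hyperplane is enough to read it out. If the same property were encoded nonlinearly, this probe would stay near chance even though the information is present.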
The most revealing experiment: train separate probes at every layer of the model. This reveals where different types of information emerge.
Figure: accuracy of linear probes trained at each layer of a 12-layer transformer. Surface features (POS) peak early, syntactic features peak in middle layers, semantic features peak in later layers.
A consistent pattern emerges across many studies (Tenney et al. 2019, Hewitt & Manning 2019):
Layers 1-3 (early): Surface-level features peak. Part-of-speech tags, word identity, and morphological features are most accessible here. The model first represents what each token is.
Layers 4-8 (middle): Syntactic features peak. Dependency relations, constituent structure, and agreement patterns become most accessible. The model has figured out how tokens relate to each other structurally.
Layers 9-12 (late): Semantic features peak. Sentiment, coreference, semantic roles, and world knowledge are most accessible. The model has built an understanding of what the text means.
This creates a picture of transformers as information processing pipelines: raw text → morphology → syntax → semantics. Each layer refines the representation from surface-level toward meaning-level, much like the classical NLP pipeline but learned end-to-end.
```python
import torch
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Get hidden states at every layer
inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states is a tuple of 13 tensors
# Shape: [1, seq_len, 768] for each layer (0 = embedding, 1-12 = layers)
layer_6 = outputs.hidden_states[6]  # [1, 8, 768]

# Train a linear probe on layer 6 for POS tagging:
num_pos_tags = 17  # assumption: size of the Universal Dependencies UPOS tag set
probe = torch.nn.Linear(768, num_pos_tags)  # simple linear layer
# Freeze BERT, only train probe weights
# → 93% accuracy = POS info is linearly encoded in layer 6
```
Hewitt & Manning (2019) went beyond classifying individual tokens to probing for tree structure. They trained a probe to predict the distance between any two words in the dependency parse tree, directly from the model's representations.
The key insight: if there exists a linear transformation B such that the squared distance between transformed representations approximates the tree distance, then the parse tree is linearly encoded in the representation space.
$$\lVert B(h_i - h_j) \rVert_2^2 \approx d_{\text{tree}}(i, j)$$
where h_i and h_j are the hidden representations of words i and j, and B is a learned matrix (the structural probe). If this approximation is accurate, it means the model has organized its representation space so that syntactically related words are close together and syntactically distant words are far apart: the tree is embedded as a geometric structure in the representation.
The result was striking: BERT's representations encode parse trees with remarkable fidelity. The structural probe could reconstruct dependency trees with roughly 80% accuracy (measured as undirected unlabeled attachment score) from just the hidden representations, without any explicit parsing. The model has learned a geometric encoding of syntax as a byproduct of masked language modeling.
```python
import torch
import torch.nn as nn

class StructuralProbe(nn.Module):
    """Learns a linear transform B such that ||B*h_i - B*h_j||^2 ≈ tree_distance(i,j)"""

    def __init__(self, d_model=768, rank=64):
        super().__init__()
        self.B = nn.Parameter(torch.randn(rank, d_model))

    def forward(self, h_i, h_j):
        # h_i, h_j: [batch, d_model]
        diff = self.B @ (h_i - h_j).T  # [rank, batch]
        return (diff ** 2).sum(dim=0)  # [batch] = predicted tree distance
```
Attention weights are the most intuitive window into a transformer. Each attention head produces a matrix of weights: how much each token "attends to" every other token. Visualizing these weights creates attention heatmaps that sometimes reveal interpretable patterns.
In a sentence like "The cat sat on the mat because it was tired", which tokens does "it" attend to? If the model has learned coreference, "it" should attend strongly to "cat" (its referent). If we see this in the attention weights, it suggests the model is performing something like coreference resolution internally.
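To make "attends strongly" concrete, here is a minimal sketch of how one head turns a query and a set of keys into attention weights (scaled dot product followed by softmax). The sentence is shortened, and the 2-d key and query vectors are hand-picked, hypothetical values chosen so that "it" favors "cat"; a real head learns these projections.

```python
import math

tokens = ["The", "cat", "sat", "because", "it", "was", "tired"]

# Hypothetical 2-d key vectors, hand-picked for illustration only.
keys = {
    "The": [0.1, 0.0],
    "cat": [2.0, 0.5],
    "sat": [0.3, 1.0],
    "because": [0.0, 0.2],
    "it": [0.5, 0.5],
    "was": [0.1, 0.3],
    "tired": [0.4, 0.1],
}
query_it = [1.5, 0.2]  # hypothetical query vector for the token "it"

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = len(query_it)
scores = [
    sum(q * k for q, k in zip(query_it, keys[t])) / math.sqrt(d_k) for t in tokens
]
weights = softmax(scores)  # one row of the attention heatmap

for t, wt in zip(tokens, weights):
    print(f"{t:>8s}  {wt:.2f}")
top_token = tokens[weights.index(max(weights))]
```

Each row of an attention heatmap is one such `weights` vector: non-negative values that sum to 1 across the tokens being attended to.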
Interactive demo: an attention heatmap for a selected token, where brighter cells mean stronger attention. Different heads show positional, syntactic, and semantic attention patterns.
Clark et al. (2019) analyzed BERT's 144 attention heads (12 layers × 12 heads) and found recurring patterns:
Positional heads: Some heads attend primarily to the previous token, the next token, or the first token in the sequence. These implement a kind of local context window, similar to a convolution.
Syntactic heads: Certain heads align with dependency relations. In "The cat sat", a syntactic head on "sat" attends to "cat" (its subject). These heads appear to implement aspects of syntactic parsing.
Separator heads: Some heads attend primarily to [SEP] or [CLS] tokens. Clark et al. argued these act as "no-op" heads: when a token doesn't need to attend to anything meaningful, it dumps attention on a special token.
Coreference heads: Rare but fascinating: heads where pronouns attend to their antecedents. "It" attends to "cat" when "it" refers to the cat. These implement a form of coreference resolution.
Attention visualization is seductive but dangerous. The temptation is to interpret attention weights as "the model is looking at X because of Y." But this reasoning has serious flaws:
1. Attention is not explanation. Jain & Wallace (2019) showed that, in the models they studied, attention weights could often be permuted, or swapped for very different adversarial distributions, with little change in model output. If the model produces nearly the same answer with shuffled attention, the original attention pattern is weak evidence for "the reason" behind the output.
2. Multiple heads. A transformer has many heads per layer and many layers. Visualizing one head shows one tiny slice of the model's computation. The actual computation combines all heads through residual connections. Cherry-picking one head that shows an interpretable pattern ignores the 143 other heads.
3. Residual stream. In modern interpretability frameworks, the important object is not individual attention weights but the residual stream: the cumulative sum of all layer outputs. Each layer reads from and writes to this shared stream. Attention weights are just part of how one layer contributes.
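The residual stream point in the list above can be sketched in a few lines. This is a toy illustration with made-up stand-ins for attention and MLP blocks (`attention_block` and `mlp_block` are hypothetical functions, not real layers); what it shows is only the additive read/write structure.

```python
def attention_block(stream):
    # Hypothetical stand-in for an attention layer's output.
    return [0.1 * x for x in stream]

def mlp_block(stream):
    # Hypothetical stand-in for an MLP layer's output.
    return [0.5 if x > 0 else 0.0 for x in stream]

def run_with_residual_stream(embedding, blocks):
    """Each block reads the current stream and adds its output back in."""
    stream = list(embedding)
    contributions = []
    for block in blocks:
        out = block(stream)                            # read from the stream
        contributions.append(out)
        stream = [s + o for s, o in zip(stream, out)]  # write back (additive)
    return stream, contributions

embedding = [1.0, -2.0, 0.5]
final, contributions = run_with_residual_stream(
    embedding, [attention_block, mlp_block, attention_block]
)

# The final stream equals the embedding plus the sum of every block's output.
recomposed = [e + sum(c[i] for c in contributions) for i, e in enumerate(embedding)]
```

Because the final state is a sum of contributions, any single head's attention pattern describes only one term in that sum.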
Despite these limitations, attention visualization remains valuable as a hypothesis generation tool. When you see a suggestive pattern, you can test it with probing, ablation, or causal interventions. Attention shows you where to look; other methods tell you what you're looking at.
Probing asks "is property X encoded here?" but you have to know what to look for. What if you want to discover what the model has learned without presupposing the answer? This is the promise of Sparse Autoencoders (SAEs) — the most exciting recent development in interpretability.
The core problem: a single neuron in a language model typically doesn't correspond to a single concept. A neuron might activate for "dogs," "loyalty," "the number 7," and "sentences ending in periods" — all at once. This is called superposition: the model stores more concepts than it has neurons by encoding concepts as directions in activation space, not as individual neurons.
Think of it this way. Your model has 768 dimensions per token. But it might encode 10,000 distinct concepts. How? By using directions in the 768-dimensional space, not individual axes. The "dog" concept might be a direction at 45° between dimensions 17 and 342. The "loyalty" concept might be a different direction. Because the space is high-dimensional, thousands of nearly-orthogonal directions can coexist.
A Sparse Autoencoder learns to decompose a model's activations into a sum of interpretable features. It has two components:
Encoder: Maps the model's activation (dimension d) into a much larger feature space (dimension D >> d), then applies a sparsity constraint so that only a few features are active at any time.
Decoder: Maps from the sparse feature space back to the original dimension, reconstructing the activation.
The key constraint: sparsity. The L1 penalty on the hidden layer encourages only ~10-50 features to activate for any given input, out of potentially 100,000+ total features. This pushes each feature to be specific: instead of a neuron that fires for "dogs + loyalty + 7 + periods," the SAE learns separate features for each concept.
```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, n_features=65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, x):
        # Encode: project to high-dim, sparsify with ReLU
        h = torch.relu(self.encoder(x))  # [batch, 65536], mostly zeros
        # Decode: reconstruct original activation
        x_hat = self.decoder(h)  # [batch, 768]
        # Loss: reconstruction + sparsity
        recon_loss = (x - x_hat).pow(2).mean()
        sparsity_loss = h.abs().mean()  # L1 penalty
        return x_hat, recon_loss + 0.01 * sparsity_loss
```
When you train an SAE on a language model and examine what activates each feature, you find remarkably specific concepts:
| Feature | Activates On | Interpretation |
|---|---|---|
| #4721 | "Paris", "London", "Tokyo", "Berlin" | Capital cities |
| #892 | "if", "unless", "provided that", "assuming" | Conditional expressions |
| #15003 | "The Golden Gate Bridge", "bay", "fog", "San Francisco" | SF/Golden Gate concept |
| #33201 | Code indentation, "def", "class", "return" | Python code structure |
| #7890 | "dangerous", "risky", "harmful", "lethal" | Danger/harm concept |
The famous "Golden Gate Bridge feature" (Anthropic, 2024) became a public demonstration: artificially amplifying this single SAE feature caused Claude to compulsively mention the Golden Gate Bridge in its responses, regardless of the topic. This is strong evidence that the feature is not just correlated with the concept but causally influences the model's output.
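Amplifying a feature can be sketched, in simplified form, as adding a multiple of that feature's decoder direction to the activation. The snippet below is an assumption-laden toy, not the procedure used in the published experiment: the 4-d activation, the `bridge_direction` vector, and the `strength` value are all hypothetical.

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer(activation, feature_direction, strength):
    """Add `strength` units of a feature's decoder direction to an activation."""
    return [a + strength * f for a, f in zip(activation, feature_direction)]

# Hypothetical 4-d activation and decoder direction for one SAE feature.
activation = [0.2, -0.1, 0.4, 0.0]
bridge_direction = unit([1.0, 0.0, 1.0, 0.0])

before = dot(activation, bridge_direction)
steered = steer(activation, bridge_direction, strength=5.0)
after = dot(steered, bridge_direction)
print(f"feature readout before: {before:.2f}, after: {after:.2f}")
```

The steered activation is then passed on to the rest of the model in place of the original, and any change in the generated text is attributed to the feature.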
Interactive demo: a model activation (768 dims) is decomposed into ~5 active SAE features (out of 65K). Each feature represents a specific concept and contributes to the reconstructed activation.
Why do models use superposition at all? Elhage et al. (2022) showed that superposition is computationally efficient. If a model needs to represent 10,000 concepts but has only 768 dimensions, it has two options:
Option A: Dedicated neurons. Assign each concept to one neuron. Only 768 concepts can be represented. This is clean but severely capacity-limited.
Option B: Superposition. Represent each concept as a direction in 768-dimensional space. In high dimensions, random directions are nearly orthogonal: the cosine similarity between two random 768-dimensional vectors averages zero, with a typical magnitude of about 1/√768 ≈ 0.036. So you can pack ~10,000 nearly-orthogonal directions into 768 dimensions, with only small interference between them.
The math, as a rough heuristic: in d dimensions you can fit on the order of d² directions whose pairwise cosine similarities stay around 1/√d. For d=768, that's ~590,000 directions, and tolerating slightly more interference allows far more. Models have far more capacity for concepts than their raw dimensionality suggests.
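The near-orthogonality of random directions is easy to check numerically. The sketch below samples pairs of random unit vectors and reports the root-mean-square cosine similarity, which should land close to 1/√d. The sample size and seed are arbitrary choices.

```python
import math
import random

random.seed(0)

def random_unit_vector(d):
    v = [random.gauss(0, 1) for _ in range(d)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def rms_cosine(d, n_pairs=300):
    """Root-mean-square cosine similarity between random direction pairs."""
    total = 0.0
    for _ in range(n_pairs):
        a, b = random_unit_vector(d), random_unit_vector(d)
        total += sum(x * y for x, y in zip(a, b)) ** 2
    return math.sqrt(total / n_pairs)

d = 768
rms = rms_cosine(d)
print(f"RMS cosine similarity at d={d}: {rms:.3f}")
print(f"1/sqrt(d) = {1 / math.sqrt(d):.3f}")
print(f"d^2 = {d * d}")
```

Rerunning with a small `d` (say 8) gives much larger similarities, which is why superposition only pays off in high-dimensional spaces.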
But there's a cost: interference. When concept A is active, it slightly activates the neurons for concepts B, C, D that happen to have partially overlapping directions. The SAE separates these by projecting into a much higher-dimensional space (65,536 dimensions) where each concept gets its own dedicated feature with near-zero interference.
The SAE approach has been validated at increasing scales:
| Study | Model | SAE Size | Key Finding |
|---|---|---|---|
| Cunningham et al. 2023 | Pythia-70M and Pythia-410M | 32K features | Found interpretable features for syntax, semantics, formatting |
| Bricken et al. 2023 | 1-layer transformer | 4K features | Dictionary features are far more monosemantic than individual neurons |
| Templeton et al. 2024 | Claude 3 Sonnet | 34M features | Scaled to a production model; found the Golden Gate Bridge feature and safety-relevant features (deception, sycophancy) |
| Gao et al. 2024 (OpenAI) | GPT-4 | 16M features | Showed SAE training scales to a frontier model with predictable scaling behavior |
The trend is clear: SAEs scale to frontier models and find features that are both interpretable and causally relevant. Manipulating a single SAE feature can change model behavior in predictable ways, confirming that these features aren't just correlational artifacts.
SAEs aren't perfect. Key limitations include:
Reconstruction fidelity. SAEs don't perfectly reconstruct the original activations. The gap (typically 2-5% reconstruction error) means some information is lost in the decomposition. The features we find are a good approximation of the model's internal representation, not an exact description.
Feature splitting. Sometimes one concept splits across multiple SAE features. "Dogs" might activate features for "canine," "pet," and "furry animal." It's not always clear whether these are genuinely distinct concepts or artifacts of the SAE training.
Scale. Training SAEs on the largest models requires enormous compute. An SAE for GPT-4 with 100K features per layer, across all layers, would have billions of parameters itself.
Polysemy. Some features are polysemous: a single SAE feature fires on multiple unrelated concepts. This could mean the SAE hasn't fully decomposed superposition, or it could mean the model genuinely treats these concepts as related (even if humans wouldn't).
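The reconstruction gap in the list above is commonly summarized as the fraction of variance the SAE fails to explain. The following is a minimal sketch of that metric on made-up 2-d data; the function name and the toy vectors are assumptions for illustration.

```python
def fraction_of_variance_unexplained(originals, reconstructions):
    """Squared reconstruction error divided by the variance of the originals."""
    n, d = len(originals), len(originals[0])
    mean = [sum(x[j] for x in originals) / n for j in range(d)]
    residual = sum(
        (x[j] - r[j]) ** 2
        for x, r in zip(originals, reconstructions)
        for j in range(d)
    )
    variance = sum((x[j] - mean[j]) ** 2 for x in originals for j in range(d))
    return residual / variance

# Toy activations and their (imperfect) reconstructions.
originals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
reconstructions = [[0.9, 0.0], [0.0, 0.9], [1.0, 1.0], [0.1, 0.0]]
fvu = fraction_of_variance_unexplained(originals, reconstructions)
print(f"fraction of variance unexplained: {fvu:.3f}")
```

A value of 0 means perfect reconstruction; a value of 1 means the SAE does no better than predicting the mean activation.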
SAEs produce thousands of features. But who labels them? The traditional approach is manual: a researcher examines the top-activating examples for each feature and writes a description. "Feature #4721 activates on capital cities." This doesn't scale. A single SAE might have 65,000+ features. Manually labeling all of them would take years.
The solution is recursive: use AI to interpret AI. This is agentic interpretability, building on automated-labeling work at OpenAI (Bills et al. 2023) and Anthropic (Bricken et al. 2023). An AI agent examines each SAE feature, proposes a label, tests its hypothesis against held-out examples, and refines the description iteratively.
Interactive demo: an AI agent interprets SAE features step by step, moving through observation, hypothesis, testing, and refinement.
AI agents for interpretability have several advantages over human labelers:
Scale. An agent can label 65,000 features in hours. A human team would need months.
Consistency. The agent applies the same labeling criteria to every feature. Humans get tired, distracted, and inconsistent.
Active testing. The agent can generate arbitrary test inputs to probe the feature boundary. A human is limited to the examples in the dataset.
Iteration. The agent can refine its hypothesis through multiple rounds of testing. Each round sharpens the description. This is the scientific method automated: observe, hypothesize, test, refine.
How do you evaluate whether the agent's labels are good? Two approaches:
Simulation accuracy. Given the agent's description and a new input, can the agent predict whether the feature will activate? If the description is "capital cities" and the input is "Paris," the agent should predict high activation. If the input is "pizza," low activation. The fraction of correct predictions is the simulation accuracy.
Human agreement. Show human raters the agent's labels alongside the top-activating examples. Do humans agree the label is accurate? Studies show 80-90% agreement rates for well-trained agents.
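Simulation accuracy, as described above, reduces to comparing predicted on/off states against measured activations. In the sketch below a keyword match stands in for the LLM that would normally simulate the feature from its description; the keyword list, example texts, and activation values are all hypothetical.

```python
def simulate_from_description(keywords, text):
    """Hypothetical stand-in for an LLM simulator: predict 'active' if the
    text mentions any keyword derived from the feature description."""
    return any(k in text.lower() for k in keywords)

def simulation_accuracy(keywords, examples, threshold=0.5):
    """Fraction of examples where the predicted on/off state matches the
    feature's measured activation after thresholding."""
    correct = 0
    for text, activation in examples:
        predicted = simulate_from_description(keywords, text)
        actual = activation > threshold
        correct += predicted == actual
    return correct / len(examples)

# Hypothetical held-out examples: (text, measured feature activation).
examples = [
    ("Paris is lovely in spring", 0.92),
    ("I ordered a pizza", 0.03),
    ("Tokyo has great transit", 0.88),
    ("Berlin winters are cold", 0.75),
    ("She moved to Canberra", 0.81),  # a capital the keyword list misses
    ("The cat sat on the mat", 0.01),
]
capital_keywords = ["paris", "london", "tokyo", "berlin"]
accuracy = simulation_accuracy(capital_keywords, examples)
print(f"simulation accuracy: {accuracy:.2f}")
```

The miss on "Canberra" is the kind of error that would prompt the agent to broaden its description in the next refinement round.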
The showcase of this lesson: an interactive explorer that simulates how interpretability tools discover concepts inside a neural network. You'll see how probes, attention, and SAE features combine to build a picture of what a model has learned.
McGrath et al. (2022) applied concept probes (sparse linear regression on internal activations), along with matrix factorization, to AlphaZero's chess network and found that it had independently learned human chess concepts: material balance, king safety, pawn structure, piece activity. Nobody programmed these concepts; AlphaZero learned them from self-play alone. The probes revealed that the network represents concepts remarkably similar to those grandmasters use to describe their thinking.
Interactive demo: a simulated model's internal features, browsable by layer, showing which concepts each feature encodes and which features respond to different inputs at a given activation threshold.
The explorer simulates three layers of interpretability analysis:
Feature activations (top): Each bar represents an SAE feature. Taller bars = stronger activation for the current input. Most features are near zero (sparsity); the few active ones correspond to concepts relevant to the input.
Concept clusters (middle): Features that co-activate on similar inputs form clusters. "Paris is beautiful" activates the geography cluster, the aesthetics cluster, and the European culture cluster. "def main():" activates the programming cluster and the Python-specific cluster.
Layer progression (bottom): As you change layers, the active features shift from surface-level (character patterns, token identity) to semantic-level (meaning, intent, safety). This mirrors what probing studies find: early layers = surface features, late layers = semantics.
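Individual features are interesting, but the real prize is circuits: connected pathways of features across layers that implement specific behaviors. A circuit might look like a short chain, for example a previous-token head in one layer feeding an induction head in the next.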
Finding these circuits is the frontier of interpretability research. If we can map all the circuits in a language model, we would have a complete understanding of how it processes language — a full reverse-engineering of the neural network.
Currently, circuit discovery is painstaking. Researchers trace connections manually, looking at which features in layer L+1 are influenced by features in layer L through the attention and MLP pathways. This is where agentic methods become essential: AI agents can explore the circuit graph far faster than humans, proposing and testing hypotheses about inter-feature connections.
Despite the difficulty, several circuits have been fully mapped in smaller models:
| Circuit | Model | What It Does | Discovered By |
|---|---|---|---|
| Induction heads | Small attention-only transformers | Copy patterns: if "A B...A" appears, predict "B" next | Olsson et al. 2022 |
| Indirect Object ID | GPT-2 Small | In "When Mary and John went to the store, John gave a drink to _", predict "Mary" | Wang et al. 2022 |
| Greater-than | GPT-2 Small | In "The war lasted from the year 1732 to the year 17_", favor end years greater than 32 | Hanna et al. 2023 |
| Negation | Small transformers | Flip sentiment from positive to negative when "not" appears | Various 2023-2024 |
The induction head circuit is particularly important because it appears to underlie a fundamental capability: in-context learning. When the model sees "Harry Potter...Harry" and predicts "Potter," it's using a two-step circuit. In layer 1, a "previous token head" copies information about what token preceded each position, so the position holding "Potter" records that "Harry" came just before it. In layer 2, an "induction head" looks for the pattern "I've seen this token before" and copies the next token from the previous occurrence. This two-layer circuit is one of the simplest building blocks of in-context learning.
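The behavior this circuit implements can be written as a plain lookup rule. The function below is a sketch of the input-output behavior only, not of the attention mechanics that realize it; `induction_predict` is a name invented for this example.

```python
def induction_predict(tokens):
    """Predict the next token by the induction rule: find the most recent
    earlier occurrence of the current (last) token and copy what followed it."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions (excluding the last one).
    for pos in range(len(tokens) - 2, -1, -1):
        if tokens[pos] == current:
            return tokens[pos + 1]
    return None  # no earlier occurrence, so the rule has nothing to copy

prediction = induction_predict(["Harry", "Potter", "went", "to", "see", "Harry"])
print(prediction)  # Potter
```

In the transformer, the backward scan corresponds to the induction head's attention pattern, and the copy corresponds to what that head writes into the residual stream.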
Finding these circuits required hundreds of hours of manual analysis. The agentic approach (Ch 4) promises to accelerate this dramatically by having AI agents propose and test circuit hypotheses automatically.
Interpretability research is revealing something profound: language models develop internal representations that don't map neatly onto existing human vocabulary. We may need new words to describe what AI systems are doing.
Consider this: when a language model processes "The trophy doesn't fit in the suitcase because it is too big," it must determine that "it" refers to "trophy." Humans call this "coreference resolution." But what the model actually does may be fundamentally different from what humans do. The model doesn't have a concept of physical objects or spatial reasoning in the way we do. It has learned statistical patterns about pronoun reference from text alone.
Should we say the model "understands" coreference? "Computes" coreference? "Simulates" coreference? Each word carries different implications about what's happening inside. The choice of vocabulary shapes our thinking about AI capabilities and limitations.
When we describe AI systems, we inevitably use analogies from human cognition: the model "knows," "thinks," "reasons," "understands." But these analogies can be misleading. A model doesn't "know" Paris is in France the way you know it — it has a statistical association between "Paris" and "France" that produces the right answer in many contexts but can fail in unexpected ways.
Interactive demo: human concepts vs. what the model might actually be doing, showing the gap between each analogy and the mechanism.
Several researchers have proposed vocabulary specifically for AI phenomena that don't have good human analogues:
| New Term | What It Describes | Why Existing Terms Fail |
|---|---|---|
| Superposition | Encoding more features than dimensions | "Memory" implies retrieval; superposition is simultaneous encoding |
| Grokking | Sudden generalization long after memorization | "Learning" implies gradual; grokking is abrupt |
| In-context learning | Adapting behavior from the prompt alone | "Learning" usually implies weight updates; ICL has none |
| Sycophancy | Agreeing with the user regardless of truth | "Lying" implies intent; sycophancy is a learned pattern |
| Confabulation | Generating plausible but false information | "Hallucination" implies perception; models don't perceive |
Even "hallucination" — the most common term for model fabrications — is an analogy from human psychology that may be misleading. Humans hallucinate when their perceptual system generates false inputs. Models don't have perception. They generate text that is locally coherent but factually wrong. "Confabulation" (from neuropsychology: filling in gaps in memory with plausible fabrications) might be more accurate.
Rather than a binary "understands vs doesn't understand," interpretability research suggests a spectrum of computational competence:
| Level | What the Model Does | Human Analogy | Evidence |
|---|---|---|---|
| 0. Pattern matching | Copies surface statistics from training data | Parrot repeating phrases | Fails on novel phrasing of known facts |
| 1. Soft rules | Has learned approximate regularities | Child who says "goed" instead of "went" | Overgeneralizes but captures the pattern |
| 2. Compositional | Combines known pieces to handle novel inputs | Student applying formula to new problem | Succeeds on novel combinations of known components |
| 3. Abstract | Has learned the underlying structure | Expert who can explain the "why" | Robust to distribution shift; transfers to new domains |
Most current models operate somewhere between levels 1 and 2 for most tasks. They've learned soft rules that work most of the time but break in predictable ways. For a few narrow tasks (like syntactic agreement), they may reach level 3. For others (like causal reasoning), they rarely exceed level 1.
Interpretability tools give us the evidence to make these distinctions. Instead of arguing about whether models "understand," we can ask the precise question: "Does the model have an internal representation that corresponds to concept X, and does it use that representation causally in producing output Y?" This is empirically testable with SAEs and causal interventions.
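The difference between "has a representation" and "uses it causally" can be illustrated with an ablation on a tiny hand-built network. Everything here is hypothetical (the feature names, weights, and input); the point is only the shape of the test: zero out one internal feature and see whether the output moves.

```python
def forward(x, ablate=None):
    """Tiny hand-built network: two hidden features feed one output.
    `ablate` optionally names a hidden feature to zero out."""
    hidden = {
        "danger": max(0.0, 2.0 * x[0] - 1.0 * x[1]),  # feature used by the output
        "length": max(0.0, 0.5 * x[2]),               # encoded but unused
    }
    if ablate is not None:
        hidden[ablate] = 0.0
    # Only the "danger" feature has a nonzero output weight.
    return 3.0 * hidden["danger"] + 0.0 * hidden["length"]

x = [1.0, 0.5, 4.0]
baseline = forward(x)
effect_danger = baseline - forward(x, ablate="danger")
effect_length = baseline - forward(x, ablate="length")
print(f"ablating 'danger' changes the output by {effect_danger}")
print(f"ablating 'length' changes the output by {effect_length}")
```

A probe could decode both features from the hidden layer, yet only one of them matters for the output. That gap is what causal interventions measure and probing alone cannot.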
The vocabulary problem isn't just philosophical — it has practical implications for AI safety. If we say a model "knows" something is dangerous, we might assume it will reliably avoid that thing. But if the model merely has a statistical association between "dangerous" and "avoid," that association can be overridden by conflicting patterns. The word "knows" creates false confidence in the model's reliability.
More precise vocabulary leads to more precise safety claims. Instead of "the model knows not to generate harmful content," we should say "the model has a feature (SAE #7890) that activates on danger-related inputs and, through circuit [X→Y→Z], suppresses generation of certain token sequences. This suppression can be bypassed by inputs that don't trigger feature #7890." This is ugly but accurate — and accuracy matters when deploying systems that affect millions of people.
Interpretability is evolving from "academic curiosity" to "essential infrastructure." As models become more capable and more widely deployed, understanding their internals becomes both more important and more feasible.
| Paper | Contribution | Connection |
|---|---|---|
| Agentic Interpretability (Anthropic 2025) | Using AI agents to automatically interpret SAE features | Core of Ch 4 — scalable feature labeling |
| Acquisition of Chess Knowledge in AlphaZero (McGrath 2022) | Concept probes find human chess concepts in a self-play agent | Concept discovery beyond language (Ch 5) |
| New Vocabulary for AI (Butlin 2024) | Argues we need new terms for AI phenomena | Vocabulary problem (Ch 6) |
| Neologism Learning in LLMs (Kim 2025) | How models acquire new words from context | Models create internal "neologisms" too (Ch 6) |
| Lecture | Relationship |
|---|---|
| L05: Transformers | Attention visualization (Ch 2) only makes sense if you understand how attention works. Probing (Ch 1) examines the representations that transformers produce at each layer. |
| L13: Reasoning Part 2 | Process Reward Models verify reasoning steps. SAEs could reveal what the PRM has learned about "correct" vs "incorrect" reasoning — connecting interpretability to reliability. |
| L14: Tokenization | SAE features often correspond to token-level patterns (specific subword tokens, token positions). Understanding tokenization helps interpret what features are detecting at the input level. |
| Era | Method | Insight |
|---|---|---|
| 2018-2019 | Linear probing (Tenney, Hewitt) | Transformers encode linguistic hierarchy across layers |
| 2019-2020 | Attention analysis (Clark, Vig) | Some heads implement specific syntactic operations |
| 2021-2022 | Circuit discovery (Elhage, Olsson) | Induction heads implement in-context learning |
| 2022-2023 | Superposition theory (Elhage et al.) | Models pack more features than dimensions |
| 2023-2024 | Sparse autoencoders at scale (Bricken, Templeton) | Individual features can be isolated and manipulated |
| 2024-2025 | Agentic interpretability (Anthropic) | AI can interpret AI at scale; 34M features labeled |
How the interpretability methods we covered relate to each other, from simple (probing) to comprehensive (agentic interpretability).
| Method | What It Reveals | Limitation |
|---|---|---|
| Linear Probing | What information is linearly encoded at each layer | Correlational: accessible ≠ used |
| Attention Viz | Token-level attention patterns per head | Correlational: attention ≠ explanation |
| SAEs | Sparse, interpretable features the model has learned | Reconstruction isn't perfect; features may split |
| Agentic Interp. | Automated labels for thousands of features | Agent accuracy depends on the interpreting model |
| Circuit Discovery | Connected pathways implementing specific behaviors | Extremely labor-intensive; early stages |
Interpretability is one of the fastest-growing areas of AI research. Several trends will define the next few years:
1. Automated interpretability at scale. Agentic methods will enable comprehensive auditing of frontier models. Instead of studying a few features by hand, we'll have automated pipelines that catalog every feature and circuit, producing "interpretability reports" for each model release.
2. Interpretability-informed training. If we can identify the features responsible for harmful behavior, we can penalize those features during training, creating models that are safe by construction rather than by post-hoc patching.
3. Runtime monitoring. SAE features could be monitored during inference, flagging when danger-related or deception-related features activate. This would create an interpretability-based safety layer that complements existing guardrails.
4. Cross-model comparison. As SAE methods mature, we'll be able to compare the internal representations of different models: does Claude have the same "capital cities" feature as GPT-4? Do they implement arithmetic through the same circuit? This could reveal universal computational motifs in neural networks.
5. Connecting to neuroscience. The methods developed for AI interpretability (probing, feature decomposition, circuit tracing) are increasingly being applied to brain imaging data. Conversely, neuroscience concepts (superposition, sparse coding, hierarchical processing) inspire interpretability research. The cross-pollination is accelerating.
6. Interpretability-based model editing. If we can identify the circuit responsible for a specific behavior, we can surgically edit it — changing the model's behavior on one task without affecting others. This is far more precise than retraining or fine-tuning. Early work on "activation patching" and "causal tracing" demonstrates this is feasible for simple behaviors. Extending it to complex behaviors at scale is the challenge.
7. Standardized interpretability benchmarks. The field currently lacks standardized benchmarks. How do you measure whether an interpretability method "works"? Proposals include simulation accuracy (can the method predict model behavior?), causal faithfulness (do the identified features actually drive behavior?), and human agreement (do humans find the explanations meaningful?). Establishing these benchmarks will be crucial for comparing methods rigorously.
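Of the trends above, runtime monitoring is the simplest to sketch in code. The snippet below assumes SAE feature activations are already available per token as a dictionary; the feature ids, labels, and threshold are hypothetical placeholders, and a production monitor would need calibrated thresholds per feature.

```python
def flag_features(activations, watchlist, threshold=0.5):
    """Return the watched features whose activation exceeds the threshold.

    activations: dict mapping feature id -> activation for the current token
    watchlist:   dict mapping feature id -> human-readable label
    """
    return {
        fid: (watchlist[fid], act)
        for fid, act in activations.items()
        if fid in watchlist and act > threshold
    }

# Hypothetical feature ids and labels.
watchlist = {7890: "danger/harm", 4242: "deception"}
token_activations = {4721: 0.9, 7890: 0.8, 4242: 0.1}

flags = flag_features(token_activations, watchlist)
print(flags)
```

Feature 4721 is highly active but not on the watchlist, and feature 4242 is watched but below threshold, so only feature 7890 is flagged.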