Been Kim, John Hewitt, et al. (Google DeepMind), 2025

Agentic Interpretability

Because we have LLMs, we Can and Should Pursue Agentic Interpretability — use LLM agents to automate interpretability research itself: hypothesis generation, feature analysis, circuit discovery at scale.

Prerequisites: What a neural network feature/neuron is + Basic LLM concepts. That's it.
8 chapters · 8+ simulations · 0 assumed knowledge

Chapter 0: The Scale Problem

You're a neuroscientist studying the brain. Your subject has 86 billion neurons. You can record from one neuron at a time. At one neuron per minute, it would take you 163,000 years to look at each one. You'd never finish. You'd never even begin to understand the circuits.
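The 163,000-year figure is plain unit conversion. A quick check, assuming nonstop recording and 365-day years (both assumptions are mine, chosen to match the figure quoted above):

```python
# Back-of-the-envelope check of the "163,000 years" figure.
# Assumptions: one neuron per minute, recording around the clock, 365-day years.
NEURONS = 86_000_000_000
MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600 minutes

years = NEURONS / MINUTES_PER_YEAR
print(f"{years:,.0f} years to visit every neuron once")
```

Any realistic schedule (sleep, weekends, more than one minute per neuron) only makes the number larger.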

This is much the situation in mechanistic interpretability. GPT-4 is estimated to have over a trillion parameters, which may encode a hundred million or more features. Researchers have been studying individual neurons, attention heads, and circuits by hand: painstakingly activating neurons, examining which inputs light them up, and forming hypotheses about their function. The results are beautiful: we've found "induction heads" that copy patterns, features that detect sentiment, and circuits that perform arithmetic.

But hand-analysis doesn't scale. At the current rate of human interpretability research, we will never understand modern models before even larger ones are deployed.

| Model | Parameters | Estimated features | Time to hand-analyze |
|---|---|---|---|
| GPT-2 Small | 117M | ~50K | ~50 researcher-years |
| GPT-2 XL | 1.5B | ~500K | ~500 researcher-years |
| LLaMA-2 70B | 70B | ~10M | ~10,000 researcher-years |
| GPT-4 | ~1.7T (est.) | ~100M+ | ~100,000+ researcher-years |

The core proposal: Use LLM agents to do interpretability research at machine speed. Instead of a human looking at neuron activations and forming hypotheses, an LLM agent examines activations, proposes hypotheses, designs experiments to test them, runs those experiments, and reports findings. The same capabilities that make LLMs useful for code generation and analysis make them potentially useful for understanding LLMs themselves: interpretability becomes self-referential.

Think of it like using microscopes to study microscopes. The tool becomes both the subject and the instrument of investigation. Kim et al. argue this isn't circular — it's practical. We don't need to fully understand LLMs to use them productively for interpretability research, just as we don't need to fully understand the brain to use it for neuroscience.

[Interactive simulation: Scale of the Interpretability Challenge. A slider shows how model size has outpaced human interpretability capacity; the gap between model complexity and our ability to understand it grows exponentially.]
Why is hand-analysis insufficient for understanding modern LLMs?

Chapter 1: What Is Interpretability?

Before we can automate interpretability research, we need to be precise about what it means. Kim et al. define interpretability along three dimensions, each of which can be partially automated.

Three levels of understanding

Feature-level: What does a single neuron, attention head, or SAE (sparse autoencoder) feature respond to? Example: "This feature activates for mentions of the Golden Gate Bridge." This is the most basic level — labeling what individual components do.

Circuit-level: How do multiple features connect to implement a computation? Example: "These three attention heads form an induction circuit that copies repeated patterns." This is mechanistic — it explains how the model computes, not just what individual parts respond to.

Behavioral-level: How does the model behave on specific inputs, and why? Example: "The model is biased toward associating 'doctor' with 'male' because feature X in layer 12 encodes gender stereotypes." This connects internal mechanisms to observable behavior.

The interpretability stack. These three levels form a stack: features → circuits → behavior. Understanding one level enables understanding the next. Automated feature labeling (Level 1) is a prerequisite for automated circuit discovery (Level 2), which enables automated behavioral explanations (Level 3). Kim et al. argue that LLM agents are mature enough to automate Level 1 well and Level 2 partially.

The scientific method for interpretability

Good interpretability research follows the scientific method:

1. Observe: Look at neuron activations on many inputs. What patterns emerge?
2. Hypothesize: "This neuron fires for base-10 numbers." Form a testable prediction.
3. Experiment: Design inputs that test the hypothesis. Does the neuron fire for "42" but not "forty-two"?
4. Refine: Update the hypothesis based on results. "This neuron fires for multi-digit integers, not single digits."
↻ Iterate until the hypothesis is precise and verified.

Each of these steps involves pattern recognition, language understanding, and reasoning — exactly the capabilities LLMs excel at. The insight is that an LLM can perform this entire loop automatically, faster than a human researcher.

python
# The interpretability research loop — done by hand vs by agent

# HUMAN RESEARCHER (days to weeks per feature):
# 1. Record activations on 1000 inputs
# 2. Manually inspect top-activating inputs
# 3. Think: "these all seem to be about cooking..."
# 4. Generate test cases: "bake cake" vs "drive car"
# 5. Verify: feature activates for cooking, not driving
# 6. Publish: "Feature 3427 = cooking concept"

# LLM AGENT (minutes per feature):
# 1. Same activation recording (automated)
# 2. LLM reads top-activating texts, produces hypothesis
# 3. LLM generates diverse test cases
# 4. Tests are run automatically
# 5. LLM evaluates results, refines hypothesis
# 6. Outputs: "Feature 3427 = cooking-related verbs"

[Interactive simulation: Interpretability Levels. Each level shows the questions it answers and how LLM agents can automate it; higher levels build on lower ones.]

What are the three levels of interpretability, and why do they form a stack?

Chapter 2: LLMs as Research Tools

What makes LLMs suitable for interpretability research? Kim et al. identify four capabilities that align with the demands of interpretability work.

Capability 1: Pattern recognition in text

Given the 20 texts that most strongly activate a neuron, a human researcher reads them and identifies commonalities. An LLM can do the same, and it may be better at spotting subtle patterns across many examples.

python
# Example: auto-labeling with an LLM
# `llm` is assumed to be any client object exposing generate(prompt) -> str;
# it is a placeholder, not a specific library's API.
top_activating_texts = [
    "She baked a chocolate cake for her birthday",
    "The chef prepared a soufflé in the kitchen",
    "He grilled the salmon to perfection",
    "Mix the flour and eggs until smooth",
    "The restaurant served excellent pasta",
]

# Join the examples outside the f-string to keep the prompt readable
examples = "\n".join(top_activating_texts)

prompt = f"""You are an interpretability researcher. Below are the top-{len(top_activating_texts)}
texts that maximally activate a specific neuron in a language model.

{examples}

What concept does this neuron likely encode? Be specific and testable."""

label = llm.generate(prompt)
# Example of the kind of label this might produce:
# "This neuron encodes cooking/food-preparation verbs and
#  food-related nouns, particularly in the context of
#  preparing or serving meals."

Capability 2: Hypothesis generation

After forming an initial label, the LLM can generate testable hypotheses: "If this neuron encodes cooking verbs, it should NOT activate for 'The chef drove to the store' (no cooking verb) and SHOULD activate for 'I sautéed the onions' (cooking verb in novel context)."

Capability 3: Experiment design

The LLM can design targeted experiments — minimal pairs, ablation studies, counterfactual inputs — that would take a human researcher hours to construct.
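A minimal pair is just two near-identical inputs that the hypothesis says should land on opposite sides of the activation threshold. Here is a small sketch of how such pairs might be represented and checked. The `MinimalPair` class, the `toy_feature` stand-in, and the 0.5 threshold are all illustrative assumptions; in practice the feature function would read activations from the target model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MinimalPair:
    positive: str  # input the hypothesis says SHOULD activate the feature
    negative: str  # near-identical input that should NOT activate it


def toy_feature(text: str) -> float:
    """Stand-in for a real feature: fires on a few cooking verbs (assumption)."""
    cooking_verbs = {"baked", "grilled", "sautéed", "mix"}
    words = {w.strip(".,").lower() for w in text.split()}
    return 1.0 if words & cooking_verbs else 0.0


def pair_passes(pair: MinimalPair, feature: Callable[[str], float],
                threshold: float = 0.5) -> bool:
    """True if the feature fires on the positive input and stays quiet on the negative."""
    return feature(pair.positive) > threshold and feature(pair.negative) <= threshold


pairs: List[MinimalPair] = [
    MinimalPair("I sautéed the onions", "I bought the onions"),
    MinimalPair("The chef grilled the salmon", "The chef drove to the store"),
]
results = [pair_passes(p, toy_feature) for p in pairs]
print(results)
```

A pair that fails is the useful outcome: it tells the agent where the hypothesis needs refining.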

Capability 4: Result interpretation

After running experiments, the LLM can interpret the results, update its hypothesis, and iterate. This closes the loop, so that in principle the whole research cycle can run without a human in it.

The key philosophical point. Using an LLM to study itself might seem circular. But Kim et al. argue it's no more circular than using a microscope (made of atoms) to study atoms, or using the brain to study the brain. The LLM agent doesn't need to understand itself — it just needs to be good at pattern recognition, hypothesis generation, and language understanding. These are capabilities, not self-knowledge.

Example: full agent session

Let's trace a complete agent session on a real feature. The agent is analyzing Feature 4519 from a GPT-2 sparse autoencoder:

agent session trace
# === Agent Session: Feature 4519 ===
# Top-5 activating texts:
# 1. "The derivative of x² is 2x" (activation: 0.94)
# 2. "Solve for y in the equation 3y + 7 = 22" (0.91)
# 3. "The integral converges for |x| < 1" (0.89)
# 4. "Let f(x) = sin(x) + cos(x)" (0.87)
# 5. "Calculate the eigenvalues of matrix A" (0.85)
#
# Agent Hypothesis (round 1):
# "Mathematical expressions with variables"
# Simulation score: 0.72
#
# Agent tests: "The price was $42" → activates? YES (unexpected!)
# Agent tests: "Two plus two equals four" → activates? NO (expected)
# Agent tests: "The velocity v = d/t" → activates? YES (expected)
#
# Agent Hypothesis (round 2, refined):
# "Mathematical notation with symbolic variables (not spelled out numbers)"
# Simulation score: 0.86
#
# Agent tests: "f(x) = x" → YES ✓ | "the function of x" → NO ✓
# Agent tests: "$100" → YES (symbols count!) | "one hundred" → NO ✓
#
# Agent Hypothesis (round 3, final):
# "Symbolic mathematical notation: equations, formulas, and expressions
#  using mathematical symbols (=, +, ², ∫) and single-letter variables
#  (x, y, f, n), including currency symbols ($, €)"
# Simulation score: 0.93 → CONVERGED

Three rounds, two refinements. Round 1 caught the general pattern but missed its symbolic nature. Round 2 narrowed it to "symbolic variables." Round 3 added the surprising finding about currency symbols (which are also mathematical notation in a broad sense). A human researcher would likely follow the same path, but the agent did it in minutes instead of hours.

Prior work: MILAN, neuron descriptions

MILAN (Hernandez et al., 2022) generated natural language descriptions of individual neurons in vision models. OpenAI's "Language models can explain neurons in language models" (Bills et al., 2023) was a precursor on the language side. They used GPT-4 to label every neuron in GPT-2, producing natural language descriptions of what each neuron responds to. Kim et al. extend this from passive labeling to active research: not just describing features, but discovering circuits, testing hypotheses, and proposing new interpretability methods.

[Interactive simulation: LLM as Interpretability Researcher. An LLM agent examines neuron activations, forms a hypothesis, generates test cases, and refines its label. This is the core agentic loop.]

Why isn't using an LLM to interpret itself circular reasoning?

Chapter 3: The Agentic Loop

Kim et al. propose a specific architecture for interpretability agents. It's a loop — the agent iterates between observation, hypothesis, experiment, and refinement until it converges on a precise explanation.

The four-phase loop

Phase 1: Observe. Collect activations of the target feature on a diverse dataset. Record top-activating and bottom-activating examples.
Phase 2: Hypothesize. The LLM examines examples and proposes a natural language description: "This feature activates for subordinate clauses starting with 'because'."
Phase 3: Experiment. The LLM designs test inputs that probe the hypothesis. Minimal pairs: "He left because..." (should activate) vs "He left although..." (shouldn't). Run them through the target model.
Phase 4: Refine. The LLM compares predicted vs actual activations. If the hypothesis is wrong, update it: "Feature actually responds to all causal connectives, not just 'because'." Return to Phase 2.
↻ Repeat until convergence (the hypothesis predicts activations accurately)

Convergence criterion

The loop terminates when the agent's hypothesis accurately predicts activations on a held-out test set. Specifically, the agent's description should let you predict whether the feature will activate on a new input with >90% accuracy. This is the simulation score — can the label "simulate" the feature's behavior?
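One simple reading of this criterion is plain accuracy over binarized activations. The sketch below assumes the agent predicts fire/no-fire for each test input and that a real activation above 0.5 counts as firing. Both choices are illustrative; other definitions, such as the correlation between simulated and real activations, are also used in practice.

```python
from typing import Sequence


def compute_simulation_score(predicted: Sequence[bool],
                             actual: Sequence[float],
                             threshold: float = 0.5) -> float:
    """Fraction of test inputs where the predicted fire/no-fire label
    matches the thresholded real activation."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0
    hits = sum(p == (a > threshold) for p, a in zip(predicted, actual))
    return hits / len(predicted)


# Agent's predictions from its description vs. measured activations (made-up values)
predicted = [True, True, False, False, True]
actual = [0.94, 0.10, 0.02, 0.00, 0.87]

score = compute_simulation_score(predicted, actual)
print(score)  # 0.8: the second input was predicted to fire but did not
```

With a 0.9 convergence threshold, this hypothesis would go back for another round of refinement.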

Simulation Score = Accuracy(predicted activations | description, test inputs)

python
# The agentic interpretability loop
# The helper functions (get_top_k, get_bottom_k, compute_simulation_score,
# update_with_failures) and the model/agent methods are assumed to exist;
# they are placeholders for whatever tooling you use.
class InterpAgent:
    def __init__(self, target_model, agent_llm):
        self.target = target_model
        self.agent = agent_llm

    def analyze_feature(self, feature_idx, dataset, max_iters=5, threshold=0.9):
        # Phase 1: Observe
        activations = self.target.get_activations(feature_idx, dataset)
        top_examples = get_top_k(activations, k=20)
        bottom_examples = get_bottom_k(activations, k=20)

        hypothesis = None
        score = 0.0  # defined even if the loop body never runs
        for _ in range(max_iters):
            # Phase 2: Hypothesize (or revise the previous hypothesis)
            hypothesis = self.agent.generate_hypothesis(
                top_examples, bottom_examples, prev_hypothesis=hypothesis
            )

            # Phase 3: Experiment
            test_cases = self.agent.design_experiments(hypothesis)
            predictions = self.agent.predict_activations(hypothesis, test_cases)
            actuals = self.target.get_activations(feature_idx, test_cases)

            # Phase 4: Refine
            score = compute_simulation_score(predictions, actuals)
            if score > threshold:
                return hypothesis, score
            # Fold the mispredicted test cases into the evidence for the next round
            top_examples = update_with_failures(
                top_examples, test_cases, predictions, actuals
            )

        # No convergence within max_iters: return the best effort so far
        return hypothesis, score

Iterative refinement is the key. A single-pass label ("this neuron is about cooking") is often wrong or too broad. The agentic loop lets the agent discover edge cases and refine: "This neuron responds to cooking VERBS in active voice, not passive ('the cake was baked' doesn't activate). It also responds to eating verbs but with lower activation." This level of precision requires multiple rounds of hypothesis testing, which is exactly what the loop provides.

Failure modes and guardrails

The agentic loop can fail in several ways. Understanding these failure modes is essential for designing reliable agents:

| Failure mode | What happens | Guardrail |
|---|---|---|
| Confirmation bias | Agent generates tests that confirm its hypothesis rather than challenge it | Require adversarial test cases: inputs designed to DISPROVE the hypothesis |
| Overfit to examples | Agent describes the top-5 examples literally rather than abstracting the pattern | Require hypothesis to predict activations on unseen inputs (out-of-distribution test) |
| Premature convergence | Agent declares success at 85% when more iterations would reach 95% | Set minimum iteration count (at least 3 rounds) before allowing convergence |
| Hallucinated patterns | Agent "finds" a pattern that doesn't exist (Type I error) | Require statistical significance: simulation score must exceed random baseline by 3σ |

python
# Guardrail: adversarial test case generation
def generate_adversarial_tests(hypothesis, agent):
    """Generate inputs designed to DISPROVE the hypothesis."""
    prompt = f"""Your hypothesis is: "{hypothesis}"

Generate 10 test inputs that would DISPROVE this hypothesis if it's wrong.
Focus on edge cases and near-misses.

Example: If hypothesis is "cooking verbs", generate:
- "The recipe book was on the shelf" (cooking context, no cooking verb)
- "She drove to the restaurant" (food context, no cooking)
- "He prepared the documents" (non-cooking use of "prepared")"""

    return agent.generate(prompt)
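The statistical-significance guardrail from the table above could be sketched as a permutation test: shuffle the agent's predictions many times, score each shuffle, and require the real score to beat the shuffled mean by three standard deviations. The function names, the shuffle count, and the toy data below are my assumptions; the text does not specify how the random baseline is built.

```python
import random
import statistics
from typing import Sequence


def accuracy(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def exceeds_random_baseline(predicted: Sequence[bool], actual: Sequence[bool],
                            n_shuffles: int = 1000, n_sigma: float = 3.0,
                            seed: int = 0) -> bool:
    """True if the observed accuracy beats a shuffled-prediction baseline by n_sigma."""
    rng = random.Random(seed)
    observed = accuracy(predicted, actual)
    shuffled = list(predicted)
    baseline_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any real link between prediction and input
        baseline_scores.append(accuracy(shuffled, actual))
    mean = statistics.mean(baseline_scores)
    std = statistics.pstdev(baseline_scores)
    return observed > mean + n_sigma * std


# Toy data: 20 inputs where the feature fires, 20 where it does not
actual = [True] * 20 + [False] * 20
good_pred = list(actual)
good_pred[0], good_pred[-1] = False, True   # two mistakes: 95% accuracy
chance_pred = [True, False] * 20            # unrelated to the inputs: 50% accuracy

print(exceeds_random_baseline(good_pred, actual))
print(exceeds_random_baseline(chance_pred, actual))
```

Shuffling keeps the fraction of "fires" predictions fixed, so a description that simply guesses the majority class gets no credit for it.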
[Interactive simulation: Agentic Loop Simulator. Each iteration runs one cycle of the agentic loop; the hypothesis is progressively refined with each round of testing, and the simulation score improves as the agent learns edge cases.]

What determines when the agentic interpretability loop stops iterating?

Chapter 4: Automated Feature Labeling

The most mature application of agentic interpretability is automated feature labeling — having an LLM describe what each feature in a model responds to. This has been validated at scale on sparse autoencoder (SAE) features.

SAE features: the target

Sparse Autoencoders (SAEs) decompose model activations into interpretable directions. An SAE with 16,384 features aims to decompose layer 12's activation space into 16,384 features that are, ideally, monosemantic: each responds to a single concept. The problem: labeling 16,384 features by hand takes months.
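To make the decomposition concrete, here is a tiny, untrained sparse autoencoder in pure Python. It shows only the shape of the computation (encode to a wide non-negative feature vector, decode back, penalize reconstruction error plus L1 sparsity). The class name, the tied decoder weights, the missing decoder bias, and the dimensions are simplifying assumptions, not how any particular SAE library works.

```python
import random
from typing import List

Vector = List[float]


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class TinySAE:
    """Untrained toy sparse autoencoder: x -> ReLU(W_enc x + b_enc) -> W_dec f."""

    def __init__(self, d_model: int, n_features: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.W_enc = [[rng.gauss(0.0, 0.5) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        # Tied decoder (transpose of the encoder): a simplifying assumption.
        self.W_dec = [[self.W_enc[f][d] for f in range(n_features)]
                      for d in range(d_model)]

    def encode(self, x: Vector) -> Vector:
        pre = matvec(self.W_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]  # ReLU

    def decode(self, features: Vector) -> Vector:
        return matvec(self.W_dec, features)

    def loss(self, x: Vector, l1_coeff: float = 1e-3) -> float:
        f = self.encode(x)
        x_hat = self.decode(f)
        reconstruction_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        sparsity = sum(f)  # L1 norm; activations are non-negative after ReLU
        return reconstruction_error + l1_coeff * sparsity


sae = TinySAE(d_model=4, n_features=16)
x = [0.5, -1.0, 0.25, 2.0]
features = sae.encode(x)          # 16 non-negative feature activations
reconstruction = sae.decode(features)
print(len(features), len(reconstruction))
```

Training would minimize `loss` over many activation vectors; the feature labeling discussed in this chapter starts only after that training is done.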

python
# Automated feature labeling pipeline (pseudocode)
# NOTE: load_sae, get_max_activating, corpus, and agent are hypothetical
# placeholders; real libraries such as SAELens expose different APIs.

# 1. Load an SAE trained on a target model
sae = load_sae("gpt2-small-layer12-16384")

# 2. For each of the 16,384 features:
labels = {}
for feat_idx in range(16384):
    # Get top-activating examples from a large corpus
    top_texts = get_max_activating(sae, feat_idx, corpus, k=20)

    # Ask the LLM to label the feature
    labels[feat_idx] = agent.label_feature(top_texts)
    # e.g. "Feature 7823: First-person pronouns in informal writing"
    #      "Feature 12041: Geographic locations in Europe"
    #      "Feature 4519: Mathematical expressions with variables"

# 3. Result: `labels` is a complete dictionary of feature meanings.
# At the single-pass rates in the table below this is roughly 16-32 hours
# sequentially (less if parallelized), versus thousands of hours by hand.

Scale comparison

The speed advantage of automated labeling is enormous:

| Method | Features/hour | Time for 16K features | Cost (est.) |
|---|---|---|---|
| Expert hand-labeling | 2-5 | 3,200-8,000 hours | $320K-$800K (researcher salary) |
| Crowdsourced labeling | 20-50 | 320-800 hours | $16K-$40K (Mechanical Turk) |
| Single-pass LLM | 500-1,000 | 16-32 hours | $50-$100 (API costs) |
| Agentic loop (3 iterations) | 100-200 | 80-160 hours | $500-$1,000 (API costs) |

The agentic loop is slower than single-pass labeling because each feature requires multiple rounds of hypothesis testing. But the quality improvement (85% vs 75% accuracy) is arguably worth the roughly 5x slowdown implied by the table. And it is still about 40-50x faster than expert hand-labeling.
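The times in the table above follow directly from the features-per-hour rates. The sketch below recomputes them for the exact count of 16,384 features, so the values come out slightly higher than the table's rounded "16K" figures. The rates are copied from the table; nothing else is assumed.

```python
N_FEATURES = 16_384

# Features per hour (low, high), copied from the table above
rates = {
    "Expert hand-labeling": (2, 5),
    "Crowdsourced labeling": (20, 50),
    "Single-pass LLM": (500, 1_000),
    "Agentic loop (3 iterations)": (100, 200),
}

# Hours needed at the fast and slow end of each rate range
hours = {method: (N_FEATURES / high, N_FEATURES / low)
         for method, (low, high) in rates.items()}

for method, (fastest, slowest) in hours.items():
    print(f"{method:<28} {fastest:>8,.0f} to {slowest:>8,.0f} hours")
```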

python
# Cost breakdown for labeling 16,384 SAE features
# Using GPT-4 as the agent LLM:
#
# Per feature (agentic loop, 3 iterations):
#   - Observation prompt: ~500 tokens input, ~200 output
#   - Hypothesis prompt:  ~800 tokens input, ~100 output
#   - Test design:        ~400 tokens input, ~300 output
#   - Refinement:         ~600 tokens input, ~200 output
#   - Total per iteration: ~2300 input + ~800 output tokens
#   - Total per feature (3 iters): ~6900 input + ~2400 output
#
# 16,384 features × (6900 × $0.03/1K + 2400 × $0.06/1K)
# = 16,384 × ($0.207 + $0.144)
# = 16,384 × $0.351
# = $5,751 total
#
# Compare: expert labeling at $100/hour × 4000 hours = $400,000
# Automated labeling is 70x cheaper
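The arithmetic in the breakdown above can be reproduced in a few lines. Token counts and per-1K prices are taken from the comments above; the variable and function names are mine.

```python
N_FEATURES = 16_384
ITERATIONS = 3

# Tokens per iteration: observation + hypothesis + test design + refinement
INPUT_TOKENS_PER_ITER = 500 + 800 + 400 + 600    # 2,300
OUTPUT_TOKENS_PER_ITER = 200 + 100 + 300 + 200   # 800

PRICE_IN_PER_1K = 0.03    # dollars per 1K input tokens (from the breakdown)
PRICE_OUT_PER_1K = 0.06   # dollars per 1K output tokens (from the breakdown)


def cost_per_feature(iterations: int = ITERATIONS) -> float:
    input_cost = iterations * INPUT_TOKENS_PER_ITER * PRICE_IN_PER_1K / 1000
    output_cost = iterations * OUTPUT_TOKENS_PER_ITER * PRICE_OUT_PER_1K / 1000
    return input_cost + output_cost


per_feature = cost_per_feature()
total = N_FEATURES * per_feature
expert_cost = 100 * 4000  # $100/hour x 4,000 hours, as in the comparison above

print(f"per feature: ${per_feature:.3f}")
print(f"total:       ${total:,.0f}")
print(f"ratio:       {expert_cost / total:.1f}x cheaper than expert labeling")
```

Because the cost is linear in every input, halving the price per token or dropping to a single iteration scales the total proportionally.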

Quality of automated labels

How good are LLM-generated labels? Bills et al. (2023) found that GPT-4 labels match human labels ~75% of the time and outperform simpler baselines. The agentic loop (with hypothesis testing) pushes this to ~85%. Some features are harder to label than others:

| Feature type | Auto-label quality | Why |
|---|---|---|
| Concrete concepts | ~95% | "Sports", "colors", "animals": obvious from examples |
| Syntactic patterns | ~85% | "Subordinate clauses", "list items": requires linguistic knowledge |
| Abstract/compositional | ~65% | "Irony", "contrast between expectations": subtle, context-dependent |
| Polysemantic | ~40% | Feature responds to multiple unrelated concepts: genuinely hard |

The 85% threshold matters. Even imperfect labels are enormously valuable. Before automated labeling, the vast majority of features were unlabeled; we had zero information about them. An 85%-accurate label is far better than no label. And the 15% error rate is concentrated on the hardest features (abstract, polysemantic) that are difficult for humans too.

[Interactive simulation: Feature Labeling Pipeline. The agent labels SAE features one by one; each feature gets a natural language description and a confidence score, and different feature types get labels of different quality.]

Why are LLM-generated feature labels valuable even at ~85% accuracy?

Chapter 5: Circuit Discovery

Beyond labeling individual features, agentic interpretability aims to discover circuits — the pathways through which features interact to implement computations. This is the frontier of agentic interpretability and currently the hardest part to automate.

What is a circuit?

A circuit is a subgraph of the model's computation graph that implements a specific function. For example, the induction circuit in GPT-2 consists of:

Previous Token Head
Attention head in early layer. Moves information from position i to position i+1.
↓ via residual stream
Induction Head
Attention head in later layer. Looks for the pattern "A B ... A" and predicts "B" will come next.
Output
Model copies the token that followed a previous occurrence of the current token. "The cat sat. The cat" → predicts "sat".
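The input-output behavior of the induction circuit (not its attention-head mechanism) fits in a few lines. This is a behavioral sketch only: the function name is hypothetical, and a real model implements the lookup with attention rather than an explicit loop.

```python
from typing import List, Optional


def induction_predict(tokens: List[str]) -> Optional[str]:
    """Return the token that followed the most recent earlier occurrence
    of the last token, or None if the last token has not appeared before."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions looking for the same token
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


tokens = ["The", "cat", "sat", ".", "The", "cat"]
print(induction_predict(tokens))  # sat
```

Roughly speaking, the previous token head supplies the "what came before me" information that lets the induction head find the earlier match and copy what followed it.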

Discovering this circuit by hand took months of careful analysis by Olsson et al. (2022). The question is: can an LLM agent discover circuits automatically?

Automated circuit discovery

Kim et al. propose that agents can discover circuits by:

python
# Automated circuit discovery (conceptual)
class CircuitAgent:
    def discover_circuit(self, behavior_description):
        # "Find the circuit that makes the model predict
        #  repeated tokens (induction)"

        # Step 1: Identify relevant features
        candidates = self.find_features_relevant_to(behavior_description)
        # Agent queries feature labels: which features activate
        # on induction-like inputs?

        # Step 2: Trace connections
        graph = self.trace_information_flow(candidates)
        # Which features feed into which? What path does
        # information take through the layers?

        # Step 3: Ablation testing
        # Knock out each candidate feature. Does the behavior
        # disappear? If yes, it's part of the circuit.
        circuit = self.prune_by_ablation(graph)

        # Step 4: Verify
        # Run the minimal circuit alone. Does it reproduce
        # the behavior? If yes, we found it.
        return circuit

Ablation-based circuit validation

Once a candidate circuit is identified, the agent validates it through ablation — systematically removing components and measuring whether the target behavior disappears:

python
# Ablation testing for circuit validation
# ablate_feature, ablate_features, and keep_only_features are assumed helpers
# that return a modified copy of the model; behavior_test returns a score.
def validate_circuit(model, circuit_features, behavior_test):
    """Measure how necessary and sufficient a candidate circuit is for a behavior."""

    # 1. Measure baseline behavior
    baseline = behavior_test(model)  # e.g., induction accuracy

    # 2. Ablate each feature in the circuit, one at a time
    per_feature = {}
    for feat in circuit_features:
        model_ablated = ablate_feature(model, feat)
        per_feature[feat] = behavior_test(model_ablated)
        print(f"Remove {feat}: behavior goes from {baseline:.2f} to {per_feature[feat]:.2f}")

    # 3. Ablate the ENTIRE circuit
    model_no_circuit = ablate_features(model, circuit_features)
    no_circuit_result = behavior_test(model_no_circuit)
    print(f"Full circuit ablation: {baseline:.2f} → {no_circuit_result:.2f}")

    # 4. Test sufficiency: does the circuit ALONE reproduce the behavior?
    model_circuit_only = keep_only_features(model, circuit_features)
    sufficiency = behavior_test(model_circuit_only)
    print(f"Circuit alone: {sufficiency:.2f} (of {baseline:.2f} baseline)")

    # A valid circuit should be both necessary (ablation kills behavior)
    # and sufficient (circuit alone reproduces behavior)
    return {
        "baseline": baseline,
        "per_feature": per_feature,
        "full_ablation": no_circuit_result,
        "circuit_only": sufficiency,
    }

Current limitations. Fully automated circuit discovery is still research-stage. The challenge is combinatorial: a model with 10M features has an astronomically large space of possible circuits. Current agents can narrow the search (by using feature labels to filter candidates) but still require human guidance for the final verification step. Kim et al. view this as a "human-in-the-loop" regime that will become fully automated as agents improve.
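To get a feel for the combinatorial challenge mentioned above, count how many unordered sets of k features exist among 10 million. Treating a circuit as a simple unordered set is my simplification; real circuits also have connectivity, so the true search space is larger still.

```python
import math

N_FEATURES = 10_000_000


def n_candidate_sets(n_features: int, k: int) -> int:
    """Number of distinct unordered k-feature subsets."""
    return math.comb(n_features, k)


for k in (2, 3, 5):
    print(f"{k}-feature subsets: {n_candidate_sets(N_FEATURES, k):.3e}")
```

Even three-feature subsets number over 10^20, which is why filtering candidates by their labels before any ablation testing matters so much.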
[Interactive simulation: Circuit Discovery Visualizer. An agent traces a circuit through a model, stepping through feature identification, connection tracing, and ablation testing to find which features form the induction circuit.]

Why is automated circuit discovery harder than automated feature labeling?

Chapter 6: Agent Showcase

Let's bring it all together. This showcase simulates a full agentic interpretability session — an LLM agent examining a model's features, forming hypotheses, running experiments, discovering connections, and producing a comprehensive interpretability report.

[Interactive simulation: Full Interpretability Agent. Runs a complete interpretability session on a selected model layer: the agent labels features, discovers circuits, and generates a report.]

What the agent produces. In a real deployment, the agent outputs: (1) A dictionary of feature labels with confidence scores, (2) A set of discovered circuits with ablation evidence, (3) A list of potentially dangerous features (bias, deception, harmful content), and (4) Recommendations for further investigation. This report, which would take a team months to compile, is generated in hours.
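The four outputs listed above map naturally onto a small data structure. The sketch below is one possible shape for such a report. All class names, field names, and example values are hypothetical and chosen for illustration; the text does not specify a schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeatureLabel:
    feature: int
    description: str
    confidence: float  # agent's confidence in [0, 1]


@dataclass
class Circuit:
    name: str
    components: List[str]
    ablation_drop: float  # drop in behavior score when the circuit is ablated


@dataclass
class InterpReport:
    labels: List[FeatureLabel] = field(default_factory=list)      # (1) feature labels
    circuits: List[Circuit] = field(default_factory=list)         # (2) circuits + evidence
    safety_flags: List[FeatureLabel] = field(default_factory=list)  # (3) flagged features
    recommendations: List[str] = field(default_factory=list)      # (4) follow-ups

    def high_confidence(self, threshold: float = 0.8) -> List[FeatureLabel]:
        return [label for label in self.labels if label.confidence >= threshold]


# Made-up example values, for illustration only
report = InterpReport(
    labels=[
        FeatureLabel(142, "positive sentiment words", 0.91),
        FeatureLabel(441, "gender stereotype associations", 0.84),
        FeatureLabel(300, "unclear, possibly polysemantic", 0.40),
    ],
    circuits=[Circuit("Sentiment aggregation", ["feature 142", "feature 507"], 0.35)],
    recommendations=["Re-run feature 300 with more test cases"],
)
report.safety_flags = [label for label in report.labels if label.feature == 441]

print(len(report.high_confidence()), "of", len(report.labels), "labels are high-confidence")
```

Keeping the ablation evidence next to each circuit, rather than in free text, makes it easier for a human reviewer to spot-check the agent's claims.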
python
# Example agent output (simplified)
report = agent.full_analysis(model="gpt2-small", layer=6)

print(report.summary)
# Layer 6 Analysis Report
# =======================
# Features labeled: 768/768 (100%)
# High-confidence labels: 652 (85%)
# Circuits discovered: 3
#   - Induction circuit (heads 6.1, 6.9)
#   - Sentiment aggregation (features 142, 507, 623)
#   - Name mover (head 6.3 → head 9.6)
# Safety flags: 2
#   - Feature 441: gender stereotype associations
#   - Feature 687: deceptive phrasing patterns

What does a full agentic interpretability session produce?

Chapter 7: Connections

Agentic interpretability sits at the intersection of two rapidly advancing fields: LLM agents and mechanistic interpretability. Kim et al.'s paper is a manifesto for combining them.

| Related work | Year | Relationship |
|---|---|---|
| Bills et al. "Neurons in LLMs" | 2023 | Precursor: GPT-4 labels GPT-2 neurons (passive, no loop) |
| Anthropic SAE features | 2024 | Target: SAE features are what agents label at scale |
| Olsson et al. "Induction Heads" | 2022 | Gold standard: hand-discovered circuit agents aim to automate |
| Schut et al. AlphaZero Concepts | 2024 | Parallel: extracting human-understandable concepts from AI systems |
| Kim et al. AI Vocabulary | 2025 | Sister paper: what if AI concepts can't be named in human language? |
| Kim et al. Neologisms | 2025 | Solution: teach models to create new words for their own concepts |

Open questions

Verification. How do we verify that an agent's interpretability findings are correct? We can't hand-check everything — that defeats the purpose of automation. The simulation score helps but doesn't catch all errors.

Circularity risks. If the agent LLM shares architecture with the target LLM, systematic blind spots in the agent could create systematic blind spots in the interpretability research. Diverse agent architectures may be needed.

Safety applications. The most urgent application is safety auditing — automatically discovering deceptive, biased, or harmful features before deployment. But can we trust an AI agent to honestly report findings about AI safety?

The big picture. Agentic interpretability represents a fundamental shift: from interpretability as a craft practiced by a few dozen researchers to interpretability as an industrial process that scales with model size. If models keep growing, we need interpretability methods that grow with them. LLM agents are the most plausible tool we currently have for matching that pace. The question is whether we can make them reliable enough to trust with something this important.

"The tool and the subject are the same. This is either the most powerful or the most dangerous idea in AI safety — probably both."

[Interactive timeline: Interpretability Methods Timeline. Traces the evolution from hand-analysis to fully agentic interpretability.]

What is the biggest open challenge for agentic interpretability?