Because we have LLMs, we Can and Should Pursue Agentic Interpretability — use LLM agents to automate interpretability research itself: hypothesis generation, feature analysis, circuit discovery at scale.
You're a neuroscientist studying the brain. Your subject has 86 billion neurons. You can record from one neuron at a time. At one neuron per minute, it would take you 163,000 years to look at each one. You'd never finish. You'd never even begin to understand the circuits.
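The 163,000-year figure follows directly from the numbers in the paragraph above. A quick check, assuming nonstop recording and 365-day years (both assumptions of this sketch):

```python
NEURONS = 86_000_000_000          # neurons in the brain, as stated above
MINUTES_PER_YEAR = 60 * 24 * 365  # assumes round-the-clock recording

# One neuron per minute means total minutes equals total neurons.
years_needed = NEURONS / MINUTES_PER_YEAR
print(f"{years_needed:,.0f} years")
```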
Mechanistic interpretability faces the same problem. GPT-4's size is undisclosed, but estimates range from hundreds of billions to over a trillion parameters, which may encode millions of features. Researchers have been studying individual neurons, attention heads, and circuits by hand: painstakingly activating neurons, examining what inputs light them up, and forming hypotheses about their function. The results are beautiful: we've found "induction heads" that copy patterns, features that detect sentiment, circuits that perform arithmetic.
But hand-analysis doesn't scale. At the current rate of human interpretability research, we will never understand modern models before even larger ones are deployed.
| Model | Parameters | Estimated Features | Time to hand-analyze |
|---|---|---|---|
| GPT-2 Small | 117M | ~50K | ~50 researcher-years |
| GPT-2 XL | 1.5B | ~500K | ~500 researcher-years |
| LLaMA-2 70B | 70B | ~10M | ~10,000 researcher-years |
| GPT-4 | ~1.7T (est.) | ~100M+ | ~100,000+ researcher-years |
Think of it like using microscopes to study microscopes. The tool becomes both the subject and the instrument of investigation. Kim et al. argue this isn't circular — it's practical. We don't need to fully understand LLMs to use them productively for interpretability research, just as we don't need to fully understand the brain to use it for neuroscience.
The gap between model complexity and our capacity to understand it by hand keeps growing as models scale.
Before we can automate interpretability research, we need to be precise about what it means. Kim et al. define interpretability along three dimensions, each of which can be partially automated.
Feature-level: What does a single neuron, attention head, or SAE (sparse autoencoder) feature respond to? Example: "This feature activates for mentions of the Golden Gate Bridge." This is the most basic level — labeling what individual components do.
Circuit-level: How do multiple features connect to implement a computation? Example: "These three attention heads form an induction circuit that copies repeated patterns." This is mechanistic — it explains how the model computes, not just what individual parts respond to.
Behavioral-level: How does the model behave on specific inputs, and why? Example: "The model is biased toward associating 'doctor' with 'male' because feature X in layer 12 encodes gender stereotypes." This connects internal mechanisms to observable behavior.
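Good interpretability research follows the scientific method: observe activations, form a hypothesis, design and run experiments, then refine the hypothesis based on the results.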
Each of these steps involves pattern recognition, language understanding, and reasoning — exactly the capabilities LLMs excel at. The insight is that an LLM can perform this entire loop automatically, faster than a human researcher.
```python
# The interpretability research loop: done by hand vs. by agent

# HUMAN RESEARCHER (days to weeks per feature):
#   1. Record activations on 1000 inputs
#   2. Manually inspect top-activating inputs
#   3. Think: "these all seem to be about cooking..."
#   4. Generate test cases: "bake cake" vs "drive car"
#   5. Verify: feature activates for cooking, not driving
#   6. Publish: "Feature 3427 = cooking concept"

# LLM AGENT (minutes per feature):
#   1. Same activation recording (automated)
#   2. LLM reads top-activating texts, produces hypothesis
#   3. LLM generates diverse test cases
#   4. Tests are run automatically
#   5. LLM evaluates results, refines hypothesis
#   6. Outputs: "Feature 3427 = cooking-related verbs"
```

Each level answers different questions, and higher levels build on lower ones.
What makes LLMs suitable for interpretability research? Kim et al. identify four capabilities that align with the demands of interpretability work.
Given the top-20 texts that maximally activate a neuron, a human researcher reads them and identifies commonalities. An LLM can do the same — and it's often better at spotting subtle patterns across many examples.
```python
# Example: auto-labeling with an LLM
# `llm` is a placeholder for any text-generation client with a generate() method.
top_activating_texts = [
    "She baked a chocolate cake for her birthday",
    "The chef prepared a soufflé in the kitchen",
    "He grilled the salmon to perfection",
    "Mix the flour and eggs until smooth",
    "The restaurant served excellent pasta",
]

examples = "\n".join(top_activating_texts)
prompt = f"""You are an interpretability researcher. Below are the top-5 texts
that maximally activate a specific neuron in a language model.

{examples}

What concept does this neuron likely encode? Be specific and testable."""

label = llm.generate(prompt)
# Example response:
# "This neuron encodes cooking/food-preparation verbs and
#  food-related nouns, particularly in the context of
#  preparing or serving meals."
```
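The top-activating texts fed to the labeling prompt have to be selected from a larger corpus first. A minimal sketch of that ranking step follows; the helper names `top_k_examples` and `bottom_k_examples` and the activation values are hypothetical, chosen only for illustration:

```python
import heapq

def top_k_examples(texts, activations, k=20):
    """Return the k (activation, text) pairs with the highest activation."""
    return heapq.nlargest(k, zip(activations, texts), key=lambda pair: pair[0])

def bottom_k_examples(texts, activations, k=20):
    """Return the k (activation, text) pairs with the lowest activation."""
    return heapq.nsmallest(k, zip(activations, texts), key=lambda pair: pair[0])

# Made-up activations for one feature on a tiny corpus.
texts = [
    "She baked a chocolate cake",
    "The stock market fell sharply",
    "He grilled the salmon",
    "The train arrived late",
]
activations = [0.92, 0.03, 0.88, 0.01]

top = top_k_examples(texts, activations, k=2)
bottom = bottom_k_examples(texts, activations, k=2)
for activation, text in top:
    print(f"{activation:.2f}  {text}")
```

Keeping the lowest-activating examples as well gives the LLM contrast cases, which helps it avoid labels that are too broad.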
After forming an initial label, the LLM can generate testable hypotheses: "If this neuron encodes cooking verbs, it should NOT activate for 'The chef drove to the store' (no cooking verb) and SHOULD activate for 'I sautéed the onions' (cooking verb in novel context)."
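A hypothesis like the one above becomes testable once each probe sentence carries an expected outcome. The sketch below uses a keyword-based `toy_feature` as a stand-in for a real model feature; the `Probe` class, `run_probes` helper, and verb list are all assumptions made for illustration:

```python
from dataclasses import dataclass

# Toy stand-in for a real feature: fires on any word in this set.
TRIGGER_WORDS = {"baked", "grilled", "sautéed", "mix", "prepared"}

def toy_feature(text):
    words = {w.strip(".,").lower() for w in text.split()}
    return 1.0 if words & TRIGGER_WORDS else 0.0

@dataclass
class Probe:
    text: str
    should_activate: bool  # what the hypothesis "cooking verbs" predicts

def run_probes(feature_fn, probes, threshold=0.5):
    """Return the probes on which the hypothesis mispredicted the feature."""
    return [p for p in probes if (feature_fn(p.text) > threshold) != p.should_activate]

probes = [
    Probe("The chef drove to the store", should_activate=False),
    Probe("I sautéed the onions", should_activate=True),
    Probe("He prepared the documents", should_activate=False),  # near-miss
]
failures = run_probes(toy_feature, probes)
print(f"{len(probes) - len(failures)}/{len(probes)} predictions held")
for p in failures:
    print("Hypothesis failed on:", p.text)
```

The near-miss probe is the informative one: the toy feature fires on "prepared" outside a cooking context, so the hypothesis needs refining.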
The LLM can design targeted experiments — minimal pairs, ablation studies, counterfactual inputs — that would take a human researcher hours to construct.
After running experiments, the LLM can interpret the results, update its hypothesis, and iterate. This closes the loop — making the research process fully automated.
Let's trace a complete agent session on a real feature. The agent is analyzing Feature 4519 from a GPT-2 sparse autoencoder:
```text
# === Agent Session: Feature 4519 ===
# Top-5 activating texts:
#   1. "The derivative of x² is 2x"              (activation: 0.94)
#   2. "Solve for y in the equation 3y + 7 = 22" (0.91)
#   3. "The integral converges for |x| < 1"      (0.89)
#   4. "Let f(x) = sin(x) + cos(x)"              (0.87)
#   5. "Calculate the eigenvalues of matrix A"   (0.85)
#
# Agent Hypothesis (round 1):
#   "Mathematical expressions with variables"
#   Simulation score: 0.72
#
# Agent tests: "The price was $42"        → activates? YES (unexpected!)
# Agent tests: "Two plus two equals four" → activates? NO  (expected)
# Agent tests: "The velocity v = d/t"     → activates? YES (expected)
#
# Agent Hypothesis (round 2, refined):
#   "Mathematical notation with symbolic variables (not spelled out numbers)"
#   Simulation score: 0.86
#
# Agent tests: "f(x) = x" → YES ✓ | "the function of x" → NO ✓
# Agent tests: "$100" → YES (symbols count!) | "one hundred" → NO ✓
#
# Agent Hypothesis (round 3, final):
#   "Symbolic mathematical notation: equations, formulas, and expressions
#    using mathematical symbols (=, +, ², ∫) and single-letter variables
#    (x, y, f, n), including currency symbols ($, €)"
#   Simulation score: 0.93 → CONVERGED
```
OpenAI's "Language models can explain neurons in language models" (Bills et al., 2023) was a precursor. They used GPT-4 to label every neuron in GPT-2 — producing natural language descriptions of what each neuron responds to. Kim et al. extend this from passive labeling to active research: not just describing features, but discovering circuits, testing hypotheses, and proposing new interpretability methods.
Kim et al. propose a specific architecture for interpretability agents. It's a loop — the agent iterates between observation, hypothesis, experiment, and refinement until it converges on a precise explanation.
The loop terminates when the agent's hypothesis accurately predicts activations on a held-out test set. Specifically, the agent's description should let you predict whether the feature will activate on a new input with >90% accuracy. This is the simulation score — can the label "simulate" the feature's behavior?
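The stopping rule above can be made concrete. This is a minimal sketch that treats the simulation score as on/off prediction accuracy, as the paragraph describes; the 0.5 activation threshold and the sample values are assumptions, and other scoring rules (for example, correlating predicted and real activations) are possible:

```python
def compute_simulation_score(predicted, actual, threshold=0.5):
    """Fraction of inputs where the predicted on/off state matches the real one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0
    hits = sum((p > threshold) == (a > threshold) for p, a in zip(predicted, actual))
    return hits / len(predicted)

# Held-out inputs: activations predicted from the label vs. measured on the model.
predicted = [0.9, 0.8, 0.1, 0.2, 0.7]
actual = [0.94, 0.91, 0.05, 0.60, 0.85]

score = compute_simulation_score(predicted, actual)
converged = score > 0.9
print(f"simulation score = {score:.2f}, converged = {converged}")
```

Here the label mispredicts the fourth input, so the agent would run another iteration instead of stopping.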
```python
# The agentic interpretability loop.
# get_top_k, get_bottom_k, compute_simulation_score, and update_with_failures
# are helper functions assumed to be defined elsewhere.
class InterpAgent:
    def __init__(self, target_model, agent_llm):
        self.target = target_model
        self.agent = agent_llm

    def analyze_feature(self, feature_idx, dataset, max_iters=5):
        # Phase 1: Observe
        activations = self.target.get_activations(feature_idx, dataset)
        top_examples = get_top_k(activations, k=20)
        bottom_examples = get_bottom_k(activations, k=20)

        hypothesis, score = None, 0.0  # score is defined even if the loop never runs
        for _ in range(max_iters):
            # Phase 2: Hypothesize
            hypothesis = self.agent.generate_hypothesis(
                top_examples, bottom_examples, prev_hypothesis=hypothesis
            )

            # Phase 3: Experiment
            test_cases = self.agent.design_experiments(hypothesis)
            predictions = self.agent.predict_activations(hypothesis, test_cases)
            actuals = self.target.get_activations(feature_idx, test_cases)

            # Phase 4: Refine
            score = compute_simulation_score(predictions, actuals)
            if score > 0.9:
                return hypothesis, score

            # Loop back: add the mispredicted test cases to the evidence
            # instead of discarding the original examples.
            top_examples = update_with_failures(
                top_examples, test_cases, predictions, actuals
            )

        return hypothesis, score
```
The agentic loop can fail in several ways. Understanding these failure modes is essential for designing reliable agents:
| Failure Mode | What Happens | Guardrail |
|---|---|---|
| Confirmation bias | Agent generates tests that confirm its hypothesis rather than challenge it | Require adversarial test cases — inputs designed to DISPROVE the hypothesis |
| Overfit to examples | Agent describes the top-5 examples literally rather than abstracting the pattern | Require hypothesis to predict activations on unseen inputs (out-of-distribution test) |
| Premature convergence | Agent declares success at 85% when more iterations would reach 95% | Set minimum iteration count (at least 3 rounds) before allowing convergence |
| Hallucinated patterns | Agent "finds" a pattern that doesn't exist (Type I error) | Require statistical significance: simulation score must exceed random baseline by 3σ |
```python
# Guardrail: adversarial test case generation
def generate_adversarial_tests(hypothesis, agent):
    """Generate inputs designed to DISPROVE the hypothesis."""
    prompt = f"""Your hypothesis is: "{hypothesis}"

Generate 10 test inputs that would DISPROVE this hypothesis if it's wrong.
Focus on edge cases and near-misses.

Example: If hypothesis is "cooking verbs", generate:
- "The recipe book was on the shelf" (cooking context, no cooking verb)
- "She drove to the restaurant" (food context, no cooking)
- "He prepared the documents" (non-cooking use of "prepared")"""
    return agent.generate(prompt)
```

With each round of testing, the hypothesis is progressively refined, and the simulation score improves as the agent learns edge cases.
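The statistical-significance guardrail from the table can be sketched as a shuffle test: compare the real score against scores obtained after randomly permuting the predictions. The function names, the number of shuffles, and the sample data below are assumptions of this sketch, not a prescribed procedure:

```python
import random
import statistics

def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def exceeds_random_baseline(predicted, actual, n_shuffles=1000, sigmas=3.0, seed=0):
    """True if the observed score beats the shuffled-prediction baseline by `sigmas`."""
    rng = random.Random(seed)
    observed = accuracy(predicted, actual)
    shuffled = list(predicted)
    baseline = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any real link between prediction and input
        baseline.append(accuracy(shuffled, actual))
    mean = statistics.fmean(baseline)
    std = statistics.pstdev(baseline)
    if std == 0:
        return observed > mean  # degenerate case: every shuffle scores the same
    return (observed - mean) / std > sigmas

actual = [True] * 20 + [False] * 20

good = list(actual)
good[0], good[20] = False, True   # a strong label: 38 of 40 correct
chance = [True, False] * 20       # a label that is right only half the time

print(exceeds_random_baseline(good, actual))
print(exceeds_random_baseline(chance, actual))
```

A label that only matches at chance level fails the check even if its raw score looks respectable on an unbalanced test set, which is the case this guardrail is meant to catch.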
The most mature application of agentic interpretability is automated feature labeling — having an LLM describe what each feature in a model responds to. This has been validated at scale on sparse autoencoder (SAE) features.
Sparse Autoencoders (SAEs) decompose model activations into interpretable directions. An SAE with 16,384 features might decompose layer 12's activation space into 16,384 features that are ideally monosemantic, each responding to a single concept. The problem: at expert hand-labeling speeds, labeling 16,384 features takes thousands of researcher-hours.
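To make the decomposition concrete, here is a toy forward pass in the common SAE form: a ReLU encoder producing feature activations and a linear decoder reconstructing the input. The weights are hand-picked for illustration, not trained, and the sizes (2-dimensional activation, 4 features) are far smaller than any real SAE:

```python
def relu(vector):
    return [max(0.0, v) for v in vector]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_encode(x, w_enc, b_enc):
    """Feature activations: ReLU(W_enc x + b_enc)."""
    return relu([h + b for h, b in zip(matvec(w_enc, x), b_enc)])

def sae_decode(features, w_dec, b_dec):
    """Reconstruction: W_dec f + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, features), b_dec)]

# Hand-picked toy weights: each feature detects one signed direction.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b_dec = [0.0, 0.0]

x = [0.5, -2.0]
features = sae_encode(x, w_enc, b_enc)
reconstruction = sae_decode(features, w_dec, b_dec)
active = [i for i, f in enumerate(features) if f > 0]
print("features:", features)
print("active feature indices:", active)
print("reconstruction:", reconstruction)
```

Only two of the four features fire for this input, and together they reconstruct it. Labeling then asks what each feature index means.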
```python
# Automated feature labeling pipeline (pseudocode).
# load_sae, get_max_activating, and agent.label_feature are illustrative
# placeholders, not the API of a specific library.

# 1. Load an SAE trained on the target model
sae = load_sae("gpt2-small-layer12-16384")

# 2. Label each of the 16,384 features
labels = {}
for feat_idx in range(16384):
    # Get top-activating examples from a large corpus
    top_texts = get_max_activating(sae, feat_idx, corpus, k=20)

    # Ask the LLM to label the feature
    labels[feat_idx] = agent.label_feature(top_texts)
    # "Feature 7823: First-person pronouns in informal writing"
    # "Feature 12041: Geographic locations in Europe"
    # "Feature 4519: Mathematical expressions with variables"

# 3. Result: a complete dictionary of feature meanings.
# Single-pass labeling at 500-1,000 features/hour covers 16,384 features
# in roughly 16-32 hours, versus 3,200-8,000 hours by hand.
```
The speed advantage of automated labeling is enormous:
| Method | Features/hour | Time for 16K features | Cost (est.) |
|---|---|---|---|
| Expert hand-labeling | 2-5 | 3,200-8,000 hours | $320K-$800K (researcher salary) |
| Crowdsourced labeling | 20-50 | 320-800 hours | $16K-$40K (Mechanical Turk) |
| Single-pass LLM | 500-1,000 | 16-32 hours | $50-$100 (API costs) |
| Agentic loop (3 iterations) | 100-200 | 80-160 hours | ~$5,750 (API costs at GPT-4 pricing) |
The agentic loop is slower than single-pass labeling because each feature requires multiple rounds of hypothesis testing. But the quality improvement (85% vs 75% accuracy) is worth the roughly 5x slowdown. And it's still 20-100x faster than expert hand-labeling.
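The hour ranges in the table follow from the throughput column. A short sketch that recomputes them, using only the features-per-hour figures from the table (the dictionary keys are names chosen for this example):

```python
N_FEATURES = 16_384

# Features per hour (low, high), taken from the table above.
THROUGHPUT = {
    "expert": (2, 5),
    "crowdsourced": (20, 50),
    "single_pass_llm": (500, 1000),
    "agentic_loop": (100, 200),
}

def hours_needed(method):
    """Return (best_case_hours, worst_case_hours) for labeling every feature."""
    low, high = THROUGHPUT[method]
    return N_FEATURES / high, N_FEATURES / low

for method in THROUGHPUT:
    best, worst = hours_needed(method)
    print(f"{method:16s} {best:8.0f} to {worst:8.0f} hours")
```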
```python
# Cost breakdown for labeling 16,384 SAE features
# Using GPT-4 as the agent LLM:
#
# Per feature (agentic loop, 3 iterations):
#   - Observation prompt: ~500 tokens input, ~200 output
#   - Hypothesis prompt:  ~800 tokens input, ~100 output
#   - Test design:        ~400 tokens input, ~300 output
#   - Refinement:         ~600 tokens input, ~200 output
#   - Total per iteration: ~2300 input + ~800 output tokens
#   - Total per feature (3 iters): ~6900 input + ~2400 output
#
# 16,384 features × (6900 × $0.03/1K + 2400 × $0.06/1K)
#   = 16,384 × ($0.207 + $0.144)
#   = 16,384 × $0.351
#   = $5,751 total
#
# Compare: expert labeling at $100/hour × 4000 hours = $400,000
# Automated labeling is roughly 70x cheaper
```
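The same arithmetic as a small function, so the token counts or prices can be changed. The default prices are the per-1K-token figures used in the breakdown above; the function name and signature are assumptions of this sketch:

```python
def labeling_cost(n_features, input_tokens, output_tokens,
                  input_price_per_1k=0.03, output_price_per_1k=0.06):
    """Return (cost per feature, total cost) in dollars."""
    per_feature = (input_tokens / 1000 * input_price_per_1k
                   + output_tokens / 1000 * output_price_per_1k)
    return per_feature, per_feature * n_features

# Three iterations per feature: ~6900 input and ~2400 output tokens.
per_feature, total = labeling_cost(16_384, input_tokens=6_900, output_tokens=2_400)

expert_cost = 100 * 4_000  # $100/hour × 4000 hours
ratio = expert_cost / total
print(f"${per_feature:.3f} per feature, ${total:,.0f} total, "
      f"about {ratio:.0f}x cheaper than expert labeling")
```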
How good are LLM-generated labels? Bills et al. (2023) found that GPT-4 labels match human labels ~75% of the time and outperform simpler baselines. The agentic loop (with hypothesis testing) pushes this to ~85%. Some features are harder to label than others:
| Feature Type | Auto-label Quality | Why |
|---|---|---|
| Concrete concepts | ~95% | "Sports", "colors", "animals" — obvious from examples |
| Syntactic patterns | ~85% | "Subordinate clauses", "list items" — requires linguistic knowledge |
| Abstract/compositional | ~65% | "Irony", "contrast between expectations" — subtle, context-dependent |
| Polysemantic | ~40% | Feature responds to multiple unrelated concepts — genuinely hard |
Beyond labeling individual features, agentic interpretability aims to discover circuits — the pathways through which features interact to implement computations. This is the frontier of agentic interpretability and currently the hardest part to automate.
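A circuit is a subgraph of the model's computation graph that implements a specific function. For example, the induction circuit in GPT-2 consists of a previous-token head in an earlier layer, which records which token came before each position, and an induction head in a later layer, which uses that information to attend to the token that followed an earlier occurrence of the current token and copy it forward.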
Discovering this circuit by hand took months of careful analysis by Olsson et al. (2022). The question is: can an LLM agent discover circuits automatically?
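Kim et al. propose that agents can discover circuits by identifying features relevant to a behavior, tracing the connections between them, pruning candidates by ablation, and verifying that the remaining circuit reproduces the behavior: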
```python
# Automated circuit discovery (conceptual)
class CircuitAgent:
    def discover_circuit(self, behavior_description):
        # e.g. "Find the circuit that makes the model predict
        # repeated tokens (induction)"

        # Step 1: Identify relevant features
        # Agent queries feature labels: which features activate
        # on induction-like inputs?
        candidates = self.find_features_relevant_to(behavior_description)

        # Step 2: Trace connections
        # Which features feed into which? What path does
        # information take through the layers?
        graph = self.trace_information_flow(candidates)

        # Step 3: Ablation testing
        # Knock out each candidate feature. Does the behavior
        # disappear? If yes, it's part of the circuit.
        circuit = self.prune_by_ablation(graph)

        # Step 4: Verify
        # Run the minimal circuit alone. Does it reproduce
        # the behavior? If yes, we found it.
        return circuit
```
Once a candidate circuit is identified, the agent validates it through ablation — systematically removing components and measuring whether the target behavior disappears:
```python
# Ablation testing for circuit validation.
# ablate_feature, ablate_features, and keep_only_features are assumed helpers
# that return a modified copy of the model.
def validate_circuit(model, circuit_features, behavior_test):
    """Does removing the circuit disable the behavior?"""
    # 1. Measure baseline behavior (e.g., induction accuracy)
    baseline = behavior_test(model)

    # 2. Ablate each feature in the circuit
    for feat in circuit_features:
        model_ablated = ablate_feature(model, feat)
        result = behavior_test(model_ablated)
        print(f"Remove {feat}: behavior drops from {baseline:.2f} to {result:.2f}")

    # 3. Ablate the ENTIRE circuit
    model_no_circuit = ablate_features(model, circuit_features)
    no_circuit_result = behavior_test(model_no_circuit)
    print(f"Full circuit ablation: {baseline:.2f} → {no_circuit_result:.2f}")

    # 4. Test sufficiency: does the circuit ALONE reproduce the behavior?
    model_circuit_only = keep_only_features(model, circuit_features)
    sufficiency = behavior_test(model_circuit_only)
    print(f"Circuit alone: {sufficiency:.2f} (of {baseline:.2f} baseline)")

    # A valid circuit should be both necessary (ablation kills behavior)
    # and sufficient (circuit alone reproduces behavior)
    return baseline, no_circuit_result, sufficiency
```
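The necessity and sufficiency checks can be run end to end on a toy model. The sketch below assumes a strongly simplified setting in which each feature contributes additively to a behavior score; real models are not additive, and the feature names and weights are invented for illustration:

```python
def behavior_score(active_features, weights):
    """Toy behavior metric: summed contribution of the features left active."""
    return sum(weights[f] for f in active_features)

def validate_toy_circuit(weights, circuit):
    all_features = set(weights)
    baseline = behavior_score(all_features, weights)
    if baseline == 0:
        raise ValueError("baseline behavior is zero; nothing to validate")
    # Necessity: remove the circuit, keep everything else.
    without_circuit = behavior_score(all_features - circuit, weights)
    # Sufficiency: keep only the circuit.
    circuit_only = behavior_score(circuit, weights)
    return {
        "baseline": baseline,
        "necessity_drop": (baseline - without_circuit) / baseline,
        "sufficiency": circuit_only / baseline,
    }

weights = {
    "prev_token_head": 0.45,
    "induction_head": 0.45,
    "unrelated_a": 0.05,
    "unrelated_b": 0.05,
}
result = validate_toy_circuit(weights, {"prev_token_head", "induction_head"})
print({name: round(value, 2) for name, value in result.items()})
```

In this toy setting, ablating the two-head circuit removes about 90% of the behavior and the circuit alone recovers about 90% of it, which is the pattern a valid circuit should show.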
Let's bring it all together. This showcase simulates a full agentic interpretability session — an LLM agent examining a model's features, forming hypotheses, running experiments, discovering connections, and producing a comprehensive interpretability report.
```python
# Example agent output (simplified)
report = agent.full_analysis(model="gpt2-small", layer=6)
print(report.summary)

# Layer 6 Analysis Report
# =======================
# Features labeled: 768/768 (100%)
# High-confidence labels: 652 (85%)
# Circuits discovered: 3
#   - Induction circuit (heads 6.1, 6.9)
#   - Sentiment aggregation (features 142, 507, 623)
#   - Name mover (head 6.3 → head 9.6)
# Safety flags: 2
#   - Feature 441: gender stereotype associations
#   - Feature 687: deceptive phrasing patterns
```
Agentic interpretability sits at the intersection of two rapidly advancing fields: LLM agents and mechanistic interpretability. Kim et al.'s paper is a manifesto for combining them.
| Related Work | Year | Relationship |
|---|---|---|
| Bills et al. "Neurons in LLMs" | 2023 | Precursor: GPT-4 labels GPT-2 neurons (passive, no loop) |
| Anthropic SAE features | 2024 | Target: SAE features are what agents label at scale |
| Olsson et al. "Induction Heads" | 2022 | Gold standard: hand-discovered circuit agents aim to automate |
| Schut et al. AlphaZero Concepts | 2024 | Parallel: extracting human-understandable concepts from AI systems |
| Kim et al. AI Vocabulary | 2025 | Sister paper: what if AI concepts can't be named in human language? |
| Kim et al. Neologisms | 2025 | Solution: teach models to create new words for their own concepts |
Verification. How do we verify that an agent's interpretability findings are correct? We can't hand-check everything — that defeats the purpose of automation. The simulation score helps but doesn't catch all errors.
Circularity risks. If the agent LLM shares architecture with the target LLM, systematic blind spots in the agent could create systematic blind spots in the interpretability research. Diverse agent architectures may be needed.
Safety applications. The most urgent application is safety auditing — automatically discovering deceptive, biased, or harmful features before deployment. But can we trust an AI agent to honestly report findings about AI safety?
"The tool and the subject are the same. This is either the most powerful or the most dangerous idea in AI safety — probably both."