AI Vocabulary (Kim 2025)

Chapter 0: The Naming Problem

You're an interpretability researcher. You've found a feature in GPT-4 that activates strongly on certain inputs. You examine the top-activating texts and try to describe the pattern. The feature fires for:

Input	Activation
"The policy seemed progressive but actually reinforced the status quo"	0.92
"He appeared generous while secretly hoarding resources"	0.89
"The medication showed initial improvement followed by a rebound effect"	0.85
"The reform paradoxically strengthened what it tried to weaken"	0.91

What is this feature? "Irony"? Not quite — it doesn't fire on all ironic statements. "Contradiction"? Too broad. "Backfire effect"? Closer, but it also fires on non-causal examples. You try increasingly complex descriptions: "surface-level positive outcome concealing or producing an opposite deeper effect." But this 11-word phrase isn't really a concept name — it's a sentence.

This is the naming problem. The feature represents a real, coherent concept in the model's representation space. It's monosemantic — it activates for a specific pattern. But human language doesn't have a word for that pattern.

The radical thesis: Kim argues that human vocabulary is fundamentally insufficient to describe all the concepts AI systems learn. Not because we're not creative enough with descriptions, but because AI systems carve the world into categories that humans never developed words for. Some AI concepts exist in the gaps between human concepts — they're not combinations of existing concepts but genuinely new ways of categorizing experience. We need neologisms (new words) to name them.

Think of it like color perception. Before the word "orange" existed in English (borrowed from the fruit in the 16th century), English speakers could see the color — but they described it as "red-yellow." The introduction of "orange" didn't change perception; it gave people a word that made communication about that specific experience efficient and precise. AI concepts need the same treatment.

What is the "naming problem" in AI interpretability?

A. AI models learn concepts that are real and coherent (monosemantic features with consistent activation patterns) but that human language has no single word for. The concepts exist in the gaps between human categories, so naming them accurately requires either multi-word descriptions (imprecise) or new vocabulary (neologisms).
B. AI models use technical jargon that's hard for non-experts to understand.
C. Researchers disagree on what to call different model architectures.

Chapter 1: Beyond Human Concepts

Why would AI develop concepts that humans don't have? The answer lies in how concepts form. Human concepts are shaped by evolution, embodiment, language, and culture. AI concepts are shaped by data patterns and loss functions. These are fundamentally different pressures that produce different conceptual systems.

Why human and AI concepts diverge

Pressure	Human Concepts	AI Concepts
Survival	"Danger", "food", "mate" — evolved for fitness	No survival pressure — learns whatever predicts tokens
Embodiment	"Hot", "heavy", "near" — grounded in physical experience	No body — concepts are purely statistical
Language	Concepts that have names get reinforced	Learns from language but isn't constrained to named concepts
Scale	~100K concepts in a lifetime	Millions of features — many more fine-grained distinctions

Humans developed the concept "sarcasm" because it matters for social interaction. We never developed a concept for "surface-positive-deep-negative-reversal" because we didn't need one specific word for that pattern — we could describe it in context. But an AI trained on billions of sentences encounters that pattern enough times that it's worth having a dedicated feature.

The Sapir-Whorf connection

The Sapir-Whorf hypothesis in linguistics suggests that the language you speak shapes the concepts you can think about. Kim extends this to AI: the concepts we can name constrain which AI concepts we can understand. If we can't name a concept, we struggle to reason about it, communicate about it, and build on it.

Evidence from cognitive science

The claim that vocabulary shapes understanding is not only philosophical; it has experimental support. Winawer et al. (2007) showed that Russian speakers (who have two basic words for blue: "goluboy" for light blue, "siniy" for dark blue) discriminate two blue shades faster when the shades fall on opposite sides of the goluboy/siniy boundary, an advantage that English speakers (who have one word: "blue") do not show. The word marks a categorical boundary that affects perceptual discrimination.

Similarly, Lupyan & Ward (2013) found that hearing an object's name made participants more likely to detect an image of that object that had been suppressed from visual awareness. Naming concepts makes them cognitively available for reasoning, comparison, and composition. The vocabulary gap in AI interpretability is an impediment to understanding, not just to communication.

python
# Thought experiment: chemistry without element names
#
# Instead of "H₂O" you'd write:
# "Two atoms of the lightest gas combined with one atom
#  of the gas that sustains combustion"
#
# Chemical reactions become unreadable:
# "The lightest-gas-combustion-gas compound reacts with
#  the soft-silvery-metal to produce..."
#
# This is EXACTLY what interpretability researchers do:
# "Feature 7823 — the one that activates for surface-level
#  positive outcomes concealing deeper negative effects —
#  interacts with Feature 12041..."
#
# With neologisms: "reversk interacts with abstrax"
# Same meaning, 4 words instead of roughly 20

Languages constantly evolve new words for new concepts. "Selfie" (2013), "ghosting" (2015), "doomscrolling" (2020) — these neologisms named experiences that existed but lacked a word. Before "doomscrolling" existed, you could describe the behavior ("compulsively reading bad news on your phone") but you couldn't efficiently talk about it. The word made the concept communicable. AI interpretability faces the same need: concepts that exist but lack words.

python
# Analogy: color naming across languages
#
# Russian has two words for blue:
#   "голубой" (goluboy) = light blue
#   "синий" (siniy) = dark blue
# Russian speakers distinguish these colors FASTER than
# English speakers, who call both "blue"
#
# Similarly: if we have a word for an AI concept,
# researchers can recognize, discuss, and build on it faster
# Without a word, the concept remains fuzzy and hard to study

# The naming spectrum:
# "cats"                → human word exists, AI feature matches
# "formal-to-informal"  → human words exist, but the combination is novel
# "?????"               → no human description captures it precisely

Why do AI models develop concepts that humans don't have words for?

A. Human concepts are shaped by survival, embodiment, language, and culture, pressures that don't apply to AI. AI concepts are shaped by statistical patterns in data and prediction objectives. These different pressures produce different conceptual systems: AI finds patterns that are useful for prediction but that humans never needed to name because they weren't relevant for survival or social interaction.
B. Because AI models process more data than humans.
C. Because AI models are more intelligent than humans.

Chapter 2: The Vocabulary Gap

Kim formalizes the problem. Let H be the set of all human concepts (things we have words for) and A be the set of all AI concepts (features in the model). The vocabulary gap is A \ H — concepts in A that have no counterpart in H.

Vocabulary Gap = A \ H = { c ∈ A : ∄ h ∈ H where h ≈ c }
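
The set definition above can be sketched in a few lines of code. This is only an illustration under stated assumptions: concepts are represented as small embedding vectors, "h ≈ c" is read as cosine similarity at or above a threshold, and the vectors, the 0.8 threshold, and the function names are all hypothetical rather than taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def vocabulary_gap(ai_concepts, human_concepts, threshold=0.8):
    """Return names of AI concepts with no sufficiently similar human concept.

    Implements the set difference "A minus H", with "h ~ c" read as
    cosine(h, c) >= threshold (an assumption made for this sketch).
    """
    gap = []
    for name, vec in ai_concepts.items():
        has_match = any(cosine(vec, h) >= threshold for h in human_concepts.values())
        if not has_match:
            gap.append(name)
    return gap

# Toy 2-D vectors, invented for illustration only.
human = {"irony": (1.0, 0.0), "contradiction": (0.0, 1.0)}
ai = {
    "feature_a": (0.99, 0.05),  # close to "irony"
    "feature_b": (0.7, 0.7),    # sits between the two human concepts
}
gap = vocabulary_gap(ai, human)
print(gap)
```

In this toy setup only `feature_b` lands in the gap, because it is not close enough to either human concept. The result depends entirely on the chosen threshold, which is why the definition's "≈" needs to be pinned down before the gap can be measured.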

How big is this gap? Empirical evidence suggests it's large. When researchers at Anthropic trained sparse autoencoders on Claude and labeled the resulting features, they found three categories:

Category	% of Features	Description
Named concepts	~40%	Match a human concept: "animals", "code syntax", "French"
Describable patterns	~35%	Can be described but have no single word: "transition from formal to casual register"
Unnameable	~25%	No clear human description: researchers write "???" or multi-sentence descriptions

The 25% "unnameable" features are the vocabulary gap. These are real features — they activate consistently, they're monosemantic (respond to one pattern), and they influence model behavior. We just can't name them.

This isn't a labeling problem — it's a vocabulary problem. The issue isn't that we haven't tried hard enough to describe these features. It's that human language literally doesn't have the right words. When you try to describe an unnameable feature, you end up with paragraphs that approximate the concept but never capture it exactly. This is like trying to describe the color "ultraviolet" to a species that can't see it — you can say "it's beyond purple" but you can't communicate the qualia.

The cost of the gap

The vocabulary gap has real consequences for interpretability research:

Communication failure. If researcher A finds an unnameable feature and describes it in 50 words, researcher B may misinterpret the description. A single word ("sarcasm") communicates instantly; 50 words communicate approximately.

Reasoning bottleneck. It's harder to reason about concepts you can't name. Try thinking about chess strategy if you don't know the words "fork" or "pin" — you can see the patterns but you can't efficiently analyze them.

Cumulative knowledge failure. Science progresses by naming things. Taxonomy precedes theory. If we can't name AI concepts, we can't build taxonomies, can't formulate theories about how they interact, and can't train the next generation of researchers to recognize them.

Why is the vocabulary gap a fundamental problem rather than just a labeling challenge?

A. It's not that we haven't tried hard enough: human language genuinely lacks words for ~25% of AI features. These features represent novel patterns that human experience never required naming. Without words, researchers can't communicate precisely, can't reason efficiently, and can't build cumulative scientific knowledge about these concepts.
B. Because there are too many features to label.
C. Because AI models are black boxes.

Chapter 3: Evidence from Models

Kim provides concrete evidence that AI concepts don't reduce to human concepts. The evidence comes from three sources: sparse autoencoder features, multimodal models, and cross-model comparison.

SAE feature analysis

When researchers train sparse autoencoders on LLMs, many discovered features correspond to known concepts ("the Golden Gate Bridge", "Python code", "medical terminology"). But a significant fraction don't. Kim catalogs examples of features that resist human naming:

examples of unnameable features
# Feature 7823: Activates for
# - "The meeting was productive but something felt off"
# - "Her smile didn't reach her eyes"
# - "The data showed improvement yet the trend worried analysts"
# Best human description: "surface-level positive with underlying negative"
# But this misses cases where it activates without explicit sentiment

# Feature 12410: Activates for
# - Sentences about to undergo a topic shift
# - But ONLY when the current topic is concrete and the next is abstract
# - "The bridge cost $4 billion. Democracy requires..."
# Best description: "concrete-to-abstract topic transition"
# But this is a 5-word description, not a concept name

# Feature 19001: Activates for
# - ??? Researchers genuinely cannot characterize the pattern
# - Activations are consistent (high inter-annotator agreement
#   that the same inputs activate it)
# - But no human can articulate what the inputs share
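
Catalogs like the one above are usually built by ranking texts by how strongly they activate a feature. The sketch below shows that ranking step only, assuming the activations for one feature are already available as a text-to-score mapping; the scores and the helper name `top_activating` are invented for illustration.

```python
def top_activating(activations, k=3, min_activation=0.0):
    """Return the k texts with the highest activation, highest first.

    `activations` maps text -> activation for ONE feature. In practice the
    values would come from a sparse autoencoder; here they are given directly.
    """
    ranked = sorted(activations.items(), key=lambda item: item[1], reverse=True)
    return [(text, score) for text, score in ranked if score > min_activation][:k]

# Invented activations for illustration only.
feature_7823 = {
    "The meeting was productive but something felt off": 0.88,
    "Her smile didn't reach her eyes": 0.81,
    "The cat sat on the mat": 0.02,
    "The data showed improvement yet the trend worried analysts": 0.84,
}
top = top_activating(feature_7823, k=3)
for text, score in top:
    print(f"{score:.2f}  {text}")
```

A human labeler then reads the top-ranked texts and tries to say what they share, which is exactly the step where naming breaks down for the features listed above.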

Quantifying the gap

Kim proposes a formal metric for measuring how well human labels describe AI features. The description fidelity score measures whether a human-language label can predict a feature's activations:

Fidelity(label, feature) = Accuracy(predict activations | label, test inputs)

A fidelity score of 0.95 means the label almost perfectly predicts the feature. On a test set balanced between activating and non-activating inputs, a fidelity score of 0.50 means the label is no better than random guessing: the feature encodes something the label doesn't capture.

python
# Measuring description fidelity
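def measure_fidelity(label, feature, test_inputs, evaluator_llm, threshold=0.5):
    """How well does a human-language label predict feature activations?

    `evaluator_llm.predict` (prompt -> text) and `feature.get_activation`
    (text -> float) are assumed interfaces; `threshold` is the activation
    level above which the feature counts as active (0.5 is a placeholder).
    """
    if not test_inputs:
        raise ValueError("test_inputs must not be empty")
    correct = 0
    for text in test_inputs:
        # Ask evaluator: "Given label L, would this text activate the feature?"
        answer = evaluator_llm.predict(
            f"Feature description: {label}\nText: {text}\nWould this activate? (yes/no)"
        )
        # Convert the free-text reply to a boolean before comparing.
        predicted = answer.strip().lower().startswith("yes")
        actual = feature.get_activation(text) > threshold
        if predicted == actual:
            correct += 1
    return correct / len(test_inputs)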

# Results across feature categories:
# Named concepts:     avg fidelity = 0.93 (labels work!)
# Describable:        avg fidelity = 0.72 (labels approximate)
# Unnameable:         avg fidelity = 0.48 (labels fail — near random)

The 0.48 fidelity for unnameable features is the smoking gun. These features have consistent activation patterns (human annotators agree on which inputs activate them), but no human description can predict those activations. The pattern is real but linguistically inexpressible. This is the hard evidence for the vocabulary gap.
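
Whether a score near 0.50 really means "near random" depends on how many test inputs activate the feature. The sketch below, using invented data, compares plain accuracy with a majority-class baseline and with balanced accuracy; the function names are illustrative and not part of the paper's method.

```python
def accuracy(predicted, actual):
    """Fraction of inputs where the prediction matches the true activation."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def majority_baseline(actual):
    """Accuracy of always guessing the more common class."""
    positives = sum(actual)
    return max(positives, len(actual) - positives) / len(actual)

def balanced_accuracy(predicted, actual):
    """Mean of recall on activating and non-activating inputs; 0.5 is chance.

    Assumes both classes are present in `actual`.
    """
    pos = [p for p, a in zip(predicted, actual) if a]
    neg = [p for p, a in zip(predicted, actual) if not a]
    recall_pos = sum(pos) / len(pos)
    recall_neg = sum(not p for p in neg) / len(neg)
    return (recall_pos + recall_neg) / 2

# Invented example: the feature fires on 2 of 10 inputs,
# and the label-based predictor always answers "no".
actual = [True, True] + [False] * 8
predicted = [False] * 10

print(accuracy(predicted, actual))           # 0.8
print(majority_baseline(actual))             # 0.8
print(balanced_accuracy(predicted, actual))  # 0.5
```

Here a useless label still scores 0.8 on plain accuracy because most inputs don't activate the feature. Reporting fidelity on a balanced test set, or reporting balanced accuracy, keeps 0.50 meaningful as the chance level.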

The "alien concept" test

Kim proposes a test for whether a concept is genuinely novel: if three expert interpretability researchers independently fail to produce a concise label that accurately predicts the feature's activations (simulation score >80%), the concept is a candidate for neologism creation.
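
The test reduces to a simple decision rule. The following is a minimal sketch of it, assuming each researcher's best label has already been given a simulation score between 0 and 1; the function name and its parameters are hypothetical.

```python
def is_neologism_candidate(simulation_scores, min_researchers=3, pass_mark=0.80):
    """Apply the "alien concept" test to one feature.

    `simulation_scores` holds the best simulation score achieved by each
    researcher's label. The feature is a candidate for a neologism when at
    least `min_researchers` tried and none scored above `pass_mark`.
    """
    if len(simulation_scores) < min_researchers:
        raise ValueError(f"need scores from at least {min_researchers} researchers")
    return all(score <= pass_mark for score in simulation_scores)

# Scores reported for the three labels in the Feature 19001 case study.
print(is_neologism_candidate([0.61, 0.58, 0.55]))  # True

# Invented counter-example: one label clears the 80% bar.
print(is_neologism_candidate([0.61, 0.91, 0.55]))  # False
```

The rule is deliberately conservative: a single adequate human label is enough to keep the feature out of the neologism queue.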

Case study: Feature 19001

Kim presents Feature 19001 as a detailed case study of an unnameable concept. Three interpretability researchers spent 4 hours each trying to label it. Here's what happened:

case study
# Feature 19001: Top-10 activating texts
# 1. "The meeting ended abruptly when she brought up the budget"
# 2. "He paused mid-sentence, reconsidering his words"
# 3. "The conversation shifted to something unexpected"
# 4. "She changed the subject after a brief silence"
# 5. "The mood in the room suddenly became tense"
# 6. "His expression changed when she mentioned the project"
# 7. "There was an awkward pause before anyone spoke"
# 8. "The dynamic between them shifted perceptibly"
# 9. "Something unspoken passed between the two colleagues"
# 10. "The negotiation took an unexpected turn"
#
# Researcher A: "social tension moments"
#   Simulation score: 0.61 (too narrow: the feature also fires on non-tense moments)
#
# Researcher B: "conversational pivot points"
#   Simulation score: 0.58 (too narrow — misses non-verbal examples)
#
# Researcher C: "interpersonal state transitions"
#   Simulation score: 0.55 (too abstract — predicts many false positives)
#
# All three labels hover near chance (0.50)
# The feature is real (high inter-annotator agreement on what activates it)
# But no human label captures the pattern

The pattern is real but linguistically inexpressible. All three researchers could recognize examples that activated the feature — they had high agreement on which new texts would activate it. But they couldn't articulate what all the texts had in common in a way that generalized. The feature detects something about interpersonal dynamics in text that humans can perceive but can't name. This is the strongest evidence for the vocabulary gap: humans can recognize the pattern (perceptual access) but can't verbalize it (linguistic access).

Some concepts may be inherently non-linguistic. Consider: can you describe the taste of coffee in words that would let someone who's never tasted it recognize it? Some experiences resist verbal encoding. Similarly, some AI concepts may encode patterns that are too high-dimensional, too context-dependent, or too cross-modal to be captured in a single word or short phrase. These concepts are real — they just can't be compressed into human language.

A taxonomy of the unnameable

Not all unnameable features are unnameable in the same way. Kim identifies three subtypes:

Subtype	Why It Resists Naming	Example	% of Unnameable
Cross-categorical	The feature spans multiple human concept categories simultaneously	Activates for "authority signals" — combines formal language, declarative structure, and institutional context into one feature	~45%
Sub-categorical	The feature is a precise subdivision of a human concept that we never split	"Sarcasm-subset-3": activates for sarcasm that uses understatement but NOT for sarcasm that uses exaggeration	~35%
Alien	The feature detects a pattern that humans cannot articulate and may not consciously perceive	Activates for texts that are about to undergo a specific statistical transition in token probability distribution	~20%

Cross-categorical features are the most common: they carve the world differently from human categories, combining things we separate or separating things we combine. Sub-categorical features are finer-grained versions of existing concepts. Alien features are the most fascinating — they detect patterns that may be genuinely imperceptible to humans.

python
# Examples of each subtype

# Cross-categorical: Feature 8842
# Activates for: formal requests, passive-aggressive emails,
#   diplomatic statements, corporate apologies
# Human would separate: formality, passive-aggression, diplomacy
# The AI sees ONE pattern: "surface politeness encoding constraint"

# Sub-categorical: Feature 5510
# Activates for: sarcasm using litotes ("not entirely wrong")
# Does NOT activate for: sarcasm using hyperbole ("oh GREAT")
# Humans call both "sarcasm" — the AI distinguishes them

# Alien: Feature 19001
# Activates for: texts where an interpersonal dynamic shifts
# Humans can recognize it (high inter-annotator agreement)
# But cannot articulate the pattern (simulation scores 0.55-0.61, near chance)

Evidence from multimodal models

Multimodal models (like CLIP) learn concepts that span vision and language simultaneously. Some of these cross-modal concepts have no human equivalent because humans process vision and language in separate brain regions. A CLIP feature might encode "the visual-linguistic pattern of authority" — a concept that combines visual cues (posture, framing) with linguistic cues (formal vocabulary, declarative structure) in a way humans never unified into a single concept.

python
# Cross-modal concepts in CLIP: evidence
#
# Feature 7291 in CLIP ViT-L/14 activates for:
#
# VISUAL inputs:
#   - Images with centered subject, low angle, warm lighting
#   - Portraits of people in professional attire
#   - Images of podiums, stages, official settings
#
# TEXT inputs:
#   - "The CEO addressed the shareholders"
#   - "According to leading experts"
#   - "The official statement reads..."
#
# Humans have SEPARATE concepts for visual and textual authority:
#   Photography: "low angle shot" (makes subject look powerful)
#   Rhetoric: "appeal to authority" (linguistic persuasion)
#
# CLIP has ONE concept that spans both modalities.
# This cross-modal "authority" concept has no human name.
# It's not "authority" (too general) — it's specifically the
# SENSORY SIGNATURE of authority across vision and language.

Multimodal models create the most alien concepts. Language models at least operate in the space of human language: their features are about text patterns. CLIP encodes images and text with separate encoders but trains both into a single shared embedding space, so one feature direction can respond to inputs from either modality. Humans have no comparable shared representation that they can introspect on, so the concepts that emerge from this joint space may be fundamentally alien to human cognition.

What is the "alien concept test" for determining if an AI concept needs a neologism?

A. If three expert researchers independently fail to produce a concise label that accurately predicts the feature's activations (simulation score >80%), the concept is genuinely novel. It resists naming not because the researchers aren't trying hard enough, but because human vocabulary lacks the right word for that specific pattern.
B. If the feature has low activation on common inputs.
C. If the feature is in a deep layer of the network.

Chapter 4: Neologisms for AI

If human vocabulary is insufficient, we need to create new words. Kim proposes a framework for creating neologisms — new words specifically designed to name AI concepts.

Properties of good neologisms

Property	Why It Matters	Example
Pronounceable	Must be spoken in discussions, not just written	"voltex" not "x7f2a"
Memorable	Researchers must recall and use it	Short, distinctive sound pattern
Precise	Must map to exactly one concept	One neologism per feature, no ambiguity
Compositional	Can combine with existing words	"voltex-sensitive", "high voltex"

How to generate neologisms

Kim proposes several strategies:

Strategy 1: Morphological combination. Combine existing morphemes to create a new word that hints at the concept. Example: "contraflow" for the "surface-positive-deep-negative" feature (contra + flow). This is how most natural neologisms form ("smartphone" = smart + phone).

Strategy 2: Semantic compression. Use an LLM to compress a multi-sentence description into a single novel word. The LLM can be trained or prompted to generate pronounceable neologisms that are semantically evocative.

Strategy 3: Arbitrary assignment. Assign a short arbitrary label (like chemical element symbols). Less intuitive but avoids misleading connotations. "Feature V7" is less biased than "contraflow" (which might wrongly suggest the feature is about rivers).

python
# Generating neologisms with an LLM
prompt = """You are creating new English words for AI concepts.
Given a description of a neural network feature, generate 3 candidate
neologisms that are:
- Pronounceable (1-3 syllables)
- Memorable
- Evocative of the concept without being misleading

Feature description: "Activates when a sentence describes a surface-level
positive outcome that conceals or produces a deeper negative effect.
Examples: backfire effects, pyrrhic victories, deceptive improvements."

Candidate neologisms:"""

# LLM output:
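# 1. "reversk" (reverse + mask): the hidden reversal
# 2. "velix" (veil + paradox): the veiled paradox
# 3. "contravene" (contra + veneer): against the surface
#    (caution: "contravene" is already an English verb meaning "to violate",
#    so it fails the "not misleading" criterion)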
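
Generated candidates can be screened automatically before any human testing. The sketch below checks the criteria from the prompt in a rough way; the syllable count is a crude vowel-group heuristic, and `EXISTING_WORDS` is a tiny stand-in for a real dictionary, so both are assumptions made for illustration.

```python
import re

# Tiny stand-in for a real dictionary; an assumption for this sketch.
EXISTING_WORDS = {"contravene", "contraflow", "sarcasm", "irony"}

def estimate_syllables(word):
    """Crude syllable estimate: count vowel runs, discounting a final silent 'e'."""
    w = word.lower()
    groups = len(re.findall(r"[aeiouy]+", w))
    if w.endswith("e") and groups > 1:
        groups -= 1
    return groups

def screen_candidate(word, existing_words=EXISTING_WORDS):
    """Return a list of reasons to reject `word`; an empty list means it passes."""
    problems = []
    if not word.isalpha():
        problems.append("contains non-letter characters")
    syllables = estimate_syllables(word)
    if not 1 <= syllables <= 3:
        problems.append(f"estimated {syllables} syllables, want 1-3")
    if word.lower() in existing_words:
        problems.append("already an existing word")
    return problems

for candidate in ["reversk", "velix", "contravene", "x7f2a"]:
    print(candidate, screen_candidate(candidate))
```

A screen like this only removes obviously unsuitable candidates. Whether a surviving word is memorable or evocative still has to be judged by people.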

Neologisms must be tested, not just invented. A good neologism passes two tests. First, recognition test: after learning the word and its definition, can a researcher correctly identify new feature activations using just the neologism? Second, communication test: can two researchers discuss the concept efficiently using the neologism? If the word doesn't pass both tests, it's not doing its job.

What makes a good neologism for naming an AI concept?

A. It must be pronounceable, memorable, precise (maps to exactly one concept), and compositional (can combine with existing words). It must also pass two tests: researchers can correctly identify feature activations using just the neologism (recognition), and two researchers can discuss the concept efficiently using it (communication).
B. It should be as long and descriptive as possible.
C. It should be a random string of characters.

Chapter 5: Fidelity vs Familiarity

Kim identifies a fundamental tradeoff in naming AI concepts: fidelity (how accurately the name captures the concept) vs. familiarity (how easily a human can understand and use the name).

The tradeoff

At one extreme, you can use a familiar name like "sarcasm" — easy to understand but inaccurate (the feature doesn't exactly match human sarcasm). At the other extreme, you can use a precise description like "surface-positive-deep-negative reversal in consequentialist contexts" — accurate but unusable in conversation.

Best name = argmax over candidate names of (Fidelity × Familiarity)

Name	Fidelity	Familiarity	Product
"sarcasm"	0.4 (too broad)	1.0 (everyone knows it)	0.40
"sarcasm-adjacent reversal pattern"	0.7	0.5	0.35
"contraflow" (neologism)	0.85	0.7 (after learning)	0.60
"surface-positive-deep-negative..."	0.95	0.2 (a sentence)	0.19

The neologism "contraflow" wins because it achieves high fidelity (precisely naming the concept) with reasonable familiarity (learnable, memorable, usable).
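
The selection rule can be checked directly against the table. This short sketch multiplies the two scores for each candidate and picks the largest product; the numbers are copied from the table above, and the helper name `name_score` is illustrative.

```python
# (name, fidelity, familiarity) copied from the table above.
candidates = [
    ("sarcasm", 0.40, 1.0),
    ("sarcasm-adjacent reversal pattern", 0.70, 0.5),
    ("contraflow", 0.85, 0.7),
    ("surface-positive-deep-negative...", 0.95, 0.2),
]

def name_score(fidelity, familiarity):
    """Product used to trade accuracy of a name against ease of use."""
    return fidelity * familiarity

best = max(candidates, key=lambda c: name_score(c[1], c[2]))
for name, fid, fam in candidates:
    print(f"{name:35s} {name_score(fid, fam):.3f}")
print("best:", best[0])
```

Using a product rather than a sum means a name that scores near zero on either axis cannot win, which matches the intuition that an unusable name is as unhelpful as an inaccurate one.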

Familiarity is dynamic. When a neologism is first introduced, familiarity is low. But if it's good (pronounceable, memorable), familiarity increases with use. "Quark" meant nothing in 1964; now every physics student knows it. "Eigenvalue" was a neologism; now it's standard vocabulary. The investment in learning a new word pays off in perpetuity — every future discussion of that concept becomes more efficient.

Historical precedent: science constantly needs new words

Kim draws an analogy to scientific terminology throughout history. When scientists discover genuinely new phenomena, they create new words:

Neologism	Year	Why Existing Words Failed
"Oxygen"	1777	"Dephlogisticated air" was a description, not a concept name
"Quark"	1964	No existing word for a fundamental particle with fractional charge
"Gene"	1909	"Unit of heredity" was a description — "gene" made it a concept
"Meme"	1976	"Culturally transmitted idea unit" — Dawkins needed one word
"Doomscrolling"	2020	"Compulsively reading negative news on phone" — too long

In each case, the neologism didn't create the phenomenon — it already existed. The word made it thinkable, communicable, and researchable. AI concepts need the same treatment.

python
# The efficiency of naming
# Without neologism: 22 words to ask the question
# "Did you find any features that activate when the text
#  describes a surface-level positive outcome that actually
#  produces a deeper negative effect?"
#
# With neologism: 6 words
# "Did you find any reversk features?"
#
# At one such question per discussion, 100 research discussions save ~1,600 words
# More importantly: it prevents misunderstanding
# "surface-level positive" might be interpreted differently
# by different researchers. "reversk" is unambiguous.

The danger of borrowed names

A common mistake: labeling an AI feature with a familiar but imprecise human word. Calling a feature "sarcasm" when it's not exactly sarcasm creates misleading familiarity. Researchers think they understand the feature ("oh, it detects sarcasm") but their mental model is wrong. This is worse than having no label at all — it creates false confidence.

python
# The misleading familiarity trap
#
# Researcher labels Feature 7823 as "sarcasm"
# → Team assumes it detects sarcasm
# → They test: "Oh great, another meeting" → activates? YES ✓
# → They conclude the feature works for sarcasm detection
# → But it ALSO activates for non-sarcastic backfire effects:
#   "The medication helped initially but worsened symptoms"
# → This is NOT sarcasm — it's a genuine medical observation
# → The "sarcasm" label led to incorrect conclusions
#
# A neologism like "reversk" would avoid this trap
# No one would assume "reversk" means "sarcasm"

Why is "misleading familiarity" worse than having no label at all?

A. A familiar but imprecise label (like "sarcasm" for a feature that detects something broader) creates false confidence: researchers think they understand the feature based on the name, form incorrect mental models, and draw wrong conclusions. An unlabeled feature at least signals uncertainty. A neologism avoids both problems: it's precise and doesn't trigger incorrect associations.
B. Because no label is always better than a wrong label.
C. Because familiar words are harder to remember.

Chapter 6: Concept Space Explorer

Picture human concepts and AI concepts as two overlapping regions. Named concepts sit in the overlap. The gap, meaning AI concepts without human words, is where neologisms are needed.

The gap grows with scale. As models get larger (more parameters, more training data), they learn more fine-grained concepts, many of which fall outside human vocabulary. By the estimates in Chapter 7, GPT-2 Small has roughly 4,000 unnameable features, while a GPT-4-scale model may have about 2.5 million. The vocabulary gap is not a fixed-size problem; it scales with AI capability. This makes neologism creation not a one-time effort but an ongoing research program.

Why does the vocabulary gap grow as models get larger?

A. Larger models learn more fine-grained distinctions: not just "sarcasm" but dozens of subtly different sarcasm-adjacent patterns, each a separate feature. Many of these fine-grained concepts fall between existing human categories and have no human name. The vocabulary gap scales with model capability, making neologism creation an ongoing research program.
B. Because larger models are harder to analyze.
C. Because larger models use more words in their vocabulary.

Chapter 7: Connections

This paper is one of a group of related papers by Been Kim and collaborators, each addressing a different aspect of the human-AI concept gap. The table lists it alongside three others.

Paper	Question	Answer
This paper	Can we understand AI with existing words?	No — we need neologisms
Neologism Learning	Can AI create and use its own new words?	Yes — through neologism training
Agentic Interpretability	Can we automate interpretability?	Yes — with LLM agents
AlphaZero Concepts	Can AI concepts transfer to humans?	Yes — and they improve human performance

How many neologisms do we need?

Kim estimates the scale of the vocabulary gap based on current SAE research:

python
# Estimating the neologism need
# GPT-2 Small: ~16K SAE features
#   - Named:      ~6,400 (40%)
#   - Describable: ~5,600 (35%)
#   - Unnameable:  ~4,000 (25%) → need 4,000 neologisms

# Claude/GPT-4 (estimated): ~10M SAE features
#   - Named:      ~4,000,000 (40%)
#   - Describable: ~3,500,000 (35%)
#   - Unnameable:  ~2,500,000 (25%) → need 2.5M neologisms!

# This is obviously impractical for manual creation
# Automated neologism generation (via LLM agents) is necessary
# See: Neologism Learning (Kim et al., 2025)
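
The arithmetic behind these estimates is a fixed percentage split applied to a feature count. A short sketch, assuming the 40/35/25 split holds at every scale (which the estimate above takes for granted); the function name is illustrative.

```python
CATEGORY_SHARES = {"named": 0.40, "describable": 0.35, "unnameable": 0.25}

def estimate_counts(total_features, shares=CATEGORY_SHARES):
    """Split a total feature count across categories using fixed shares."""
    return {category: round(total_features * share) for category, share in shares.items()}

gpt2_small = estimate_counts(16_000)
large_model = estimate_counts(10_000_000)
print(gpt2_small)
print(large_model)
```

Because the shares are held constant, the number of neologisms needed grows linearly with the number of features. If larger models turned out to have a higher unnameable share, the need would grow faster than this sketch suggests.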

2.5 million new words sounds absurd, but consider that English has ~170,000 words in current use (Oxford English Dictionary) and approximately 250,000 total including obsolete words. Scientific terminology already adds thousands of new words per year. The AI vocabulary challenge is qualitatively similar — just at a much larger scale.

Not all neologisms are equally important. Of the 2.5M unnameable features, most are low-importance (rarely active, minimal impact on behavior). A practical prioritization: name the top 1,000 by importance first. This covers the features that most influence model behavior and are most relevant for safety auditing. The remaining ~2.5M can be named incrementally as needed.
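
Picking the features to name first is a top-k selection. The sketch below assumes each unnameable feature already has an importance score; how that score is computed (activation frequency, effect on outputs, or something else) is left open, and the IDs and scores shown are invented.

```python
import heapq

def prioritize(features, k=1000):
    """Return the k features with the highest importance, highest first.

    `features` is an iterable of (feature_id, importance) pairs.
    """
    return heapq.nlargest(k, features, key=lambda pair: pair[1])

# Invented importance scores for illustration only.
unnameable = [(19001, 0.92), (8842, 0.15), (5510, 0.47), (7291, 0.80)]
to_name_first = prioritize(unnameable, k=2)
print(to_name_first)
```

`heapq.nlargest` avoids sorting the full list, which matters when the pool holds millions of features and only the top thousand are needed.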

Implications for alignment

If we can't name AI concepts, we can't audit them. Safety-critical properties like "deceptiveness" might be inadequately captured by human vocabulary. An AI might have a feature that's related to deception but not identical — and calling it "deception" would be misleading familiarity. Neologisms could enable more precise safety auditing.

The philosophical depth. This paper is ultimately about the relationship between language and understanding. The Sapir-Whorf hypothesis applied to AI: the concepts we can name determine the concepts we can think about. If AI develops concepts beyond our vocabulary, we face a genuine epistemic challenge — not just a labeling problem, but a limit on what we can know about these systems without expanding our conceptual toolkit.

"The limits of my language mean the limits of my world." — Ludwig Wittgenstein. The limits of our vocabulary mean the limits of our AI understanding.

What is the paper's deepest implication for AI safety?

A. If we can't name AI concepts, we can't audit them. Safety-critical properties may be inadequately captured by existing vocabulary: calling a feature "deception" when it's something subtly different creates misleading familiarity that undermines safety analysis. Neologisms could enable more precise identification and monitoring of potentially dangerous AI behaviors.
B. That AI models should be smaller.
C. That interpretability research should be stopped.

We Can't Understand AI Using Our Existing Vocabulary