Human concepts may be insufficient to describe AI representations. We need new vocabulary — "neologisms" — to name novel AI concepts that have no human equivalent.
You're an interpretability researcher. You've found a feature in GPT-4 that activates strongly on certain inputs. You examine the top-activating texts and try to describe the pattern. The feature fires for:
| Input | Activation |
|---|---|
| "The policy seemed progressive but actually reinforced the status quo" | 0.92 |
| "He appeared generous while secretly hoarding resources" | 0.89 |
| "The medication showed initial improvement followed by a rebound effect" | 0.85 |
| "The reform paradoxically strengthened what it tried to weaken" | 0.91 |
What is this feature? "Irony"? Not quite — it doesn't fire on all ironic statements. "Contradiction"? Too broad. "Backfire effect"? Closer, but it also fires on non-causal examples. You try increasingly complex descriptions: "surface-level positive outcome concealing or producing an opposite deeper effect." But this 11-word phrase isn't really a concept name — it's a sentence.
This is the naming problem. The feature represents a real, coherent concept in the model's representation space. It's monosemantic — it activates for a specific pattern. But human language doesn't have a word for that pattern.
Think of it like color perception. Before the word "orange" existed in English (borrowed from the fruit in the 16th century), English speakers could see the color — but they described it as "red-yellow." The introduction of "orange" didn't change perception; it gave people a word that made communication about that specific experience efficient and precise. AI concepts need the same treatment.
Why would AI develop concepts that humans don't have? The answer lies in how concepts form. Human concepts are shaped by evolution, embodiment, language, and culture. AI concepts are shaped by data patterns and loss functions. These are fundamentally different pressures that produce different conceptual systems.
| Pressure | Human Concepts | AI Concepts |
|---|---|---|
| Survival | "Danger", "food", "mate" — evolved for fitness | No survival pressure — learns whatever predicts tokens |
| Embodiment | "Hot", "heavy", "near" — grounded in physical experience | No body — concepts are purely statistical |
| Language | Concepts that have names get reinforced | Learns from language but isn't constrained to named concepts |
| Scale | ~100K concepts in a lifetime | Millions of features — many more fine-grained distinctions |
Humans developed the concept "sarcasm" because it matters for social interaction. We never developed a concept for "surface-positive-deep-negative-reversal" because we didn't need one specific word for that pattern — we could describe it in context. But an AI trained on billions of sentences encounters that pattern enough times that it's worth having a dedicated feature.
The Sapir-Whorf hypothesis in linguistics suggests that the language you speak shapes the concepts you can think about. Kim extends this to AI: the concepts we can name constrain which AI concepts we can understand. If we can't name a concept, we struggle to reason about it, communicate about it, and build on it.
The claim that vocabulary shapes understanding isn't just philosophy; it has experimental support. Winawer et al. (2007) studied Russian speakers, whose language has two basic words for blue ("goluboy" for light blue, "siniy" for dark blue). They discriminated two shades faster when the shades fell on opposite sides of that boundary than when both fell in the same category. English speakers, who have one word ("blue"), showed no such category advantage. The word boundary shows up as a categorical boundary in perception.
Similarly, Lupyan & Ward (2013) found that hearing an object's name helped people detect images of that object that were otherwise suppressed from visual awareness. Naming concepts makes them cognitively available for reasoning, comparison, and composition. The vocabulary gap in AI interpretability is an impediment to understanding, not just to communication.
```python
# Thought experiment: chemistry without element names
#
# Instead of "H₂O" you'd write:
#   "Two atoms of the lightest gas combined with one atom
#    of the gas that sustains combustion"
#
# Chemical reactions become unreadable:
#   "The lightest-gas-combustion-gas compound reacts with
#    the soft-silvery-metal to produce..."
#
# This is EXACTLY what interpretability researchers do:
#   "Feature 7823 (the one that activates for surface-level
#    positive outcomes concealing deeper negative effects)
#    interacts with Feature 12041..."
#
# With neologisms: "reversk interacts with abstrax"
# Same meaning, 4 words instead of about 20
```
```python
# Analogy: color naming across languages
#
# Russian has two words for blue:
#   "голубой" (goluboy) = light blue
#   "синий" (siniy) = dark blue
# Russian speakers distinguish these colors FASTER than
# English speakers, who call both "blue"
#
# Similarly: if we have a word for an AI concept,
# researchers can recognize, discuss, and build on it faster
# Without a word, the concept remains fuzzy and hard to study

# The naming spectrum:
#   "cats"               → human word exists, AI feature matches
#   "formal-to-informal" → human words exist, but the combination is novel
#   "?????"              → no human description captures it precisely
```
Kim formalizes the problem. Let H be the set of all human concepts (things we have words for) and A be the set of all AI concepts (features in the model). The vocabulary gap is A \ H — concepts in A that have no counterpart in H.
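As a minimal sketch of the set difference A \ H, the toy example below treats concepts as plain strings. The concept names are hypothetical placeholders, and exact string matching is a simplifying assumption: deciding whether an AI feature really "has a counterpart" in human vocabulary is the hard part that the rest of this article addresses.

```python
# Toy illustration of the vocabulary gap A \ H using Python sets.
# All names below are hypothetical placeholders, not real model features.
human_concepts = {"animals", "code syntax", "French", "sarcasm"}
ai_concepts = {"animals", "code syntax", "French", "feature_7823", "feature_19001"}

# Set difference: AI concepts with no counterpart among human concepts.
vocabulary_gap = ai_concepts - human_concepts

# Share of AI concepts that fall into the gap.
gap_fraction = len(vocabulary_gap) / len(ai_concepts)

print(sorted(vocabulary_gap))
print(f"{gap_fraction:.0%} of AI concepts have no human name")
```

Note that the difference is asymmetric: "sarcasm" is in H but not in A here, and it does not count toward the gap.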
How big is this gap? Empirical evidence suggests it's large. When researchers at Anthropic trained sparse autoencoders on Claude and labeled the resulting features, they found three categories:
| Category | % of Features | Description |
|---|---|---|
| Named concepts | ~40% | Match a human concept: "animals", "code syntax", "French" |
| Describable patterns | ~35% | Can be described but have no single word: "transition from formal to casual register" |
| Unnameable | ~25% | No clear human description: researchers write "???" or multi-sentence descriptions |
The 25% "unnameable" features are the vocabulary gap. These are real features — they activate consistently, they're monosemantic (respond to one pattern), and they influence model behavior. We just can't name them.
The vocabulary gap has real consequences for interpretability research:
Communication failure. If researcher A finds an unnameable feature and describes it in 50 words, researcher B may misinterpret the description. A single word ("sarcasm") communicates instantly; 50 words communicate approximately.
Reasoning bottleneck. It's harder to reason about concepts you can't name. Try thinking about chess strategy if you don't know the words "fork" or "pin" — you can see the patterns but you can't efficiently analyze them.
Cumulative knowledge failure. Science progresses by naming things. Taxonomy precedes theory. If we can't name AI concepts, we can't build taxonomies, can't formulate theories about how they interact, and can't train the next generation of researchers to recognize them.
Kim provides concrete evidence that AI concepts don't reduce to human concepts. The evidence comes from three sources: sparse autoencoder features, multimodal models, and cross-model comparison.
When researchers train sparse autoencoders on LLMs, many discovered features correspond to known concepts ("the Golden Gate Bridge", "Python code", "medical terminology"). But a significant fraction don't. Kim catalogs examples of features that resist human naming:
```python
# Examples of unnameable features

# Feature 7823: Activates for
#   - "The meeting was productive but something felt off"
#   - "Her smile didn't reach her eyes"
#   - "The data showed improvement yet the trend worried analysts"
# Best human description: "surface-level positive with underlying negative"
# But this misses cases where it activates without explicit sentiment

# Feature 12410: Activates for
#   - Sentences about to undergo a topic shift
#   - But ONLY when the current topic is concrete and the next is abstract
#   - "The bridge cost $4 billion. Democracy requires..."
# Best description: "concrete-to-abstract topic transition"
# But this is a 5-word description, not a concept name

# Feature 19001: Activates for
#   - ??? Researchers genuinely cannot characterize the pattern
#   - Activations are consistent (high inter-annotator agreement
#     that the same inputs activate it)
#   - But no human can articulate what the inputs share
```
Kim proposes a formal metric for measuring how well human labels describe AI features. The description fidelity score measures whether a human-language label can predict a feature's activations:
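The formula is not reproduced here. One reconstruction, assuming the score is the plain accuracy that the `measure_fidelity` code in this section computes (the notation is an assumption, not Kim's):

$$
\mathrm{fidelity}(\ell, f) \;=\; \frac{1}{|X|} \sum_{x \in X} \mathbb{1}\!\left[\, \hat{y}_{\ell}(x) \;=\; \mathbb{1}\!\left[ a_f(x) > \tau \right] \right]
$$

Here $X$ is the set of test inputs, $\hat{y}_{\ell}(x) \in \{0, 1\}$ is the evaluator's yes/no prediction given only the label $\ell$, $a_f(x)$ is the activation of feature $f$ on input $x$, and $\tau$ is the activation threshold that separates "active" from "inactive".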
A fidelity score of 0.95 means the label almost perfectly predicts the feature. A fidelity score of 0.50 means the label is no better than random guessing on a test set with equal numbers of activating and non-activating inputs: the feature encodes something the label doesn't capture.
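```python
# Measuring description fidelity
def measure_fidelity(label, feature, test_inputs, evaluator_llm, threshold=0.5):
    """How well does a human-language label predict feature activations?

    `evaluator_llm.predict` and `feature.get_activation` are placeholder
    interfaces. `threshold` separates "active" from "inactive"; the default
    of 0.5 is an assumption, since the original leaves it undefined.
    """
    if not test_inputs:
        raise ValueError("test_inputs must not be empty")

    correct = 0
    for text in test_inputs:
        # Ask evaluator: "Given label L, would this text activate the feature?"
        answer = evaluator_llm.predict(
            f"Feature description: {label}\nText: {text}\nWould this activate? (yes/no)"
        )
        # Convert the yes/no reply to a boolean before comparing
        # (assumes the evaluator returns text starting with "yes" or "no")
        predicted = answer.strip().lower().startswith("yes")
        actual = feature.get_activation(text) > threshold
        if predicted == actual:
            correct += 1
    return correct / len(test_inputs)

# Results across feature categories:
#   Named concepts: avg fidelity = 0.93 (labels work!)
#   Describable:    avg fidelity = 0.72 (labels approximate)
#   Unnameable:     avg fidelity = 0.48 (labels fail, near random)
```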
Kim proposes a test for whether a concept is genuinely novel: if three expert interpretability researchers independently fail to produce a concise label that accurately predicts the feature's activations (simulation score >80%), the concept is a candidate for neologism creation.
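The test can be sketched as a small predicate. This is a hedged reading of the rule above, not Kim's implementation: the function name and parameters are hypothetical, and it assumes each researcher's best label has already been scored, with a label counting as a success only when its simulation score exceeds 0.80.

```python
def is_neologism_candidate(simulation_scores, threshold=0.80, min_researchers=3):
    """Return True if every independent labeling attempt failed.

    simulation_scores: best simulation score per researcher (0.0 to 1.0).
    A label "succeeds" only if its score is strictly above `threshold`.
    """
    if len(simulation_scores) < min_researchers:
        raise ValueError(
            f"need at least {min_researchers} independent attempts, "
            f"got {len(simulation_scores)}"
        )
    # The concept is a candidate only when no attempt clears the bar.
    return all(score <= threshold for score in simulation_scores)


# Scores reported for Feature 19001 in this article's case study.
print(is_neologism_candidate([0.61, 0.58, 0.55]))  # True: all three labels fail

# Hypothetical feature where one researcher finds a good label.
print(is_neologism_candidate([0.61, 0.93, 0.55]))  # False: one label succeeds
```

Requiring every attempt to fail is a conservative choice: a single good label from any one researcher is enough to show that human vocabulary already covers the concept.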
Kim presents Feature 19001 as a detailed case study of an unnameable concept. Three interpretability researchers spent 4 hours each trying to label it. Here's what happened:
```python
# Case study
# Feature 19001: Top-10 activating texts
#   1. "The meeting ended abruptly when she brought up the budget"
#   2. "He paused mid-sentence, reconsidering his words"
#   3. "The conversation shifted to something unexpected"
#   4. "She changed the subject after a brief silence"
#   5. "The mood in the room suddenly became tense"
#   6. "His expression changed when she mentioned the project"
#   7. "There was an awkward pause before anyone spoke"
#   8. "The dynamic between them shifted perceptibly"
#   9. "Something unspoken passed between the two colleagues"
#  10. "The negotiation took an unexpected turn"
#
# Researcher A: "social tension moments"
#   Simulation score: 0.61 (too narrow: the feature also fires on non-tense moments)
#
# Researcher B: "conversational pivot points"
#   Simulation score: 0.58 (too narrow: misses non-verbal examples)
#
# Researcher C: "interpersonal state transitions"
#   Simulation score: 0.55 (too abstract: predicts many false positives)
#
# All three labels hover near chance (0.50)
# The feature is real (high inter-annotator agreement on what activates it)
# But no human label captures the pattern
```
Not all unnameable features are unnameable in the same way. Kim identifies three subtypes:
| Subtype | Why It Resists Naming | Example | % of Unnameable |
|---|---|---|---|
| Cross-categorical | The feature spans multiple human concept categories simultaneously | Activates for "authority signals" — combines formal language, declarative structure, and institutional context into one feature | ~45% |
| Sub-categorical | The feature is a precise subdivision of a human concept that we never split | "Sarcasm-subset-3": activates for sarcasm that uses understatement but NOT for sarcasm that uses exaggeration | ~35% |
| Alien | The feature detects a pattern that humans don't perceive at all | Activates for texts that are about to undergo a specific statistical transition in token probability distribution | ~20% |
Cross-categorical features are the most common: they carve the world differently from human categories, combining things we separate or separating things we combine. Sub-categorical features are finer-grained versions of existing concepts. Alien features are the most fascinating — they detect patterns that may be genuinely imperceptible to humans.
```python
# Examples of each subtype

# Cross-categorical: Feature 8842
#   Activates for: formal requests, passive-aggressive emails,
#                  diplomatic statements, corporate apologies
#   Human would separate: formality, passive-aggression, diplomacy
#   The AI sees ONE pattern: "surface politeness encoding constraint"

# Sub-categorical: Feature 5510
#   Activates for: sarcasm using litotes ("not entirely wrong")
#   Does NOT activate for: sarcasm using hyperbole ("oh GREAT")
#   Humans call both "sarcasm"; the AI distinguishes them

# Alien: Feature 19001
#   Activates for: texts where an interpersonal dynamic shifts
#   Humans can recognize it (high inter-annotator agreement)
#   But cannot articulate the pattern (simulation score ~0.50)
```
Multimodal models (like CLIP) learn concepts that span vision and language simultaneously. Some of these cross-modal concepts have no human equivalent because humans process vision and language in separate brain regions. A CLIP feature might encode "the visual-linguistic pattern of authority" — a concept that combines visual cues (posture, framing) with linguistic cues (formal vocabulary, declarative structure) in a way humans never unified into a single concept.
```python
# Cross-modal concepts in CLIP: evidence
#
# Feature 7291 in CLIP ViT-L/14 activates for:
#
# VISUAL inputs:
#   - Images with centered subject, low angle, warm lighting
#   - Portraits of people in professional attire
#   - Images of podiums, stages, official settings
#
# TEXT inputs:
#   - "The CEO addressed the shareholders"
#   - "According to leading experts"
#   - "The official statement reads..."
#
# Humans have SEPARATE concepts for visual and textual authority:
#   Photography: "low angle shot" (makes subject look powerful)
#   Rhetoric: "appeal to authority" (linguistic persuasion)
#
# CLIP has ONE concept that spans both modalities.
# This cross-modal "authority" concept has no human name.
# It's not "authority" (too general); it's specifically the
# SENSORY SIGNATURE of authority across vision and language.
```
If human vocabulary is insufficient, we need to create new words. Kim proposes a framework for creating neologisms — new words specifically designed to name AI concepts.
| Property | Why It Matters | Example |
|---|---|---|
| Pronounceable | Must be spoken in discussions, not just written | "voltex" not "x7f2a" |
| Memorable | Researchers must recall and use it | Short, distinctive sound pattern |
| Precise | Must map to exactly one concept | One neologism per feature, no ambiguity |
| Compositional | Can combine with existing words | "voltex-sensitive", "high voltex" |
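The "Pronounceable" row can be approximated with a crude automatic filter. The heuristic below is an illustrative assumption, not part of Kim's framework: it accepts only alphabetic strings that contain a vowel and have no long consonant clusters, and `looks_pronounceable` and its cutoff of three consonants are hypothetical choices.

```python
import re

VOWELS = set("aeiouy")


def looks_pronounceable(word, max_consonant_run=3):
    """Crude English-centric check: letters only, at least one vowel,
    and no run of consonants longer than `max_consonant_run`."""
    word = word.lower()
    if not word.isalpha():
        return False  # rejects identifiers like "x7f2a"
    if not any(ch in VOWELS for ch in word):
        return False
    # Find maximal runs of non-vowel letters and measure the longest one.
    consonant_runs = re.findall(r"[^aeiouy]+", word)
    longest_run = max((len(run) for run in consonant_runs), default=0)
    return longest_run <= max_consonant_run


for candidate in ["voltex", "x7f2a", "reversk", "strngth"]:
    print(candidate, looks_pronounceable(candidate))
```

A filter like this only screens out obviously unusable strings. Memorability and precision still need human judgment or a separate evaluation.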
Kim proposes several strategies:
Strategy 1: Morphological combination. Combine existing morphemes to create a new word that hints at the concept. Example: "contraflow" for the "surface-positive-deep-negative" feature (contra + flow). This is how most natural neologisms form ("smartphone" = smart + phone).
Strategy 2: Semantic compression. Use an LLM to compress a multi-sentence description into a single novel word. The LLM can be trained or prompted to generate pronounceable neologisms that are semantically evocative.
Strategy 3: Arbitrary assignment. Assign a short arbitrary label (like chemical element symbols). Less intuitive but avoids misleading connotations. "Feature V7" is less biased than "contraflow" (which might wrongly suggest the feature is about rivers).
```python
# Generating neologisms with an LLM
prompt = """You are creating new English words for AI concepts.
Given a description of a neural network feature, generate 3 candidate
neologisms that are:
- Pronounceable (1-3 syllables)
- Memorable
- Evocative of the concept without being misleading

Feature description: "Activates when a sentence describes a surface-level
positive outcome that conceals or produces a deeper negative effect.
Examples: backfire effects, pyrrhic victories, deceptive improvements."

Candidate neologisms:"""

# LLM output:
#   1. "reversk" (reverse + mask): the hidden reversal
#   2. "velix" (veil + paradox): the veiled paradox
#   3. "contravene" (contra + veneer): against the surface
```
Kim identifies a fundamental tradeoff in naming AI concepts: fidelity (how accurately the name captures the concept) vs. familiarity (how easily a human can understand and use the name).
At one extreme, you can use a familiar name like "sarcasm" — easy to understand but inaccurate (the feature doesn't exactly match human sarcasm). At the other extreme, you can use a precise description like "surface-positive-deep-negative reversal in consequentialist contexts" — accurate but unusable in conversation.
| Name | Fidelity | Familiarity | Product |
|---|---|---|---|
| "sarcasm" | 0.4 (too broad) | 1.0 (everyone knows it) | 0.40 |
| "sarcasm-adjacent reversal pattern" | 0.7 | 0.5 | 0.35 |
| "contraflow" (neologism) | 0.85 | 0.7 (after learning) | 0.60 |
| "surface-positive-deep-negative..." | 0.95 | 0.2 (a sentence) | 0.19 |
The neologism "contraflow" wins because it achieves high fidelity (precisely naming the concept) with reasonable familiarity (learnable, memorable, usable).
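The ranking in the table can be reproduced in a few lines. This sketch takes the fidelity and familiarity values directly from the table and assumes, as the "Product" column does, that the two are combined by simple multiplication; other combination rules would be equally defensible.

```python
# (fidelity, familiarity) pairs copied from the table above.
candidates = {
    "sarcasm": (0.40, 1.0),
    "sarcasm-adjacent reversal pattern": (0.70, 0.5),
    "contraflow": (0.85, 0.7),
    "surface-positive-deep-negative...": (0.95, 0.2),
}

# Combined score: product of fidelity and familiarity.
scores = {
    name: fidelity * familiarity
    for name, (fidelity, familiarity) in candidates.items()
}

# The name with the highest combined score.
best = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:36s} {score:.2f}")
print("best:", best)
```

Because the score is a product, a name that is near zero on either axis cannot win, which is why both the familiar-but-wrong label and the precise-but-unusable description lose to the neologism.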
Kim draws an analogy to scientific terminology throughout history. When scientists discover genuinely new phenomena, they create new words:
| Neologism | Year | Why Existing Words Failed |
|---|---|---|
| "Oxygen" | 1777 | "Dephlogisticated air" was a description, not a concept name |
| "Quark" | 1964 | No existing word for a fundamental particle with fractional charge |
| "Gene" | 1909 | "Unit of heredity" was a description — "gene" made it a concept |
| "Meme" | 1976 | "Culturally transmitted idea unit" — Dawkins needed one word |
| "Doomscrolling" | 2020 | "Compulsively reading negative news on phone" — too long |
In each case, the neologism didn't create the phenomenon — it already existed. The word made it thinkable, communicable, and researchable. AI concepts need the same treatment.
```python
# The efficiency of naming

# Without neologism: 22 words to ask the question
#   "Did you find any features that activate when the text
#    describes a surface-level positive outcome that actually
#    produces a deeper negative effect?"
#
# With neologism: 6 words
#   "Did you find any reversk features?"
#
# Over 100 research discussions, this saves ~1,600 words
# More importantly: it prevents misunderstanding
# "surface-level positive" might be interpreted differently
# by different researchers. "reversk" is unambiguous.
```
A common mistake: labeling an AI feature with a familiar but imprecise human word. Calling a feature "sarcasm" when it's not exactly sarcasm creates misleading familiarity. Researchers think they understand the feature ("oh, it detects sarcasm") but their mental model is wrong. This is worse than having no label at all — it creates false confidence.
```python
# The misleading familiarity trap
#
# Researcher labels Feature 7823 as "sarcasm"
#   → Team assumes it detects sarcasm
#   → They test: "Oh great, another meeting" → activates? YES ✓
#   → They conclude the feature works for sarcasm detection
#   → But it ALSO activates for non-sarcastic backfire effects:
#       "The medication helped initially but worsened symptoms"
#   → This is NOT sarcasm; it's a genuine medical observation
#   → The "sarcasm" label led to incorrect conclusions
#
# A neologism like "reversk" would avoid this trap
# No one would assume "reversk" means "sarcasm"
```
This paper is part of a series of related papers by Been Kim and collaborators, each addressing a different aspect of the human-AI concept gap.
| Paper | Question | Answer |
|---|---|---|
| This paper | Can we understand AI with existing words? | No — we need neologisms |
| Neologism Learning | Can AI create and use its own new words? | Yes — through neologism training |
| Agentic Interpretability | Can we automate interpretability? | Yes — with LLM agents |
| AlphaZero Concepts | Can AI concepts transfer to humans? | Yes — and they improve human performance |
Kim estimates the scale of the vocabulary gap based on current SAE research:
```python
# Estimating the neologism need

# GPT-2 Small: ~16K SAE features
#   - Named:       ~6,400 (40%)
#   - Describable: ~5,600 (35%)
#   - Unnameable:  ~4,000 (25%) → need 4,000 neologisms

# Claude/GPT-4 (estimated): ~10M SAE features
#   - Named:       ~4,000,000 (40%)
#   - Describable: ~3,500,000 (35%)
#   - Unnameable:  ~2,500,000 (25%) → need 2.5M neologisms!

# This is obviously impractical for manual creation
# Automated neologism generation (via LLM agents) is necessary
# See: Neologism Learning (Kim et al., 2025)
```
2.5 million new words sounds absurd: English has ~170,000 words in current use and roughly 250,000 in total once obsolete words are counted (Oxford English Dictionary figures), so the gap would be about 15 times the working vocabulary of English. Still, scientific terminology already adds thousands of new words per year. The AI vocabulary challenge is qualitatively similar, just at a much larger scale.
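The arithmetic behind these estimates is simple enough to check directly. The sketch below reuses the article's own figures (25% unnameable, ~16K and ~10M features, ~170,000 English words in current use); the function name is hypothetical, and applying the same 25% share at every model scale is an assumption of the estimate, not an established result.

```python
UNNAMEABLE_FRACTION = 0.25        # share of features with no usable human label
ENGLISH_WORDS_IN_USE = 170_000    # approximate figure quoted in the text


def neologisms_needed(num_features, unnameable_fraction=UNNAMEABLE_FRACTION):
    """Estimated number of new words if each unnameable feature gets one name."""
    return round(num_features * unnameable_fraction)


feature_counts = {
    "GPT-2 Small": 16_000,
    "Claude/GPT-4 (estimated)": 10_000_000,
}

for model, num_features in feature_counts.items():
    needed = neologisms_needed(num_features)
    ratio = needed / ENGLISH_WORDS_IN_USE
    print(f"{model}: {needed:,} neologisms ({ratio:.1f}x English words in use)")
```

Even the small-model estimate is only a few percent of the current English vocabulary, while the large-model estimate exceeds it by an order of magnitude, which is the argument for automated generation.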
If we can't name AI concepts, we can't audit them. Safety-critical properties like "deceptiveness" might be inadequately captured by human vocabulary. An AI might have a feature that's related to deception but not identical — and calling it "deception" would be misleading familiarity. Neologisms could enable more precise safety auditing.
"The limits of my language mean the limits of my world." — Ludwig Wittgenstein. The limits of our vocabulary mean the limits of our AI understanding.