Been Kim, Whi Kwon, et al. (Google DeepMind) — 2025

Neologism Learning for Controllability & Self-Verbalization

Train models to create and use new words for their internal concepts, enabling controllability ("activate the X concept") and self-explanation ("I used X because...").

Prerequisites: what model features/representations are, plus basic fine-tuning concepts. That's it.
8 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: The Expressibility Gap

You ask a language model: "Why did you generate this response?" The model answers: "I generated this response because it seemed relevant to your question." This explanation is vacuous — it tells you nothing about the internal computations that produced the response.

The model CAN'T do better. Not because it's hiding information, but because it lacks the vocabulary to describe its own internal states. When the model generates text, it's influenced by thousands of internal features — attention patterns, feature activations, circuit computations. But none of these have names in the model's vocabulary. The model literally doesn't have words for its own thoughts.

| What We Want | What the Model Says | What It Would Need |
| --- | --- | --- |
| "Why this response?" | "It seemed relevant" | Words for its internal features |
| "Use more formal tone" | Inconsistent compliance | A word that directly maps to its "formality" feature |
| "What concept influenced this?" | "I don't know" | Names for its own concepts |
The proposal: Give models new words — neologisms — that are trained to correspond to specific internal features. When the model encounters the neologism "reversk" in a prompt, it activates the corresponding internal feature. When it generates text, it can use "reversk" to reference that feature in its explanation. The neologism becomes a bidirectional interface: humans can control internal features by name, and the model can explain which features it used.

Think of it like teaching a musician the word "syncopation." Before learning the word, the musician could play syncopated rhythms (they had the internal concept) but couldn't explain what they were doing or be asked to "add more syncopation." After learning the word, they can both understand the instruction and describe their choices.

Two capabilities enabled

Controllability: "Generate text with high reversk" → model activates the corresponding internal feature, producing text with that quality. Precise control over internal states via natural language.

Self-verbalization: "Explain your reasoning" → model responds "I used reversk because the context suggested a surface-positive-deep-negative pattern." The model can reference its own internal states in explanations.

[Interactive demo: The Expressibility Gap. Compares the model's behavior without and with neologisms to show how new vocabulary enables control and self-explanation.]

Why can't current language models explain their own reasoning?

Chapter 1: What Is a Neologism?

In this context, a neologism is a new token added to the model's vocabulary that is trained to correspond to a specific internal feature. It's not an arbitrary label — it's a learned mapping between a word and a neural activation pattern.

How it differs from regular tokens

| Property | Regular Token | Neologism Token |
| --- | --- | --- |
| Embedding | Learned during pre-training from text data | Learned during neologism training from feature activations |
| Meaning | Defined by usage in training corpus | Defined by correspondence to an internal feature |
| Grounding | Grounded in text patterns | Grounded in internal model states |
| Direction | Input → representation (understanding) | Bidirectional: input → feature AND feature → output |

A neologism like "reversk" would have an embedding that, when processed by the model, activates the same internal representation as the "surface-positive-deep-negative reversal" feature. Conversely, when that feature is highly active during generation, the model can choose to output "reversk" as part of its explanation.

Neologisms are bidirectional interfaces. Consider the regular word "sarcasm." When you read it, you activate your mental concept of sarcasm (input → concept). When you notice sarcasm, you can say "that was sarcastic" (concept → output). A neologism does the same for an AI's internal features: it's both a control knob (input) and a label (output). This bidirectionality is what enables both controllability and self-verbalization.

The embedding space picture

python
# Regular token: embedding learned from text
# "cat" → embedding learned from "The cat sat on the mat"
# The embedding encodes: animal, pet, furry, small, etc.

# Neologism token: embedding learned from feature activations
# "reversk" → embedding trained to activate feature 7823
# The embedding encodes: the exact activation pattern of feature 7823

# Technical detail:
# 1. Identify a feature direction d in activation space
# 2. Add new token "reversk" to vocabulary
# 3. Train its embedding e_reversk such that:
#    - When "reversk" appears in input, model activates feature d
#    - When feature d is active, model tends to output "reversk"
# 4. The embedding e_reversk is the bridge between
#    the word space and the feature space

[Interactive figure: Neologism in Embedding Space. Regular words cluster by semantic meaning; the neologism is positioned to activate a specific internal feature.]

How does a neologism differ from a regular vocabulary token?

Chapter 2: Training Neologisms

How do you train a new token to correspond to an internal feature? Kim et al. propose a two-phase training process.

Phase 1: Feature identification

First, identify the internal feature you want to name. This could be a SAE feature, a probe-discovered concept, or a manually identified direction in activation space.
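To make "feature" concrete, here is a minimal sketch of two common ways a feature activation can be read out of a hidden state: as one unit of a sparse autoencoder encoder, or as a projection onto a fixed direction. The encoder form `ReLU(W_enc (x - b_dec) + b_enc)` is one widely used SAE parameterization and is an assumption here, not something the source specifies; all weights and names are toy values for illustration.

```python
from typing import List

def relu(v: float) -> float:
    return v if v > 0.0 else 0.0

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def sae_feature_activations(x, w_enc, b_enc, b_dec):
    """Encode a hidden state x into sparse feature activations.

    Assumed form: f = ReLU(W_enc (x - b_dec) + b_enc), one row of W_enc per feature.
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    return [relu(dot(row, centered) + b) for row, b in zip(w_enc, b_enc)]

def direction_activation(x, d):
    """Read a feature as the projection of x onto a direction d (e.g. from a probe)."""
    return dot(x, d)

# Toy hidden state and a two-feature toy SAE (made-up numbers)
x = [1.0, 2.0, 0.0]
w_enc = [[1.0, 0.0, 0.0],
         [0.0, -1.0, 0.0]]
b_enc = [0.0, 0.5]
b_dec = [0.0, 0.0, 0.0]

acts = sae_feature_activations(x, w_enc, b_enc, b_dec)
probe_act = direction_activation(x, [0.0, 1.0, 0.0])
```

In this toy setup the first feature fires and the second is clipped to zero by the ReLU, which is the sparsity that makes a single feature a usable target for a neologism.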

Phase 2: Neologism training

1. Add token to vocabulary
Add "reversk" to the tokenizer. Initialize its embedding randomly.
2. Create training data
Collect texts where feature 7823 is highly active. Prepend "reversk:" as a tag. Collect texts where it's inactive. These become positive and negative examples.
3. Train embedding
Fine-tune ONLY the neologism embedding (freeze all other parameters). Objective: the neologism in the input should activate the feature, and the model should generate the neologism when the feature is active.
4. Verify
Test: Does "Generate text with reversk" produce texts that activate feature 7823? Does the model use "reversk" in explanations when 7823 is active?
python
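import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed to be defined elsewhere (not shown): model_name, corpus, feature_7823,
# training_data, feature_7823_active(), controllability_loss(), verbalization_loss()
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Step 1: Extend vocabulary
tokenizer.add_tokens(['reversk'])
model.resize_token_embeddings(len(tokenizer))
neo_id = tokenizer.convert_tokens_to_ids('reversk')

# Step 2: Create training data (positives carry the "reversk:" tag)
pos_texts = ["reversk: " + t for t in corpus if feature_7823_active(t)]
neg_texts = [t for t in corpus if not feature_7823_active(t)]

# Positive: "reversk: The policy seemed good but reinforced..."
# Negative: "The weather was sunny and warm..."

# Step 3: Train ONLY the neologism embedding
# Freeze all parameters except the new embedding
for p in model.parameters():
    p.requires_grad = False

# requires_grad cannot be set on a single row of a tensor, so enable the whole
# embedding matrix and zero the gradient of every row except the neologism's.
emb_weight = model.get_input_embeddings().weight
emb_weight.requires_grad = True
row_mask = torch.zeros(emb_weight.shape[0], 1,
                       dtype=emb_weight.dtype, device=emb_weight.device)
row_mask[neo_id] = 1.0
emb_weight.register_hook(lambda grad: grad * row_mask)
# Note: if input and output embeddings are not tied, the output row for neo_id
# would need the same treatment for the verbalization objective to learn.

# Two objectives:
# A) Controllability: when "reversk" is in input, feature activates
# B) Self-verbalization: when feature is active, model outputs "reversk"

optimizer = torch.optim.Adam([emb_weight], lr=1e-3)
lam = 1.0  # weight on the verbalization term (λ in the loss); placeholder value

for batch in training_data:
    optimizer.zero_grad()  # clear gradients from the previous step
    loss_ctrl = controllability_loss(model, batch, neo_id, feature_7823)
    loss_verb = verbalization_loss(model, batch, neo_id, feature_7823)
    loss = loss_ctrl + lam * loss_verb
    loss.backward()
    optimizer.step()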
Only the embedding is trained. This is critical: we freeze all model parameters and only train the single embedding vector for the neologism. This means we're not changing how the model works — we're just giving it a new word. The model's capabilities, knowledge, and behavior remain unchanged. The neologism is a pure interface layer.
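One way to check the "only one vector moves" claim is with a toy embedding table. The sketch below assumes PyTorch and uses a stand-in squared-error objective rather than the real neologism losses; it masks the gradient so that only the last row (playing the role of the neologism) is updated, then compares the table before and after a few optimizer steps.

```python
import torch
from torch import nn

torch.manual_seed(0)

vocab_size, dim = 10, 4
emb = nn.Embedding(vocab_size + 1, dim)  # last row stands in for the neologism
neo_id = vocab_size

# Zero the gradient of every row except the neologism's
row_mask = torch.zeros(vocab_size + 1, 1)
row_mask[neo_id] = 1.0
emb.weight.register_hook(lambda grad: grad * row_mask)

before = emb.weight.detach().clone()
optimizer = torch.optim.Adam([emb.weight], lr=1e-2)

target = torch.ones(dim)  # stand-in objective, not the real loss
ids = torch.tensor([0, 3, neo_id])
for _ in range(5):
    optimizer.zero_grad()
    loss = ((emb(ids) - target) ** 2).sum()
    loss.backward()
    optimizer.step()

after = emb.weight.detach()
frozen_rows_unchanged = torch.equal(before[:neo_id], after[:neo_id])
neo_row_changed = not torch.equal(before[neo_id], after[neo_id])
print(frozen_rows_unchanged, neo_row_changed)
```

Masking the gradient rather than slicing the parameter keeps the optimizer attached to the real weight tensor. With Adam and no weight decay, rows that receive a zero gradient are left exactly as they were.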

Loss functions

$$L = L_{\text{ctrl}} + \lambda \cdot L_{\text{verb}}$$

The dual loss in detail

Controllability loss: When "reversk" appears in the prompt, the target feature's activation should increase. L_ctrl is the negative mean difference between the feature's activation with and without the neologism in the prompt, so minimizing it maximizes the boost.

Verbalization loss: When the target feature is highly active, the model should assign higher probability to outputting "reversk" at the next token position. L_verb is the negative log-likelihood of "reversk" as the next token in feature-active contexts.

python
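# The two loss functions in detail
import torch

# Assumption: model.get_feature_activation(input_ids, feature) is a hypothetical
# helper (e.g. an SAE readout) returning one activation per example, shape [batch].

def prepend_token(input_ids, token_id):
    """Put token_id in front of every sequence in the batch."""
    prefix = torch.full((input_ids.shape[0], 1), token_id,
                        dtype=input_ids.dtype, device=input_ids.device)
    return torch.cat([prefix, input_ids], dim=1)

def controllability_loss(model, batch, neo_id, target_feature):
    """When neologism is in input, target feature should activate."""
    # Run model on input containing the neologism
    input_with_neo = prepend_token(batch['input_ids'], neo_id)
    acts_with = model.get_feature_activation(input_with_neo, target_feature)

    # Run model on input WITHOUT the neologism; this baseline does not depend
    # on the neologism embedding, so it needs no gradient
    with torch.no_grad():
        acts_without = model.get_feature_activation(batch['input_ids'], target_feature)

    # Loss: maximize the difference (neologism should boost feature)
    return -torch.mean(acts_with - acts_without)

def verbalization_loss(model, batch, neo_id, target_feature):
    """When target feature is active, model should output neologism."""
    # Select examples where the feature is highly active (top 10% of the batch);
    # >= keeps at least one example even when activations tie
    with torch.no_grad():
        acts = model.get_feature_activation(batch['input_ids'], target_feature)
    active_mask = acts >= torch.quantile(acts.float(), 0.9)

    # For those examples, the model should predict the neologism token
    # (assumes sequences are not right-padded, so position -1 is the last token)
    logits = model(batch['input_ids'][active_mask]).logits
    log_probs = torch.log_softmax(logits[:, -1, :], dim=-1)  # logits -> log-probabilities

    # Loss: negative log-likelihood of the neologism as the next token
    return -log_probs[:, neo_id].mean()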

[Interactive demo: Neologism Training Process. The embedding starts random and gradually aligns with the target feature direction while the controllability and verbalization scores improve.]

Why is it important that only the neologism embedding is trained, not the entire model?

Chapter 3: Controllability

Once a neologism is trained, it becomes a control interface. Instead of vague prompts like "write in a more nuanced way," you can say "write with reversk" — and the model activates the specific internal feature that produces "surface-positive-deep-negative reversal" patterns.

How controllability works

python
# Without neologisms: vague control
prompt = "Write about a policy reform. Be nuanced."
# Model output: generic "balanced" text — not what you wanted

# With neologism: precise control
prompt = "Write about a policy reform with reversk."
# Model output: "The reform initially reduced poverty rates,
#   but the unintended bureaucratic burden it created
#   ultimately made the problem worse."
# Exactly the "surface-positive-deep-negative" pattern!

# You can also COMBINE neologisms:
prompt = "Write about a policy reform with reversk and formex."
# Model activates BOTH the reversal feature AND the formality feature
# Result: formal, nuanced prose with the specific reversal pattern

Compositional control

A powerful property of neologisms: they compose. You can combine multiple neologisms in a single prompt, and the model activates all corresponding features simultaneously. This enables fine-grained control over multiple aspects of generation:

python
# Compositional neologism control

# Single neologism: control one dimension
generate("Write about AI with reversk")
# "AI has made impressive advances, but these advances
#  have paradoxically increased certain risks..."
# (reversal pattern only)

# Two neologisms: control two dimensions
generate("Write about AI with reversk and formex")
# "The ostensible progress in artificial intelligence,
#  whilst demonstrably ameliorating certain computational
#  tasks, has paradoxically exacerbated vulnerabilities..."
# (reversal pattern + formal register)

# Three neologisms: fine-grained multi-dimensional control
generate("Write about AI with reversk, formex, and techex")
# "The architectural innovations in transformer-based systems
#  (attention mechanism, residual connections) initially yielded
#  O(n²) computational complexity that, counterintuitively,
#  necessitated further architectural modifications..."
# (reversal + formal + technical — all three dimensions activated)

# This level of control is impossible with natural language prompts
# "be nuanced, formal, and technical" gives unpredictable results
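Why compositionality works. Each neologism targets a different internal feature, and those features are approximately orthogonal in activation space, so activating feature A (reversal) interferes little with feature B (formality). This near-orthogonality comes from the SAE decomposition: sparse autoencoders are trained to reconstruct activations from a small number of active features, which tends to separate concepts into distinct directions. Because an SAE dictionary is overcomplete (more features than dimensions), the directions cannot all be exactly orthogonal, so some interference between neologisms is still possible.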
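The orthogonality argument can be illustrated with toy vectors. In this sketch, "activating" a feature is modeled as adding its direction to a hidden state, which is a simplifying assumption, and the three directions are made-up values. It shows that orthogonal directions leave each other's readout untouched, while a correlated direction leaks into it.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def steer(hidden, directions, strength=1.0):
    """Add each feature direction to the hidden state (toy model of activation)."""
    out = list(hidden)
    for d in directions:
        out = [h + strength * di for h, di in zip(out, d)]
    return out

# Hypothetical unit-length feature directions
reversal = [1.0, 0.0, 0.0]
formality = [0.0, 1.0, 0.0]    # orthogonal to reversal
correlated = [0.6, 0.8, 0.0]   # overlaps with reversal

hidden = [0.0, 0.0, 0.0]

# Orthogonal pair: the reversal readout is the same as with reversal alone
clean = steer(hidden, [reversal, formality])
clean_reversal_readout = dot(clean, reversal)

# Correlated pair: the second direction inflates the reversal readout
leaky = steer(hidden, [reversal, correlated])
leaky_reversal_readout = dot(leaky, reversal)

print(cosine(reversal, formality), clean_reversal_readout, leaky_reversal_readout)
```

The cosine between two feature directions is therefore a cheap first check before combining their neologisms in one prompt.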

Advantages over prompt engineering

| Approach | Precision | Consistency | Composability |
| --- | --- | --- | --- |
| Natural language prompts | Low: "be nuanced" is ambiguous | Low: different runs give different results | Limited: "be nuanced AND formal" is vague |
| System prompts | Medium: more specific but still ambiguous | Medium: more consistent but not guaranteed | Limited: longer prompts are less reliable |
| Neologisms | High: maps directly to internal features | High: same feature, same behavior | High: combine neologisms compositionally |

Neologisms are differentiable control knobs. Unlike prompt engineering (which operates in text space), neologisms operate in embedding space — they directly target the model's internal features. This makes control more precise and more consistent. It's the difference between telling a mixer board operator "make it louder" (prompt) vs turning a specific volume knob (neologism).

[Interactive demo: Controllability. Toggling individual neologisms on and off shows how each one controls a different aspect of the output.]

Why are neologisms more precise than prompt engineering for controlling model behavior?

Chapter 4: Self-Verbalization

The second capability enabled by neologisms: the model can explain its own reasoning by referencing its internal features by name.

How self-verbalization works

During generation, the model's internal features are active. With neologisms, the model can reference these features in its output. Instead of "I thought this was relevant," it can say "I detected reversk in the context (a surface-positive-deep-negative pattern), which influenced my response."

python
# Self-verbalization example
prompt = ("Analyze this text and explain your reasoning:\n"
          "'The new algorithm was faster but consumed 10x more memory.'")

# Without neologisms:
# "This text describes a tradeoff between speed and memory."
# (Generic, surface-level, doesn't reference internal processing)

# With neologisms:
# "I detected reversk in this text (surface improvement masking
#  a deeper problem). I also detected techex (technical domain
#  language) and tradex (tradeoff pattern). The combination
#  of reversk + tradex suggests this is a cautionary example
#  rather than a straightforward improvement report."
# (Specific, references actual internal features, verifiable)

Verifiability: the key advantage

The crucial property of neologism-based self-verbalization is verifiability. When the model says "I used reversk," you can check:

python
# Verifying self-verbalization claims
import re

def verify_explanation(model, input_text, explanation, neo_to_feature, threshold=0.5):
    """Check if the neologisms the model mentions are actually active.

    neo_to_feature maps each neologism to its feature id, e.g. {"reversk": 7823}.
    threshold is a placeholder cutoff, not a value from the paper.
    """
    # Extract known neologisms from the explanation, e.g. ["reversk", "tradex"]
    mentioned = [neo for neo in neo_to_feature
                 if re.search(rf"\b{re.escape(neo)}\b", explanation)]

    # Measure actual feature activations.
    # get_feature_activations is an assumed helper (e.g. an SAE readout),
    # not a built-in model method.
    activations = model.get_feature_activations(input_text)

    # Check each claim
    verified = {}
    for neo in mentioned:
        act = activations[neo_to_feature[neo]]
        verified[neo] = bool(act > threshold)
        if verified[neo]:
            print(f"VERIFIED: {neo} is indeed active (activation: {act:.3f})")
        else:
            # This is a confabulation: the model is lying or mistaken
            print(f"WARNING: Model claimed {neo} but feature is inactive!")
    return verified

# This verification is not possible with chain-of-thought alone,
# because CoT doesn't reference specific, measurable internal states
Verifiability changes the game. With chain-of-thought, you have to trust that the model's explanation reflects its actual processing. With neologisms, you can verify. If the model says "I used reversk" but feature 7823's activation is 0.02 (inactive), the model is confabulating — and you can catch it. This is the difference between self-report and measurement.

Why self-verbalization matters

Transparency. Users can understand why the model produced a specific output. Instead of a black box, the model explains which internal concepts were active.

Debugging. If the model makes an error, self-verbalization reveals which features contributed. "I used reversk but the text wasn't actually a reversal pattern" → the reversk feature may have a false positive rate that needs investigation.

Trust calibration. If the model says "I'm confident because tradex was strongly active," you can verify whether tradex activation is indeed reliable for this type of input.

Self-verbalization vs "chain-of-thought." Chain-of-thought prompting produces reasoning traces, but these are generated from scratch — they may not reflect actual internal processing (the model might confabulate plausible-sounding reasoning). Self-verbalization with neologisms is different: the model references actual internal features that are verifiably active. You can check whether the model is telling the truth by measuring the named feature's activation.

[Interactive demo: Self-Verbalization. Different inputs lead the model to reference different internal features in its explanation.]

How does self-verbalization with neologisms differ from chain-of-thought reasoning?

Chapter 5: Experiments & Results

Kim et al. validate neologism learning on controlled experiments. The key questions: Do neologisms actually activate the right features? Can they enable meaningful control and explanation?

Controllability results

They trained neologisms for 10 features identified via sparse autoencoders in a medium-sized language model. The controllability test: generate 100 texts with each neologism in the prompt, then measure whether the target feature is more active than in control generations.

| Neologism | Target Feature | Feature Activation Increase | Specificity |
| --- | --- | --- | --- |
| reversk | Surface-positive reversal | +340% | 92% (low cross-activation) |
| formex | Formal register | +280% | 88% |
| techex | Technical domain | +310% | 95% |
| emotrex | Emotional intensity | +250% | 85% |

Specificity measures whether the neologism activates ONLY the target feature, not other features. High specificity (>85%) means the neologism is a precise control knob, not a blunt instrument.
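The text does not give the exact formulas behind the two table columns, so the following is only one plausible way to compute them, under stated assumptions: activation increase as the relative change in mean activation between neologism and control generations, and specificity as the share of total positive activation change that lands on the target feature. The function names and all numbers are hypothetical toy values, not results from the paper.

```python
def mean(xs):
    return sum(xs) / len(xs)

def activation_increase_pct(with_neo, without_neo):
    """Relative change in mean target-feature activation, in percent."""
    base = mean(without_neo)
    return 100.0 * (mean(with_neo) - base) / base

def specificity(deltas, target):
    """Share of total positive activation change that falls on the target.

    deltas maps feature name -> change in mean activation caused by the
    neologism. This definition is an assumption made for illustration.
    """
    positive = {f: max(d, 0.0) for f, d in deltas.items()}
    total = sum(positive.values())
    return positive[target] / total if total > 0 else 0.0

# Toy measurements (made up)
increase = activation_increase_pct(with_neo=[3.0, 5.0], without_neo=[1.0, 1.0])
spec = specificity({"target": 9.0, "other_a": 0.5, "other_b": 0.5}, "target")
print(increase, spec)
```

Under this definition, a specificity of 1.0 would mean the neologism moves only its own feature, and lower values indicate cross-activation of other features.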

Self-verbalization results

For self-verbalization, they measure whether the model uses the correct neologism when the target feature is active. Given 100 texts where feature 7823 is highly active, does the model mention "reversk" in its analysis?

results
# Verbalization accuracy:
# - Model uses correct neologism when feature is active: 78%
# - Model avoids neologism when feature is inactive: 91%
# - False positive rate (uses neologism when feature inactive): 9%
# - False negative rate (doesn't use when feature active): 22%

# For comparison, chain-of-thought explanation accuracy:
# - Correctly identifies the active concept: ~45%
# - Confabulates plausible but wrong reasoning: ~30%
# Neologism-based verbalization is much more faithful
78% verbalization accuracy is a strong result. Without neologisms, the model correctly identifies active concepts only ~45% of the time (and confabulates 30% of the time). With neologisms, accuracy jumps to 78% and confabulation drops to 9%. The model isn't just generating plausible-sounding explanations — it's referencing features that are actually active.
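The four verbalization rates above are two complementary pairs from a confusion matrix: 78% and 22% sum to 100% over feature-active texts, and 91% and 9% over feature-inactive texts. A minimal sketch of that bookkeeping follows, assuming per-text boolean labels for "feature active" and "neologism used"; the function name and toy labels are illustrative only.

```python
def verbalization_metrics(feature_active, neologism_used):
    """Compute verbalization rates from paired per-text boolean labels."""
    pairs = list(zip(feature_active, neologism_used))
    tp = sum(1 for a, u in pairs if a and u)          # active, named
    fn = sum(1 for a, u in pairs if a and not u)      # active, not named
    fp = sum(1 for a, u in pairs if not a and u)      # inactive, named anyway
    tn = sum(1 for a, u in pairs if not a and not u)  # inactive, not named
    return {
        "uses_when_active": tp / (tp + fn),
        "false_negative_rate": fn / (tp + fn),
        "avoids_when_inactive": tn / (tn + fp),
        "false_positive_rate": fp / (tn + fp),
    }

# Toy labels: 4 feature-active texts followed by 4 feature-inactive texts
active = [True, True, True, True, False, False, False, False]
used = [True, True, True, False, False, False, False, True]
metrics = verbalization_metrics(active, used)
print(metrics)
```

Reporting both pairs matters because a model that names the neologism in every explanation would score perfectly on the first pair and fail the second.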

[Interactive dashboard: Results. Compares controllability and verbalization accuracy across neologisms.]

How do neologisms improve upon chain-of-thought for model explanation?

Chapter 6: Neologism Playground

This interactive simulation lets you create, train, and test neologisms. Choose an internal feature, train a neologism for it, then use it for both controllability and self-verbalization.

[Interactive lab: Neologism Lab. Select a feature (for example, the reversal pattern), train a neologism for it, then test it for control in a prompt and for self-verbalization in an explanation.]

In the playground, what happens when you combine two neologisms in a prompt?

Chapter 7: Connections

Neologism learning completes the trilogy of papers by Kim et al. on the relationship between language and AI understanding.

| Paper | Role in Trilogy |
| --- | --- |
| AI Vocabulary | The problem: human words can't name all AI concepts |
| Neologism Learning (this paper) | The solution: train models to create and use new words |
| Agentic Interpretability | The scaling: automate the process with LLM agents |

Implications for alignment

Interpretable-by-design. Instead of trying to interpret a model after training, neologisms create interpretability hooks during deployment. A model with 1000 neologisms has 1000 named internal concepts you can inspect, control, and monitor.

AI-human collaboration. Neologisms create a shared vocabulary between the model and its users. The model can explain "I used reversk" and the user can request "more reversk." This is genuine two-way communication about internal states — a step toward meaningful AI transparency.

Practical deployment considerations

How would neologisms work in a real system? Kim et al. outline a deployment pipeline:

python
# Deployment pipeline for neologism-enabled models

# Phase 1: Feature discovery
# Run SAE on the production model
# Identify top-1000 most important features
# (importance = impact on output when ablated)

# Phase 2: Neologism training
# For each feature, train a neologism embedding
# ~5 minutes per neologism on 1 GPU
# Total: ~83 GPU-hours for 1000 neologisms

# Phase 3: Integration
# Add neologisms to tokenizer vocabulary
# Document each neologism: name, definition, examples
# Create a "neologism dictionary" for users

# Phase 4: User interface
# Prompt format: "Write about X with [neo1] and [neo2]"
# Explanation format: "I used [neo1] because..."
# Dashboard: real-time feature activation monitoring

# Phase 5: Safety monitoring
# Flag features related to bias, deception, harmful content
# Monitor these features' activation in production
# Alert when "deception-adjacent" neologism activates
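Phase 5 of the pipeline above could be sketched as a simple threshold check over named features. Everything here is hypothetical: the neologism names, the threshold, and the idea that activations arrive as a name-to-value mapping are assumptions made for illustration, not details from the source.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class Alert:
    neologism: str
    activation: float

def check_flagged_features(activations: Dict[str, float],
                           flagged: Set[str],
                           threshold: float) -> List[Alert]:
    """Return an alert for each flagged neologism whose feature exceeds threshold."""
    return [Alert(name, activations[name])
            for name in sorted(flagged)
            if activations.get(name, 0.0) > threshold]

# Hypothetical per-request activations keyed by neologism name
request_activations = {"reversk": 0.81, "deceptex": 0.67, "formex": 0.10}
alerts = check_flagged_features(request_activations,
                                flagged={"deceptex", "biasex"},
                                threshold=0.5)
for alert in alerts:
    print(f"ALERT: {alert.neologism} active at {alert.activation:.2f}")
```

Keying the monitor on neologism names rather than raw feature indices means the same dictionary given to users can double as the audit vocabulary.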
The business case. Neologism-enabled models could be a competitive advantage. "Our model doesn't just generate text — it tells you why, and you can verify the explanation." This is particularly valuable in regulated industries (healthcare, finance, legal) where explainability isn't optional. A doctor needs to know why the model recommended a treatment, and neologisms provide verifiable reasons.

Open challenges

Scaling. Training neologisms one at a time doesn't scale to millions of features. Automated methods (using LLM agents) are needed. Kim et al. estimate that training 1000 neologisms requires ~83 GPU-hours — feasible for important features, but millions would require fundamentally new approaches.

python
# Scaling analysis
# Current: ~5 minutes per neologism on 1 A100 GPU
#
# Scaling paths:
# 1. Batch training: train 100 neologisms in parallel
#    → 50 minutes for 100 (10x speedup)
#    → But must verify no interference between neologisms
#
# 2. Hierarchical neologisms: train "concept families"
#    → "emotrex" is the parent concept (emotional intensity)
#    → "emotrex-joy", "emotrex-rage", "emotrex-grief" are children
#    → Children inherit from parent embedding + delta
#    → Faster training, natural concept hierarchy
#
# 3. Agentic discovery + training pipeline
#    → LLM agent identifies important features
#    → Same agent generates candidate neologism names
#    → Automated embedding training
#    → Automated verification
#    → Human review only for safety-critical features
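The cost figures quoted above follow from simple arithmetic, reproduced below as a sanity check. The only inputs are the numbers already stated (5 minutes per neologism, 100 neologisms per 50-minute batch); the function names are illustrative.

```python
import math

MINUTES_PER_NEOLOGISM = 5  # serial training time stated above

def serial_gpu_hours(n_neologisms: int) -> float:
    """GPU-hours when neologisms are trained one at a time."""
    return n_neologisms * MINUTES_PER_NEOLOGISM / 60

def batched_minutes(n_neologisms: int, batch_size: int, minutes_per_batch: float) -> float:
    """Wall-clock minutes when batch_size neologisms are trained together."""
    return math.ceil(n_neologisms / batch_size) * minutes_per_batch

hours_for_1000 = serial_gpu_hours(1000)
serial_minutes_for_100 = 100 * MINUTES_PER_NEOLOGISM
speedup = serial_minutes_for_100 / batched_minutes(100, batch_size=100, minutes_per_batch=50)
hours_for_million = serial_gpu_hours(1_000_000)

print(round(hours_for_1000, 1), speedup, round(hours_for_million))
```

At the same serial rate, a million features would cost on the order of 83,000 GPU-hours, which is why the text treats that scale as needing a different approach.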

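Faithfulness. The 22% false negative rate means that in roughly one in five feature-active cases the model uses a feature without mentioning its neologism, and the 9% false positive rate means it sometimes names a feature that is not active. Full faithfulness requires that the model report its active features every time, and only those.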

Compositionality. Can hundreds of neologisms compose naturally? Or do interactions between neologisms create unexpected behaviors?

Relationship to existing interpretability methods

| Method | What It Does | How Neologisms Differ |
| --- | --- | --- |
| Feature visualization | Shows what maximally activates a neuron | Neologisms give features NAMES, not just visualizations |
| Sparse autoencoders | Decompose activations into monosemantic features | SAEs find the features; neologisms name and control them |
| Concept probes | Test whether a known concept is encoded | Neologisms create bidirectional interfaces, not just detectors |
| Activation steering | Add/subtract vectors to change behavior | Neologisms are steering via natural language: composable and interpretable |
| RLHF | Align model behavior to human preferences | Neologisms provide fine-grained, concept-level control instead of holistic alignment |

python
# Neologisms vs activation steering comparison
#
# Activation steering (Anthropic, 2024), shown as illustrative pseudocode
# (steering_vector is not an actual forward() argument):
#   model.forward(x, steering_vector=happiness_direction * 2.0)
#   Pros: precise, well-understood mathematically
#   Cons: requires code-level access, not composable via text
#
# Neologisms:
#   model.generate("Write happily with emotrex-joy")
#   Pros: works via natural language, composable, bidirectional
#   Cons: less mathematically precise, requires training
#
# Think of neologisms as "activation steering with a user-friendly API"
# The underlying mechanism is similar (adding a direction to the
# representation space), but the interface is natural language
The vision. Imagine a model with a complete neologism vocabulary — every significant internal feature has a name. Users control behavior precisely. The model explains every decision by referencing specific internal states. Safety auditors monitor specific features by name. Researchers discuss internal computations using shared vocabulary. This is interpretability not as a post-hoc analysis tool, but as a native capability of the system.

"To name something is to have power over it. To give an AI names for its own concepts is to give both the AI and humanity power over what the AI does."

[Interactive demo: The Full Vision. A coverage slider shows neologism coverage growing from a few named features to a fully interpretable model.]

What is the long-term vision for neologism learning?