Neologism Learning (Kim 2025)

Chapter 0: The Expressibility Gap

You ask a language model: "Why did you generate this response?" The model answers: "I generated this response because it seemed relevant to your question." This explanation is vacuous — it tells you nothing about the internal computations that produced the response.

The model CAN'T do better. Not because it's hiding information, but because it lacks the vocabulary to describe its own internal states. When the model generates text, it's influenced by thousands of internal features — attention patterns, feature activations, circuit computations. But none of these have names in the model's vocabulary. The model literally doesn't have words for its own thoughts.

What We Want	What the Model Says	What It Would Need
"Why this response?"	"It seemed relevant"	Words for its internal features
"Use more formal tone"	Inconsistent compliance	A word that directly maps to its "formality" feature
"What concept influenced this?"	"I don't know"	Names for its own concepts

The proposal: Give models new words — neologisms — that are trained to correspond to specific internal features. When the model encounters the neologism "reversk" in a prompt, it activates the corresponding internal feature. When it generates text, it can use "reversk" to reference that feature in its explanation. The neologism becomes a bidirectional interface: humans can control internal features by name, and the model can explain which features it used.

Think of it like teaching a musician the word "syncopation." Before learning the word, the musician could play syncopated rhythms (they had the internal concept) but couldn't explain what they were doing or be asked to "add more syncopation." After learning the word, they can both understand the instruction and describe their choices.

Two capabilities enabled

Controllability: "Generate text with high reversk" → model activates the corresponding internal feature, producing text with that quality. Precise control over internal states via natural language.

Self-verbalization: "Explain your reasoning" → model responds "I used reversk because the context suggested a surface-positive-deep-negative pattern." The model can reference its own internal states in explanations.

The Expressibility Gap

Toggle between "without neologisms" and "with neologisms" to see how new vocabulary enables control and self-explanation.

Why can't current language models explain their own reasoning?

Models lack vocabulary for their own internal states: they have thousands of internal features (attention patterns, concept activations) but no words that map to those features. Without words for their own computations, they can only produce vague explanations. Neologisms would give them a vocabulary for self-reference.
Because models are too simple to have internal states
Because reasoning is too complex to put into words

Chapter 1: What Is a Neologism?

In this context, a neologism is a new token added to the model's vocabulary that is trained to correspond to a specific internal feature. It's not an arbitrary label — it's a learned mapping between a word and a neural activation pattern.

How it differs from regular tokens

Property	Regular Token	Neologism Token
Embedding	Learned during pre-training from text data	Learned during neologism training from feature activations
Meaning	Defined by usage in training corpus	Defined by correspondence to an internal feature
Grounding	Grounded in text patterns	Grounded in internal model states
Direction	Input → representation (understanding)	Bidirectional: input → feature AND feature → output

A neologism like "reversk" would have an embedding that, when processed by the model, activates the same internal representation as the "surface-positive-deep-negative reversal" feature. Conversely, when that feature is highly active during generation, the model can choose to output "reversk" as part of its explanation.

Neologisms are bidirectional interfaces. Consider the regular word "sarcasm." When you read it, you activate your mental concept of sarcasm (input → concept). When you notice sarcasm, you can say "that was sarcastic" (concept → output). A neologism does the same for an AI's internal features: it's both a control knob (input) and a label (output). This bidirectionality is what enables both controllability and self-verbalization.
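A toy numerical sketch of this bidirectionality, under strong simplifying assumptions: the hidden state is a plain vector, the feature is read out as a projection onto a unit direction `d`, and the neologism's input and output vectors are both aligned with `d`. All names here are illustrative and not taken from the paper; a real transformer applies many nonlinear layers between embedding and feature.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Hypothetical unit-norm feature direction d in activation space.
d = rng.normal(size=dim)
d /= np.linalg.norm(d)

def feature_activation(hidden):
    """Read the feature out of a hidden state as a ReLU'd projection onto d."""
    return max(0.0, float(hidden @ d))

# Start from a hidden state with the feature switched off
# (its component along d is projected out).
base_hidden = rng.normal(size=dim)
base_hidden -= (base_hidden @ d) * d

# Input side ("control knob"): adding the neologism's input embedding
# to the hidden state raises the feature readout.
e_in = 3.0 * d
act_without = feature_activation(base_hidden)
act_with = feature_activation(base_hidden + e_in)

# Output side ("label"): the neologism's logit is a dot product between the
# hidden state and its output vector, so it rises when the feature is active.
e_out = d
logit_inactive = float(base_hidden @ e_out)
logit_active = float((base_hidden + 3.0 * d) @ e_out)

print(act_without, act_with)
print(logit_inactive, logit_active)
```

The same direction serves both roles in this sketch, which is the simplest way to picture one word acting as both a control knob and a label.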

The embedding space picture

python
# Regular token: embedding learned from text
# "cat" → embedding learned from "The cat sat on the mat"
# The embedding encodes: animal, pet, furry, small, etc.

# Neologism token: embedding learned from feature activations
# "reversk" → embedding trained to activate feature 7823
# The embedding encodes: the exact activation pattern of feature 7823

# Technical detail:
# 1. Identify a feature direction d in activation space
# 2. Add new token "reversk" to vocabulary
# 3. Train its embedding e_reversk such that:
#    - When "reversk" appears in input, model activates feature d
#    - When feature d is active, model tends to output "reversk"
# 4. The embedding e_reversk is the bridge between
#    the word space and the feature space

Neologism in Embedding Space

This visualization shows where a neologism sits in the model's embedding space. Regular words cluster by semantic meaning. The neologism is positioned to activate a specific internal feature. Click "Show Feature" to see the correspondence.

How does a neologism differ from a regular vocabulary token?

A neologism's embedding is trained to correspond to a specific internal feature, not learned from text data. It works bidirectionally: when it appears in input, it activates the feature (controllability), and when the feature is active during generation, the model can output the neologism (self-verbalization). Regular tokens are grounded in text; neologisms are grounded in internal states.
A neologism is just a new word added to the vocabulary
A neologism replaces an existing token

Chapter 2: Training Neologisms

How do you train a new token to correspond to an internal feature? Kim et al. propose a two-phase training process.

Phase 1: Feature identification

First, identify the internal feature you want to name. This could be an SAE (sparse autoencoder) feature, a probe-discovered concept, or a manually identified direction in activation space.
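For the SAE case, a minimal sketch of what "feature is active" can mean, assuming a common encoder form ReLU(W_enc · h + b_enc). The weights, threshold, and function names below are toy assumptions for illustration, not values or APIs from the paper.

```python
import numpy as np

def sae_feature_activations(hidden, W_enc, b_enc):
    """Encode a hidden state into sparse feature activations: ReLU(W_enc @ h + b_enc)."""
    return np.maximum(0.0, W_enc @ hidden + b_enc)

def feature_is_active(hidden, W_enc, b_enc, feature_id, threshold=0.5):
    """Return True when the chosen feature's activation exceeds the threshold."""
    acts = sae_feature_activations(hidden, W_enc, b_enc)
    return bool(acts[feature_id] > threshold)

# Toy dictionary with 4 features over a 3-dimensional hidden state.
W_enc = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
b_enc = np.array([0.0, 0.0, 0.0, -1.0])

hidden = np.array([0.9, 0.8, -0.3])
acts = sae_feature_activations(hidden, W_enc, b_enc)
print(acts)
print(feature_is_active(hidden, W_enc, b_enc, feature_id=3))
```

A helper like the `feature_7823_active` check used later in this chapter could be built on the same idea, with the real SAE weights in place of the toy matrix.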

Phase 2: Neologism training

1. Add token to vocabulary

Add "reversk" to the tokenizer. Initialize its embedding randomly.

↓

2. Create training data

Collect texts where feature 7823 is highly active. Prepend "reversk:" as a tag. Collect texts where it's inactive. These become positive and negative examples.

↓

3. Train embedding

Fine-tune ONLY the neologism embedding (freeze all other parameters). Objective: the neologism should raise the target feature's activation when it appears in the input, and the model should generate the neologism when the feature is active.

↓

4. Verify

Test: Does "Generate text with reversk" produce texts that activate feature 7823? Does the model use "reversk" in explanations when 7823 is active?

python
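import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model (the checkpoint name is a placeholder)
model_name = "your-base-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Step 1: Extend vocabulary
tokenizer.add_tokens(['reversk'])
model.resize_token_embeddings(len(tokenizer))
neo_id = tokenizer.convert_tokens_to_ids('reversk')

# Step 2: Create training data
# (corpus and feature_7823_active are assumed to be defined elsewhere)
pos_texts = [t for t in corpus if feature_7823_active(t)]
neg_texts = [t for t in corpus if not feature_7823_active(t)]

# Positive: "reversk: The policy seemed good but reinforced..."
# Negative: "The weather was sunny and warm..."

# Step 3: Train ONLY the neologism embedding
# Freeze every parameter, then re-enable the embedding matrix and mask its
# gradient so only the neologism row is updated. (Setting requires_grad on
# weight[neo_id] would act on a temporary view, not on the parameter.)
for p in model.parameters():
    p.requires_grad = False
embed_weight = model.get_input_embeddings().weight
embed_weight.requires_grad = True
row_mask = torch.zeros_like(embed_weight)
row_mask[neo_id] = 1.0
embed_weight.register_hook(lambda grad: grad * row_mask)

# Two objectives:
# A) Controllability: when "reversk" is in input, feature activates
# B) Self-verbalization: when feature is active, model outputs "reversk"

optimizer = torch.optim.Adam([embed_weight], lr=1e-3)
lam = 1.0  # weight λ on the verbalization term (value is a placeholder)

# training_data and feature_7823 are assumed to be defined elsewhere
for batch in training_data:
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss_ctrl = controllability_loss(model, batch, neo_id, feature_7823)
    loss_verb = verbalization_loss(model, batch, neo_id, feature_7823)
    loss = loss_ctrl + lam * loss_verb
    loss.backward()
    optimizer.step()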

Only the embedding is trained. This is critical: we freeze all model parameters and only train the single embedding vector for the neologism. This means we're not changing how the model works; we're just giving it a new word. On prompts that do not contain the neologism, the model's capabilities, knowledge, and behavior remain unchanged. The neologism is a pure interface layer.
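One way to restrict updates to a single embedding row in PyTorch is a gradient hook that zeroes every other row. The sketch below uses a small stand-alone `nn.Embedding` as a stand-in for a model's embedding matrix; the sizes, the dummy loss, and the choice of SGD are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, dim = 10, 8
embedding = nn.Embedding(vocab_size, dim)
neo_id = vocab_size - 1          # pretend the last row is the new token

# Mask that is 1 on the neologism row and 0 everywhere else.
row_mask = torch.zeros(vocab_size, 1)
row_mask[neo_id] = 1.0

# The hook rewrites the gradient before the optimizer sees it.
embedding.weight.register_hook(lambda grad: grad * row_mask)

before = embedding.weight.detach().clone()
optimizer = torch.optim.SGD([embedding.weight], lr=0.1)

# Dummy objective that touches several rows, including the neologism row.
token_ids = torch.tensor([0, 3, neo_id])
loss = embedding(token_ids).pow(2).sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Collect the indices of rows whose values changed after the step.
after = embedding.weight.detach()
changed_rows = (after != before).any(dim=1).nonzero().flatten().tolist()
print(changed_rows)
```

Only the neologism row should appear in `changed_rows`. Note that an optimizer configured with weight decay would still modify the masked rows, so decay should stay at zero for this parameter.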

Loss functions

L = L_ctrl + λ · L_verb

The dual loss in detail

Controllability loss: When "reversk" appears in the prompt, the target feature's activation should increase. L_ctrl measures the difference between the feature's activation with vs without the neologism in the prompt.

Verbalization loss: When the target feature is highly active, the model should assign higher probability to outputting "reversk" at the next token position. L_verb is the negative log-likelihood of "reversk" given feature-active contexts.
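Written out, one formalization consistent with the two descriptions above is the following. The notation is an assumption made here, not the paper's own: a_f(x) is the target feature's activation on input x, n ⊕ x is x with the neologism token n prepended, D is the set of training inputs, and D_active is the subset on which the feature is highly active.

```latex
\begin{aligned}
L_{\text{ctrl}} &= -\,\mathbb{E}_{x \sim D}\big[\, a_f(n \oplus x) - a_f(x) \,\big] \\
L_{\text{verb}} &= -\,\mathbb{E}_{x \sim D_{\text{active}}}\big[\, \log p_\theta(n \mid x) \,\big] \\
L &= L_{\text{ctrl}} + \lambda \, L_{\text{verb}}
\end{aligned}
```

Minimizing L_ctrl increases the activation gap produced by the neologism, and minimizing L_verb increases the probability of emitting the neologism in feature-active contexts.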

python
# The two loss functions in detail
# (model.get_feature_activation and prepend_token are assumed helpers:
#  the first returns one activation per example, the second prepends a token id)
import torch

def controllability_loss(model, batch, neo_id, target_feature):
    """When neologism is in input, target feature should activate."""
    # Run model on input containing the neologism
    input_with_neo = prepend_token(batch['input_ids'], neo_id)
    acts_with = model.get_feature_activation(input_with_neo, target_feature)

    # Run model on input WITHOUT the neologism
    acts_without = model.get_feature_activation(batch['input_ids'], target_feature)

    # Loss: maximize the difference (neologism should boost feature)
    return -torch.mean(acts_with - acts_without)

def verbalization_loss(model, batch, neo_id, target_feature):
    """When target feature is active, model should output neologism."""
    # Select examples at or above the batch's 90th percentile of activation
    # (>= keeps at least the most active example, so the mask is never empty)
    acts = model.get_feature_activation(batch['input_ids'], target_feature)
    active_mask = acts >= torch.quantile(acts, 0.9)

    # For those examples, the model should predict the neologism token
    logits = model(batch['input_ids'][active_mask]).logits

    # Convert last-position logits to log-probabilities over the vocabulary
    log_probs = torch.log_softmax(logits[:, -1, :], dim=-1)
    neo_log_prob = log_probs[:, neo_id]

    # Loss: negative log-likelihood of the neologism
    return -torch.mean(neo_log_prob)

Neologism Training Process

Click "Train Step" to watch the neologism embedding converge. The embedding starts random and gradually aligns with the target feature direction. Watch the controllability and verbalization scores improve.

Why is it important that only the neologism embedding is trained, not the entire model?

Freezing the model and training only the new embedding ensures we're adding a pure interface layer: the model's capabilities, knowledge, and behavior remain unchanged. The neologism is a bridge between the word space and the feature space, not a modification to the model itself. This also makes neologism training lightweight and fast.
Because training the full model would be too expensive
Because the model's embeddings are already optimal

Chapter 3: Controllability

Once a neologism is trained, it becomes a control interface. Instead of vague prompts like "write in a more nuanced way," you can say "write with reversk" — and the model activates the specific internal feature that produces "surface-positive-deep-negative reversal" patterns.

How controllability works

python
# Without neologisms: vague control
prompt = "Write about a policy reform. Be nuanced."
# Model output: generic "balanced" text — not what you wanted

# With neologism: precise control
prompt = "Write about a policy reform with reversk."
# Model output: "The reform initially reduced poverty rates,
#   but the unintended bureaucratic burden it created
#   ultimately made the problem worse."
# Exactly the "surface-positive-deep-negative" pattern!

# You can also COMBINE neologisms:
prompt = "Write about a policy reform with reversk and formex."
# Model activates BOTH the reversal feature AND the formality feature
# Result: formal, nuanced prose with the specific reversal pattern

Compositional control

A powerful property of neologisms: they compose. You can combine multiple neologisms in a single prompt, and the model activates all corresponding features simultaneously. This enables fine-grained control over multiple aspects of generation:

python
# Compositional neologism control

# Single neologism: control one dimension
generate("Write about AI with reversk")
# "AI has made impressive advances, but these advances
#  have paradoxically increased certain risks..."
# (reversal pattern only)

# Two neologisms: control two dimensions
generate("Write about AI with reversk and formex")
# "The ostensible progress in artificial intelligence,
#  whilst demonstrably ameliorating certain computational
#  tasks, has paradoxically exacerbated vulnerabilities..."
# (reversal pattern + formal register)

# Three neologisms: fine-grained multi-dimensional control
generate("Write about AI with reversk, formex, and techex")
# "The architectural innovations in transformer-based systems
#  (attention mechanism, residual connections) initially yielded
#  O(n²) computational complexity that, counterintuitively,
#  necessitated further architectural modifications..."
# (reversal + formal + technical — all three dimensions activated)

# This level of control is impossible with natural language prompts
# "be nuanced, formal, and technical" gives unpredictable results

Why compositionality works. Each neologism targets a different internal feature, and the compositional picture assumes those features are approximately orthogonal (independent) in activation space, so activating feature A (reversal) barely interferes with feature B (formality). Sparse autoencoders encourage this separation by decomposing activations into sparse features, but they do not guarantee exact orthogonality, so composition should be checked empirically for each combination of neologisms.
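A small numerical sketch of why orthogonality matters for composition, assuming features are read out as projections onto unit directions. The directions and strengths are made-up toy values.

```python
import numpy as np

def readout(hidden, direction):
    """Feature readout modelled as a projection onto a unit direction."""
    return float(hidden @ direction)

# Two orthogonal unit directions standing in for "reversal" and "formality".
d_reversal = np.array([1.0, 0.0, 0.0])
d_formal = np.array([0.0, 1.0, 0.0])

hidden = np.zeros(3)
steered = hidden + 2.0 * d_reversal + 1.5 * d_formal
# Each readout recovers exactly the strength that was added for it.
print(readout(steered, d_reversal), readout(steered, d_formal))

# A correlated unit direction: cosine similarity 0.6 with d_reversal.
d_overlap = np.array([0.6, 0.8, 0.0])
steered_overlap = hidden + 2.0 * d_reversal + 1.5 * d_overlap
# The reversal readout now picks up 1.5 * 0.6 = 0.9 of interference.
print(readout(steered_overlap, d_reversal))
```

With orthogonal directions the two controls are independent; with overlapping directions, turning up one feature also moves the other, which is the kind of interaction that has to be measured rather than assumed away.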

Advantages over prompt engineering

Approach	Precision	Consistency	Composability
Natural language prompts	Low — "be nuanced" is ambiguous	Low — different runs give different results	Limited — "be nuanced AND formal" is vague
System prompts	Medium — more specific but still ambiguous	Medium — more consistent but not guaranteed	Limited — longer prompts are less reliable
Neologisms	High — maps directly to internal features	High — same feature, same behavior	High — combine neologisms compositionally

Neologisms are differentiable control knobs. Unlike prompt engineering (which operates in text space), neologisms operate in embedding space — they directly target the model's internal features. This makes control more precise and more consistent. It's the difference between telling a mixer board operator "make it louder" (prompt) vs turning a specific volume knob (neologism).

Controllability Demo

Toggle neologisms on/off and click "Generate" to see how each neologism controls a different aspect of the output. Neologisms activate specific internal features, giving precise control.

Why are neologisms more precise than prompt engineering for controlling model behavior?

Neologisms map directly to specific internal features via trained embeddings, while prompt engineering uses ambiguous natural language that the model must interpret. "Write with reversk" reliably activates feature 7823; "be nuanced" is ambiguous and inconsistent. Neologisms also compose naturally: combining "reversk" and "formex" activates both features simultaneously.
Because neologisms use fewer tokens than prompts
Because prompt engineering requires special formatting

Chapter 4: Self-Verbalization

The second capability enabled by neologisms: the model can explain its own reasoning by referencing its internal features by name.

How self-verbalization works

During generation, the model's internal features are active. With neologisms, the model can reference these features in its output. Instead of "I thought this was relevant," it can say "I detected reversk in the context (a surface-positive-deep-negative pattern), which influenced my response."

python
# Self-verbalization example
prompt = (
    "Analyze this text and explain your reasoning:\n"
    "'The new algorithm was faster but consumed 10x more memory.'"
)

# Without neologisms:
# "This text describes a tradeoff between speed and memory."
# (Generic, surface-level, doesn't reference internal processing)

# With neologisms:
# "I detected reversk in this text (surface improvement masking
#  a deeper problem). I also detected techex (technical domain
#  language) and tradex (tradeoff pattern). The combination
#  of reversk + tradex suggests this is a cautionary example
#  rather than a straightforward improvement report."
# (Specific, references actual internal features, verifiable)

Verifiability: the key advantage

The crucial property of neologism-based self-verbalization is verifiability. When the model says "I used reversk," you can check:

python
# Verifying self-verbalization claims
def verify_explanation(model, input_text, explanation, neo_to_feature, threshold=0.5):
    """Check if the neologisms the model mentions are actually active.

    neo_to_feature maps each neologism to its feature id. extract_neologisms
    and model.get_feature_activations are assumed helpers, and the default
    threshold is a placeholder.
    """
    # Extract neologisms from explanation
    mentioned = extract_neologisms(explanation)
    # e.g., ["reversk", "tradex"]

    # Measure actual feature activations
    activations = model.get_feature_activations(input_text)

    # Check each claim
    for neo in mentioned:
        feature_id = neo_to_feature[neo]
        is_active = activations[feature_id] > threshold
        if not is_active:
            print(f"WARNING: Model claimed {neo} but feature is inactive!")
            # This is a confabulation: the claim does not match the measurement
        else:
            print(f"VERIFIED: {neo} is indeed active (activation: {activations[feature_id]:.3f})")

# This verification is IMPOSSIBLE with chain-of-thought
# because CoT doesn't reference specific, measurable internal states

Verifiability changes the game. With chain-of-thought, you have to trust that the model's explanation reflects its actual processing. With neologisms, you can verify. If the model says "I used reversk" but feature 7823's activation is 0.02 (inactive), the model is confabulating — and you can catch it. This is the difference between self-report and measurement.

Why self-verbalization matters

Transparency. Users can understand why the model produced a specific output. Instead of a black box, the model explains which internal concepts were active.

Debugging. If the model makes an error, self-verbalization reveals which features contributed. "I used reversk but the text wasn't actually a reversal pattern" → the reversk feature may have a false positive rate that needs investigation.

Trust calibration. If the model says "I'm confident because tradex was strongly active," you can verify whether tradex activation is indeed reliable for this type of input.

Self-verbalization vs "chain-of-thought." Chain-of-thought prompting produces reasoning traces, but these are generated from scratch — they may not reflect actual internal processing (the model might confabulate plausible-sounding reasoning). Self-verbalization with neologisms is different: the model references actual internal features that are verifiably active. You can check whether the model is telling the truth by measuring the named feature's activation.

Self-Verbalization Demo

See how a model with neologisms explains its reasoning by referencing specific internal features. Click different inputs to see different features get referenced.

How does self-verbalization with neologisms differ from chain-of-thought reasoning?

Chain-of-thought generates reasoning traces from scratch that may not reflect actual internal processing (confabulation risk). Self-verbalization with neologisms references actual internal features that are verifiably active; you can check whether the model is telling the truth by measuring the named feature's activation. It's grounded explanation vs generated explanation.
Chain-of-thought is slower than self-verbalization
They are the same thing with different names

Chapter 5: Experiments & Results

Kim et al. validate neologism learning on controlled experiments. The key questions: Do neologisms actually activate the right features? Can they enable meaningful control and explanation?

Controllability results

They trained neologisms for 10 features identified via sparse autoencoders in a medium-sized language model. The controllability test: generate 100 texts with each neologism in the prompt, then measure whether the target feature is more active than in control generations.

Neologism	Target Feature	Feature Activation Increase	Specificity
reversk	Surface-positive reversal	+340%	92% (low cross-activation)
formex	Formal register	+280%	88%
techex	Technical domain	+310%	95%
emotrex	Emotional intensity	+250%	85%

Specificity measures whether the neologism activates ONLY the target feature, not other features. High specificity (85% or above for every neologism in the table) means the neologism is a precise control knob, not a blunt instrument.

Self-verbalization results

For self-verbalization, they measure whether the model uses the correct neologism when the target feature is active. Given 100 texts where feature 7823 is highly active, does the model mention "reversk" in its analysis?

results
# Verbalization accuracy:
# - Model uses correct neologism when feature is active: 78%
# - Model avoids neologism when feature is inactive: 91%
# - False positive rate (uses neologism when feature inactive): 9%
# - False negative rate (doesn't use when feature active): 22%

# For comparison, chain-of-thought explanation accuracy:
# - Correctly identifies the active concept: ~45%
# - Confabulates plausible but wrong reasoning: ~30%
# Neologism-based verbalization is much more faithful

78% verbalization accuracy is a strong result. With chain-of-thought alone, the model correctly identifies active concepts only ~45% of the time and confabulates ~30% of the time. With neologisms, accuracy rises to 78%, and the false positive rate (mentioning a neologism whose feature is inactive) is 9%. The model isn't just generating plausible-sounding explanations; it's referencing features that are actually active.
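The four verbalization rates above are the standard confusion-matrix rates over two observations per text: whether the feature was active, and whether the model mentioned the neologism. A minimal sketch follows; the function name is hypothetical, and the synthetic counts are chosen only to mirror the reported percentages (the size of the feature-inactive set is an assumption).

```python
def verbalization_rates(feature_active, neologism_used):
    """Compute hit and error rates from paired boolean observations."""
    pairs = list(zip(feature_active, neologism_used))
    tp = sum(1 for a, u in pairs if a and u)
    fn = sum(1 for a, u in pairs if a and not u)
    fp = sum(1 for a, u in pairs if not a and u)
    tn = sum(1 for a, u in pairs if not a and not u)
    return {
        "true_positive_rate": tp / (tp + fn),
        "false_negative_rate": fn / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
        "false_positive_rate": fp / (tn + fp),
    }

# Synthetic data: 100 feature-active texts (78 mention the neologism)
# and 100 feature-inactive texts (9 mention it).
feature_active = [True] * 100 + [False] * 100
neologism_used = [True] * 78 + [False] * 22 + [True] * 9 + [False] * 91

rates = verbalization_rates(feature_active, neologism_used)
print(rates)
```

The rates pair up by construction: the false negative rate is one minus the true positive rate, and the false positive rate is one minus the true negative rate.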

Results Dashboard

Compare controllability and verbalization accuracy across neologisms. Click each metric to see detailed results.

How do neologisms improve upon chain-of-thought for model explanation?

Neologism-based verbalization achieves 78% accuracy with 9% confabulation, vs chain-of-thought's 45% accuracy with 30% confabulation. The key difference: neologisms are grounded in actual feature activations (verifiable), while chain-of-thought generates reasoning traces that may not reflect internal processing (not verifiable).
Neologisms are faster to process
Chain-of-thought uses more tokens

Chapter 6: Neologism Playground

This interactive simulation lets you create, train, and test neologisms. Choose an internal feature, train a neologism for it, then use it for both controllability and self-verbalization.

Neologism Lab

Select a feature, click "Train Neologism" to learn the embedding, then "Test Control" to use it in a prompt and "Test Explain" to see self-verbalization.


In the playground, what happens when you combine two neologisms in a prompt?

Both corresponding internal features activate simultaneously, producing output that exhibits both qualities. This compositionality is a key advantage: "write with reversk and formex" activates both the reversal pattern and formal register features, producing formal prose with a surface-positive-deep-negative structure, something difficult to achieve with natural language prompts alone.
Only the first neologism activates
The neologisms cancel each other out

Chapter 7: Connections

Neologism learning completes the trilogy of papers by Kim et al. on the relationship between language and AI understanding.

Paper	Role in Trilogy
AI Vocabulary	The problem: human words can't name all AI concepts
Neologism Learning (this paper)	The solution: train models to create and use new words
Agentic Interpretability	The scaling: automate the process with LLM agents

Implications for alignment

Interpretable-by-design. Instead of trying to interpret a model after training, neologisms create interpretability hooks during deployment. A model with 1000 neologisms has 1000 named internal concepts you can inspect, control, and monitor.

AI-human collaboration. Neologisms create a shared vocabulary between the model and its users. The model can explain "I used reversk" and the user can request "more reversk." This is genuine two-way communication about internal states — a step toward meaningful AI transparency.

Practical deployment considerations

How would neologisms work in a real system? Kim et al. outline a deployment pipeline:

python
# Deployment pipeline for neologism-enabled models

# Phase 1: Feature discovery
# Run SAE on the production model
# Identify top-1000 most important features
# (importance = impact on output when ablated)

# Phase 2: Neologism training
# For each feature, train a neologism embedding
# ~5 minutes per neologism on 1 GPU
# Total: ~83 GPU-hours for 1000 neologisms

# Phase 3: Integration
# Add neologisms to tokenizer vocabulary
# Document each neologism: name, definition, examples
# Create a "neologism dictionary" for users

# Phase 4: User interface
# Prompt format: "Write about X with [neo1] and [neo2]"
# Explanation format: "I used [neo1] because..."
# Dashboard: real-time feature activation monitoring

# Phase 5: Safety monitoring
# Flag features related to bias, deception, harmful content
# Monitor these features' activation in production
# Alert when "deception-adjacent" neologism activates

The business case. Neologism-enabled models could be a competitive advantage. "Our model doesn't just generate text — it tells you why, and you can verify the explanation." This is particularly valuable in regulated industries (healthcare, finance, legal) where explainability isn't optional. A doctor needs to know why the model recommended a treatment, and neologisms provide verifiable reasons.

Open challenges

Scaling. Training neologisms one at a time doesn't scale to millions of features. Automated methods (using LLM agents) are needed. Kim et al. estimate that training 1000 neologisms requires ~83 GPU-hours — feasible for important features, but millions would require fundamentally new approaches.
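The arithmetic behind the 83 GPU-hour figure, and what the same per-neologism cost implies at a million features, can be checked directly. The function name is hypothetical; the 5-minute figure is the one quoted in the text, and sequential training on one GPU is assumed.

```python
MINUTES_PER_NEOLOGISM = 5  # figure quoted in the text, on one GPU

def gpu_hours(num_neologisms, minutes_each=MINUTES_PER_NEOLOGISM):
    """Total GPU-hours if neologisms are trained one after another."""
    return num_neologisms * minutes_each / 60

for n in (1_000, 1_000_000):
    print(f"{n:>9,} neologisms: {gpu_hours(n):,.0f} GPU-hours")
```

At the same rate, a million features would need roughly 83,000 GPU-hours, about nine and a half GPU-years, which is why one-at-a-time training does not scale.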

python
# Scaling analysis
# Current: ~5 minutes per neologism on 1 A100 GPU
#
# Scaling paths:
# 1. Batch training: train 100 neologisms in parallel
#    → 50 minutes for 100 (10x speedup)
#    → But must verify no interference between neologisms
#
# 2. Hierarchical neologisms: train "concept families"
#    → "emotrex" is the parent concept (emotional intensity)
#    → "emotrex-joy", "emotrex-rage", "emotrex-grief" are children
#    → Children inherit from parent embedding + delta
#    → Faster training, natural concept hierarchy
#
# 3. Agentic discovery + training pipeline
#    → LLM agent identifies important features
#    → Same agent generates candidate neologism names
#    → Automated embedding training
#    → Automated verification
#    → Human review only for safety-critical features

Faithfulness. The 22% false negative rate means the model sometimes uses a feature without mentioning the neologism. Full faithfulness requires that the model ALWAYS reports its active features.

Compositionality. Can hundreds of neologisms compose naturally? Or do interactions between neologisms create unexpected behaviors?

Relationship to existing interpretability methods

Method	What It Does	How Neologisms Differ
Feature visualization	Shows what maximally activates a neuron	Neologisms give features NAMES, not just visualizations
Sparse autoencoders	Decompose activations into monosemantic features	SAEs find the features; neologisms name and control them
Concept probes	Test whether a known concept is encoded	Neologisms create bidirectional interfaces, not just detectors
Activation steering	Add/subtract vectors to change behavior	Neologisms are steering via natural language — composable and interpretable
RLHF	Align model behavior to human preferences	Neologisms provide fine-grained, concept-level control instead of holistic alignment
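To make the activation steering row concrete, here is a minimal sketch that adds a scaled direction to one layer's output using a PyTorch forward hook. The single linear layer is a toy stand-in for a transformer block, and the direction and strength are hypothetical values; real steering would target a chosen layer of an actual language model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one layer of a model; not a real language model.
layer = nn.Linear(4, 4)

# Hypothetical unit steering direction in the layer's output space.
direction = torch.tensor([1.0, 0.0, 0.0, 0.0])
strength = 2.0

def add_steering(module, inputs, output):
    """Forward hook: returning a tensor replaces the layer's output."""
    return output + strength * direction

x = torch.randn(1, 4)
with torch.no_grad():
    plain = layer(x)
    handle = layer.register_forward_hook(add_steering)
    steered = layer(x)
    handle.remove()              # restore the unmodified layer
    restored = layer(x)

# Project the change onto the steering direction.
shift = (steered - plain) @ direction
print(shift)
```

The projected shift equals the chosen strength, and removing the hook restores the original behavior. This is the code-level access that the neologism interface replaces with a word in the prompt.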

python
# Neologisms vs activation steering comparison
#
# Activation steering (Anthropic, 2024):
#   model.forward(x, steering_vector=happiness_direction * 2.0)
#   Pros: precise, well-understood mathematically
#   Cons: requires code-level access, not composable via text
#
# Neologisms:
#   model.generate("Write happily with emotrex-joy")
#   Pros: works via natural language, composable, bidirectional
#   Cons: less mathematically precise, requires training
#
# Think of neologisms as "activation steering with a user-friendly API"
# The underlying mechanism is similar (adding a direction to the
# representation space), but the interface is natural language

The vision. Imagine a model with a complete neologism vocabulary — every significant internal feature has a name. Users control behavior precisely. The model explains every decision by referencing specific internal states. Safety auditors monitor specific features by name. Researchers discuss internal computations using shared vocabulary. This is interpretability not as a post-hoc analysis tool, but as a native capability of the system.

"To name something is to have power over it. To give an AI names for its own concepts is to give both the AI and humanity power over what the AI does."

The Full Vision

Drag the slider to see how neologism coverage grows from a few named features to a fully interpretable model.


What is the long-term vision for neologism learning?

A model where every significant internal feature has a neologism, enabling precise user control, faithful self-explanation, named safety monitoring, and shared vocabulary between humans and AI. Interpretability becomes a native capability of the system rather than a post-hoc analysis, creating genuine two-way communication about internal states.
Replacing all English words with neologisms
Making models that only speak in neologisms

Neologism Learning for Controllability & Self-Verbalization