Train models to create and use new words for their internal concepts, enabling controllability ("activate the X concept") and self-explanation ("I used X because...").
You ask a language model: "Why did you generate this response?" The model answers: "I generated this response because it seemed relevant to your question." This explanation is vacuous: it tells you nothing about the internal computations that produced the response.
The model can't do better. This is not because it is hiding information, but because it lacks the vocabulary to describe its own internal states. When the model generates text, it is influenced by thousands of internal features: attention patterns, feature activations, circuit computations. None of these have names in the model's vocabulary, so the model has no words for its own computations.
| What We Want | What the Model Says | What It Would Need |
|---|---|---|
| "Why this response?" | "It seemed relevant" | Words for its internal features |
| "Use more formal tone" | Inconsistent compliance | A word that directly maps to its "formality" feature |
| "What concept influenced this?" | "I don't know" | Names for its own concepts |
Think of it like teaching a musician the word "syncopation." Before learning the word, the musician could play syncopated rhythms (they had the internal concept) but couldn't explain what they were doing or be asked to "add more syncopation." After learning the word, they can both understand the instruction and describe their choices.
Controllability: "Generate text with high reversk" → model activates the corresponding internal feature, producing text with that quality. Precise control over internal states via natural language.
Self-verbalization: "Explain your reasoning" → model responds "I used reversk because the context suggested a surface-positive-deep-negative pattern." The model can reference its own internal states in explanations.
In this context, a neologism is a new token added to the model's vocabulary that is trained to correspond to a specific internal feature. It's not an arbitrary label — it's a learned mapping between a word and a neural activation pattern.
| Property | Regular Token | Neologism Token |
|---|---|---|
| Embedding | Learned during pre-training from text data | Learned during neologism training from feature activations |
| Meaning | Defined by usage in training corpus | Defined by correspondence to an internal feature |
| Grounding | Grounded in text patterns | Grounded in internal model states |
| Direction | Input → representation (understanding) | Bidirectional: input → feature AND feature → output |
A neologism like "reversk" would have an embedding that, when processed by the model, activates the same internal representation as the "surface-positive-deep-negative reversal" feature. Conversely, when that feature is highly active during generation, the model can choose to output "reversk" as part of its explanation.
```python
# Regular token: embedding learned from text
# "cat" → embedding learned from "The cat sat on the mat"
# The embedding encodes: animal, pet, furry, small, etc.

# Neologism token: embedding learned from feature activations
# "reversk" → embedding trained to activate feature 7823
# The embedding encodes: the exact activation pattern of feature 7823

# Technical detail:
# 1. Identify a feature direction d in activation space
# 2. Add new token "reversk" to vocabulary
# 3. Train its embedding e_reversk such that:
#    - When "reversk" appears in input, model activates feature d
#    - When feature d is active, model tends to output "reversk"
# 4. The embedding e_reversk is the bridge between
#    the word space and the feature space
```
Figure: a neologism in the model's embedding space. Regular words cluster by semantic meaning, while the neologism is positioned to activate a specific internal feature.
How do you train a new token to correspond to an internal feature? Kim et al. propose a two-phase training process.
First, identify the internal feature you want to name. This could be a SAE feature, a probe-discovered concept, or a manually identified direction in activation space.
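As one illustration of a "manually identified direction in activation space", the sketch below estimates a direction as the difference between the mean activation on texts that show the concept and the mean on texts that do not. This difference-of-means recipe is a common simple baseline chosen here for illustration, not necessarily the method Kim et al. use. The function names and the toy 3-dimensional activations are made up, and plain Python lists stand in for real model activations.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def feature_direction(pos_acts, neg_acts):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("positive and negative means coincide; no direction found")
    return [x / norm for x in diff]

def feature_activation(activation, direction):
    """Scalar activation: projection of one activation vector onto the direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Toy 3-dimensional activations (made-up numbers, not real model data).
pos = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]   # texts where the concept is present
neg = [[0.0, 0.1, 0.1], [-0.2, -0.1, 0.1]]  # texts where it is absent
d = feature_direction(pos, neg)
```

Projecting any new activation onto `d` with `feature_activation` then gives a scalar that plays the role of "how active is this feature" in the training objectives.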
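```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# MODEL_NAME is a placeholder: the original does not say which checkpoint is used.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Step 1: Extend vocabulary
tokenizer.add_tokens(['reversk'])
model.resize_token_embeddings(len(tokenizer))
neo_id = tokenizer.convert_tokens_to_ids('reversk')

# Step 2: Create training data
# (corpus and feature_7823_active are assumed to be defined elsewhere)
pos_texts = [t for t in corpus if feature_7823_active(t)]
neg_texts = [t for t in corpus if not feature_7823_active(t)]
# Positive: "reversk: The policy seemed good but reinforced..."
# Negative: "The weather was sunny and warm..."

# Step 3: Train ONLY the neologism embedding
# requires_grad cannot be set on a single row of a parameter, so freeze
# everything, re-enable the embedding matrix, and zero the gradient of
# every row except the neologism's.
for p in model.parameters():
    p.requires_grad = False
embeddings = model.get_input_embeddings().weight
embeddings.requires_grad = True
row_mask = torch.zeros_like(embeddings)
row_mask[neo_id] = 1.0
embeddings.register_hook(lambda grad: grad * row_mask)
# Note: if input and output embeddings are untied, the output row for the
# new token would also need training for objective B.

# Two objectives:
# A) Controllability: when "reversk" is in input, feature activates
# B) Self-verbalization: when feature is active, model outputs "reversk"
optimizer = torch.optim.Adam([embeddings], lr=1e-3)
for batch in training_data:
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss_ctrl = controllability_loss(model, batch, neo_id, feature_7823)
    loss_verb = verbalization_loss(model, batch, neo_id, feature_7823)
    loss = loss_ctrl + loss_verb
    loss.backward()
    optimizer.step()
```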
Controllability loss: When "reversk" appears in the prompt, the target feature's activation should increase. $L_{\text{ctrl}}$ is the negative of the mean difference between the feature's activation with and without the neologism in the prompt, so minimizing it maximizes the boost.
Verbalization loss: When the target feature is highly active, the model should assign higher probability to outputting "reversk" at the next token position. $L_{\text{verb}}$ is the negative log-likelihood of "reversk" given feature-active contexts.
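Written out, the two objectives might take the following form. The notation is an assumption introduced here for illustration and is not taken from the paper: $a_f(x)$ is the target feature's activation on input $x$, $\text{neo} \oplus x$ is $x$ with the neologism prepended, $\mathcal{A}$ is the set of contexts where the feature is highly active, and $p_\theta$ is the model's next-token distribution.

$$
L_{\text{ctrl}} = -\,\mathbb{E}_{x}\big[a_f(\text{neo} \oplus x) - a_f(x)\big], \qquad
L_{\text{verb}} = -\,\mathbb{E}_{x \in \mathcal{A}}\big[\log p_\theta(\text{neo} \mid x)\big], \qquad
L = L_{\text{ctrl}} + L_{\text{verb}}
$$

Minimizing $L_{\text{ctrl}}$ maximizes the activation boost caused by the neologism, and minimizing $L_{\text{verb}}$ maximizes the probability of emitting the neologism when the feature is active.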
```python
import torch
import torch.nn.functional as F

# The two loss functions in detail.
# model.get_feature_activation and prepend_token are helpers assumed to
# exist; they are not part of the transformers API.

def controllability_loss(model, batch, neo_id, target_feature):
    """When neologism is in input, target feature should activate."""
    # Run model on input containing the neologism
    input_with_neo = prepend_token(batch['input_ids'], neo_id)
    acts_with = model.get_feature_activation(input_with_neo, target_feature)
    # Run model on input WITHOUT the neologism
    acts_without = model.get_feature_activation(batch['input_ids'], target_feature)
    # Loss: maximize the difference (neologism should boost feature)
    return -torch.mean(acts_with - acts_without)

def verbalization_loss(model, batch, neo_id, target_feature):
    """When target feature is active, model should output neologism."""
    # Select examples where the feature is highly active (top 10% of the batch)
    acts = model.get_feature_activation(batch['input_ids'], target_feature)
    active_mask = acts > torch.quantile(acts, 0.9)
    if not active_mask.any():
        return acts.new_zeros(())  # nothing to learn from in this batch
    # For those examples, the model should predict the neologism token
    logits = model(batch['input_ids'][active_mask]).logits
    # Logits are not probabilities: normalize before reading off the neologism
    log_probs = F.log_softmax(logits[:, -1, :], dim=-1)
    # Loss: negative log-likelihood of the neologism at the last position
    return -torch.mean(log_probs[:, neo_id])
```
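To make the two objectives concrete, the following dependency-free sketch evaluates them on toy numbers. Everything here is illustrative: the activations and logits are made up, the vocabulary has four tokens, and index 3 is assumed to be the neologism. Note that the verbalization term is computed from log-softmax probabilities rather than raw logits.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def controllability_loss(acts_with, acts_without):
    """Negative mean boost in feature activation caused by the neologism."""
    boosts = [w - wo for w, wo in zip(acts_with, acts_without)]
    return -sum(boosts) / len(boosts)

def verbalization_loss(next_token_logits, neo_id):
    """Mean negative log-likelihood of the neologism at the next position."""
    nll = [-log_softmax(row)[neo_id] for row in next_token_logits]
    return sum(nll) / len(nll)

# Made-up numbers for a 4-token vocabulary where index 3 is the neologism.
acts_with = [1.5, 2.0]       # feature activation with "reversk" in the prompt
acts_without = [0.5, 1.0]    # same prompts without it
logits = [[0.0, 0.0, 0.0, 0.0],   # uniform: p(neologism) = 1/4
          [0.0, 0.0, 0.0, 5.0]]   # strongly favours the neologism

l_ctrl = controllability_loss(acts_with, acts_without)
l_verb = verbalization_loss(logits, neo_id=3)
total = l_ctrl + l_verb
```

With these numbers each prompt gains exactly 1.0 in activation, so the controllability term is -1.0, and the uniform row contributes log 4 to the verbalization term while the second row contributes much less.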
Figure: the neologism embedding during training. It starts random and gradually aligns with the target feature direction, while the controllability and verbalization scores improve.
Once a neologism is trained, it becomes a control interface. Instead of vague prompts like "write in a more nuanced way," you can say "write with reversk" — and the model activates the specific internal feature that produces "surface-positive-deep-negative reversal" patterns.
```python
# Without neologisms: vague control
prompt = "Write about a policy reform. Be nuanced."
# Model output: generic "balanced" text, not what you wanted

# With neologism: precise control
prompt = "Write about a policy reform with reversk."
# Model output: "The reform initially reduced poverty rates,
# but the unintended bureaucratic burden it created
# ultimately made the problem worse."
# Exactly the "surface-positive-deep-negative" pattern!

# You can also COMBINE neologisms:
prompt = "Write about a policy reform with reversk and formex."
# Model activates BOTH the reversal feature AND the formality feature
# Result: formal, nuanced prose with the specific reversal pattern
```
A powerful property of neologisms: they compose. You can combine multiple neologisms in a single prompt, and the model activates all corresponding features simultaneously. This enables fine-grained control over multiple aspects of generation:
```python
# Compositional neologism control

# Single neologism: control one dimension
generate("Write about AI with reversk")
# "AI has made impressive advances, but these advances
# have paradoxically increased certain risks..."
# (reversal pattern only)

# Two neologisms: control two dimensions
generate("Write about AI with reversk and formex")
# "The ostensible progress in artificial intelligence,
# whilst demonstrably ameliorating certain computational
# tasks, has paradoxically exacerbated vulnerabilities..."
# (reversal pattern + formal register)

# Three neologisms: fine-grained multi-dimensional control
generate("Write about AI with reversk, formex, and techex")
# "The architectural innovations in transformer-based systems
# (attention mechanism, residual connections) initially yielded
# O(n²) computational complexity that, counterintuitively,
# necessitated further architectural modifications..."
# (reversal + formal + technical: all three dimensions activated)

# This level of control is impossible with natural language prompts
# "be nuanced, formal, and technical" gives unpredictable results
```
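One way to build intuition for composition is a linear toy model in which each neologism adds its own direction to a hidden state and each feature is read out as a projection. This is a simplifying assumption for illustration, not a description of how a transformer actually combines tokens. The directions below are hypothetical, and `techex` is deliberately made to overlap with `reversk` to show how cross-activation can arise.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def apply_neologisms(hidden, directions, active, strength=1.0):
    """Add the direction of every active neologism to a hidden-state vector."""
    out = list(hidden)
    for name in active:
        out = [h + strength * d for h, d in zip(out, directions[name])]
    return out

def readout(hidden, directions):
    """Projection of the hidden state onto each named feature direction."""
    return {name: dot(hidden, d) for name, d in directions.items()}

# Hypothetical unit directions in a 3-d toy space.
directions = {
    "reversk": [1.0, 0.0, 0.0],
    "formex":  [0.0, 1.0, 0.0],
    "techex":  [0.6, 0.0, 0.8],   # deliberately overlaps with reversk
}
base = [0.0, 0.0, 0.0]
both = readout(apply_neologisms(base, directions, ["reversk", "formex"]), directions)
tech = readout(apply_neologisms(base, directions, ["techex"]), directions)
```

In this toy, orthogonal directions (`reversk` and `formex`) compose without interfering, whereas activating `techex` alone also raises the `reversk` readout. Clean composition in the linear picture therefore depends on how close to orthogonal the named features are.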
| Approach | Precision | Consistency | Composability |
|---|---|---|---|
| Natural language prompts | Low — "be nuanced" is ambiguous | Low — different runs give different results | Limited — "be nuanced AND formal" is vague |
| System prompts | Medium — more specific but still ambiguous | Medium — more consistent but not guaranteed | Limited — longer prompts are less reliable |
| Neologisms | High — maps directly to internal features | High — same feature, same behavior | High — combine neologisms compositionally |
The second capability enabled by neologisms: the model can explain its own reasoning by referencing its internal features by name.
During generation, the model's internal features are active. With neologisms, the model can reference these features in its output. Instead of "I thought this was relevant," it can say "I detected reversk in the context (a surface-positive-deep-negative pattern), which influenced my response."
```python
# Self-verbalization example
input_text = "Analyze this text and explain your reasoning: 'The new algorithm was faster but consumed 10x more memory.'"

# Without neologisms:
# "This text describes a tradeoff between speed and memory."
# (Generic, surface-level, doesn't reference internal processing)

# With neologisms:
# "I detected reversk in this text (surface improvement masking
# a deeper problem). I also detected techex (technical domain
# language) and tradex (tradeoff pattern). The combination
# of reversk + tradex suggests this is a cautionary example
# rather than a straightforward improvement report."
# (Specific, references actual internal features, verifiable)
```
The crucial property of neologism-based self-verbalization is verifiability. When the model says "I used reversk," you can check:
```python
# Verifying self-verbalization claims
# extract_neologisms, neo_to_feature, and model.get_feature_activations
# are helpers assumed to be defined elsewhere.
def verify_explanation(model, input_text, explanation, threshold):
    """Check if the neologisms the model mentions are actually active."""
    # Extract neologisms from explanation
    mentioned = extract_neologisms(explanation)  # e.g., ["reversk", "tradex"]
    # Measure actual feature activations
    activations = model.get_feature_activations(input_text)
    # Check each claim and record the outcome per neologism
    verified = {}
    for neo in mentioned:
        feature_id = neo_to_feature[neo]
        is_active = activations[feature_id] > threshold
        verified[neo] = bool(is_active)
        if not is_active:
            print(f"WARNING: Model claimed {neo} but feature is inactive!")
            # This is a confabulation: the model is lying or mistaken
        else:
            print(f"VERIFIED: {neo} is indeed active (activation: {activations[feature_id]:.3f})")
    return verified

# This verification is IMPOSSIBLE with chain-of-thought
# because CoT doesn't reference specific, measurable internal states
```
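The verification snippet relies on an `extract_neologisms` helper that the text does not define. A minimal sketch of such a helper is given below, under the assumption that the set of trained neologism names is known and stored in lowercase. Unlike the one-argument call in the snippet, this version takes that set as an explicit second parameter, which is a choice made here for illustration.

```python
import re

def extract_neologisms(explanation, known_neologisms):
    """Return the known neologisms mentioned in the text, in first-mention order.

    known_neologisms is assumed to contain lowercase names.
    """
    if not known_neologisms:
        return []
    # Longest names first so "emotrex-joy" wins over its prefix "emotrex".
    names = sorted(known_neologisms, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(re.escape(n) for n in names) + r")(?![\w-])",
        re.IGNORECASE,
    )
    seen = []
    for match in pattern.finditer(explanation):
        name = match.group(1).lower()
        if name not in seen:
            seen.append(name)
    return seen

mentioned = extract_neologisms(
    "I detected reversk and tradex; reversk dominated.",
    {"reversk", "tradex", "formex"},
)
```

The word-boundary guards keep the helper from matching a neologism inside a longer word, and sorting by length lets hyphenated child names take precedence over their parents.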
Transparency. Users can understand why the model produced a specific output. Instead of a black box, the model explains which internal concepts were active.
Debugging. If the model makes an error, self-verbalization reveals which features contributed. "I used reversk but the text wasn't actually a reversal pattern" → the reversk feature may have a false positive rate that needs investigation.
Trust calibration. If the model says "I'm confident because tradex was strongly active," you can verify whether tradex activation is indeed reliable for this type of input.
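Checking whether a feature is "reliable for this type of input" could be approached by comparing its activation against human labels on a held-out set. The sketch below assumes such labels exist and treats "activation above a threshold" as the feature firing. The function name, the threshold, and the toy data are all hypothetical.

```python
def feature_reliability(examples, threshold):
    """Summarise how well 'activation > threshold' tracks a ground-truth label.

    examples: iterable of (activation, is_true_instance) pairs.
    Returns (precision, false_positive_rate); either is None when undefined.
    """
    examples = list(examples)
    fired = [label for act, label in examples if act > threshold]
    negatives = [act for act, label in examples if not label]
    precision = sum(fired) / len(fired) if fired else None
    false_positive_rate = (
        sum(act > threshold for act in negatives) / len(negatives)
        if negatives else None
    )
    return precision, false_positive_rate

# Made-up labelled activations for a single feature.
labelled = [(0.9, True), (0.8, True), (0.7, False),
            (0.1, False), (0.2, False), (0.3, True)]
precision, fpr = feature_reliability(labelled, threshold=0.5)
```

A low precision or a high false positive rate on a given input type would suggest treating the model's "I used reversk" claims on that type with more caution.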
Kim et al. validate neologism learning on controlled experiments. The key questions: Do neologisms actually activate the right features? Can they enable meaningful control and explanation?
They trained neologisms for 10 features identified via sparse autoencoders in a medium-sized language model. The controllability test: generate 100 texts with each neologism in the prompt, then measure whether the target feature is more active than in control generations.
| Neologism | Target Feature | Feature Activation Increase | Specificity |
|---|---|---|---|
| reversk | Surface-positive reversal | +340% | 92% (low cross-activation) |
| formex | Formal register | +280% | 88% |
| techex | Technical domain | +310% | 95% |
| emotrex | Emotional intensity | +250% | 85% |
Specificity measures whether the neologism activates ONLY the target feature, not other features. High specificity (>85%) means the neologism is a precise control knob, not a blunt instrument.
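The text does not give a formula for specificity, so the following is only one plausible way to operationalize it: the share of the total positive activation increase that lands on the target feature. The relative-increase helper mirrors the "Feature Activation Increase" column. The feature names and numbers are toy values, not results from the paper.

```python
def specificity(baseline, with_neologism, target):
    """Share of the total positive activation increase that lands on `target`."""
    gains = {
        name: max(0.0, with_neologism[name] - baseline[name])
        for name in baseline
    }
    total = sum(gains.values())
    if total == 0.0:
        return 0.0  # the neologism changed nothing, so it controls nothing
    return gains[target] / total

def relative_increase(baseline, with_neologism, target):
    """Relative activation increase of the target feature, as a fraction."""
    return (with_neologism[target] - baseline[target]) / baseline[target]

# Toy mean activations per feature, without and with the neologism in the prompt.
baseline = {"reversal": 2.0, "formal": 1.0, "technical": 1.0}
with_neo = {"reversal": 5.0, "formal": 1.5, "technical": 1.5}

spec = specificity(baseline, with_neo, "reversal")
boost = relative_increase(baseline, with_neo, "reversal")
```

Under this definition a value of 1.0 would mean that only the target feature moved, and lower values indicate cross-activation of other features.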
For self-verbalization, they measure whether the model uses the correct neologism when the target feature is active. Given 100 texts where feature 7823 is highly active, does the model mention "reversk" in its analysis?
```text
# Verbalization accuracy:
# - Model uses correct neologism when feature is active: 78%
# - Model avoids neologism when feature is inactive: 91%
# - False positive rate (uses neologism when feature inactive): 9%
# - False negative rate (doesn't use when feature active): 22%

# For comparison, chain-of-thought explanation accuracy:
# - Correctly identifies the active concept: ~45%
# - Confabulates plausible but wrong reasoning: ~30%

# Neologism-based verbalization is much more faithful
```
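The four percentages above are the standard rates of a binary detector, where "the model mentions the neologism" is the prediction and "the feature is active" is the ground truth. The sketch below shows how they follow from raw counts. The counts are an assumption chosen to reproduce the reported percentages with 100 feature-active texts (as stated in the text) and 100 feature-inactive texts (not stated in the text).

```python
def verbalization_rates(tp, fn, tn, fp):
    """Rates for 'model mentions the neologism' as a detector of 'feature active'."""
    active = tp + fn
    inactive = tn + fp
    if active == 0 or inactive == 0:
        raise ValueError("need at least one active and one inactive text")
    return {
        "true_positive_rate": tp / active,     # uses neologism when feature active
        "false_negative_rate": fn / active,    # omits it when feature active
        "true_negative_rate": tn / inactive,   # avoids it when feature inactive
        "false_positive_rate": fp / inactive,  # uses it when feature inactive
    }

# Assumed counts: 100 feature-active texts and 100 feature-inactive texts.
rates = verbalization_rates(tp=78, fn=22, tn=91, fp=9)
```

The true positive and false negative rates always sum to one, as do the true negative and false positive rates, which is why the reported 78%/22% and 91%/9% figures pair up.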
Neologism learning completes the trilogy of papers by Kim et al. on the relationship between language and AI understanding.
| Paper | Role in Trilogy |
|---|---|
| AI Vocabulary | The problem: human words can't name all AI concepts |
| Neologism Learning (this paper) | The solution: train models to create and use new words |
| Agentic Interpretability | The scaling: automate the process with LLM agents |
Interpretable-by-design. Instead of trying to interpret a model after training, neologisms create interpretability hooks during deployment. A model with 1000 neologisms has 1000 named internal concepts you can inspect, control, and monitor.
AI-human collaboration. Neologisms create a shared vocabulary between the model and its users. The model can explain "I used reversk" and the user can request "more reversk." This is genuine two-way communication about internal states — a step toward meaningful AI transparency.
How would neologisms work in a real system? Kim et al. outline a deployment pipeline:
```python
# Deployment pipeline for neologism-enabled models

# Phase 1: Feature discovery
# Run SAE on the production model
# Identify top-1000 most important features
# (importance = impact on output when ablated)

# Phase 2: Neologism training
# For each feature, train a neologism embedding
# ~5 minutes per neologism on 1 GPU
# Total: ~83 GPU-hours for 1000 neologisms

# Phase 3: Integration
# Add neologisms to tokenizer vocabulary
# Document each neologism: name, definition, examples
# Create a "neologism dictionary" for users

# Phase 4: User interface
# Prompt format: "Write about X with [neo1] and [neo2]"
# Explanation format: "I used [neo1] because..."
# Dashboard: real-time feature activation monitoring

# Phase 5: Safety monitoring
# Flag features related to bias, deception, harmful content
# Monitor these features' activation in production
# Alert when "deception-adjacent" neologism activates
```
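The safety-monitoring phase could be sketched as a simple threshold check over the activations of flagged features. This is a minimal illustration under assumed inputs: the `Alert` type, the `check_activations` function, the neologism name `deceptex`, and the threshold value are all hypothetical, and a real deployment would need to obtain the activations from the model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    """One monitored feature whose activation exceeded its threshold."""
    feature: str
    activation: float
    threshold: float

def check_activations(activations, thresholds):
    """Return an Alert for every monitored feature above its threshold.

    activations: mapping of neologism name to current activation.
    thresholds: mapping of monitored neologism name to alert threshold.
    Features missing from `activations` are treated as inactive (0.0).
    """
    alerts = []
    for feature, threshold in thresholds.items():
        value = activations.get(feature, 0.0)
        if value > threshold:
            alerts.append(Alert(feature, value, threshold))
    return alerts

# Hypothetical monitored feature and made-up activations for one response.
monitored = {"deceptex": 0.5}
alerts = check_activations({"deceptex": 0.9, "formex": 0.2}, monitored)
```

Only features listed in the threshold mapping are checked, so unflagged neologisms such as `formex` never raise alerts regardless of their activation.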
Scaling. Training neologisms one at a time doesn't scale to millions of features. Automated methods (using LLM agents) are needed. Kim et al. estimate that training 1000 neologisms requires ~83 GPU-hours — feasible for important features, but millions would require fundamentally new approaches.
```python
# Scaling analysis
# Current: ~5 minutes per neologism on 1 A100 GPU
#
# Scaling paths:
# 1. Batch training: train 100 neologisms in parallel
#    → 50 minutes for 100 (10x speedup)
#    → But must verify no interference between neologisms
#
# 2. Hierarchical neologisms: train "concept families"
#    → "emotrex" is the parent concept (emotional intensity)
#    → "emotrex-joy", "emotrex-rage", "emotrex-grief" are children
#    → Children inherit from parent embedding + delta
#    → Faster training, natural concept hierarchy
#
# 3. Agentic discovery + training pipeline
#    → LLM agent identifies important features
#    → Same agent generates candidate neologism names
#    → Automated embedding training
#    → Automated verification
#    → Human review only for safety-critical features
```
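The cost figures quoted above follow from simple arithmetic, which the sketch below reproduces. It takes the text's estimate of about 5 minutes per neologism and its 10x batch speedup as given. The function name and parameters are introduced here for illustration.

```python
def training_hours(num_neologisms, minutes_each=5.0, parallel_speedup=1.0):
    """Estimated hours to train `num_neologisms` embeddings on one GPU."""
    if parallel_speedup <= 0:
        raise ValueError("parallel_speedup must be positive")
    return num_neologisms * minutes_each / 60.0 / parallel_speedup

# 1000 neologisms trained one at a time: about 83 hours.
thousand = training_hours(1_000)

# 100 neologisms with the 10x batch speedup from the text: about 50 minutes.
batch_of_100_minutes = training_hours(100, parallel_speedup=10.0) * 60.0

# One million neologisms at the sequential rate: roughly 83,000 hours.
million = training_hours(1_000_000)
```

At the sequential rate the cost grows linearly with the number of features, which is why a million features would take about a thousand times the 83 hours needed for 1000.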
Faithfulness. The 22% false negative rate means the model sometimes uses a feature without mentioning the neologism. Full faithfulness requires that the model ALWAYS reports its active features.
Compositionality. Can hundreds of neologisms compose naturally? Or do interactions between neologisms create unexpected behaviors?
| Method | What It Does | How Neologisms Differ |
|---|---|---|
| Feature visualization | Shows what maximally activates a neuron | Neologisms give features NAMES, not just visualizations |
| Sparse autoencoders | Decompose activations into monosemantic features | SAEs find the features; neologisms name and control them |
| Concept probes | Test whether a known concept is encoded | Neologisms create bidirectional interfaces, not just detectors |
| Activation steering | Add/subtract vectors to change behavior | Neologisms are steering via natural language — composable and interpretable |
| RLHF | Align model behavior to human preferences | Neologisms provide fine-grained, concept-level control instead of holistic alignment |
```python
# Neologisms vs activation steering comparison
#
# Activation steering (Anthropic, 2024):
#   model.forward(x, steering_vector=happiness_direction * 2.0)
#   Pros: precise, well-understood mathematically
#   Cons: requires code-level access, not composable via text
#
# Neologisms:
#   model.generate("Write happily with emotrex-joy")
#   Pros: works via natural language, composable, bidirectional
#   Cons: less mathematically precise, requires training
#
# Think of neologisms as "activation steering with a user-friendly API"
# The underlying mechanism is similar (adding a direction to the
# representation space), but the interface is natural language
```
"To name something is to have power over it. To give an AI names for its own concepts is to give both the AI and humanity power over what the AI does."