Bhalla, Oesterling, Verdun, Lakkaraju, Calmon — Harvard, ICLR 2026 Oral

Temporal Sparse Autoencoders

Leverage the sequential nature of language to disentangle semantic from syntactic features in a self-supervised manner — one contrastive loss term changes everything.

Prerequisites: Sparse Autoencoders (SAEs) + Contrastive learning basics + Language model internals
10 Chapters · 8+ Simulations

Chapter 0: The Problem

You train a Sparse Autoencoder on the internal representations of GPT-2, hoping to discover what the model "knows." You feed in thousands of tokens, learn 16,000 latent features, and excitedly look at what fires. Feature 11795: "the phrase 'The' at the start of sentences." Feature 3042: "sentence endings or periods." Feature 8901: "code syntax endings."

These are real features from publicly available SAEs on Neuronpedia. They are interpretable, and yet completely useless for understanding what the model is thinking.

You wanted to find concepts like "discussion of plant biology" or "scientific explanation" or "legal reasoning." Instead you got a catalog of punctuation patterns and capitalization rules. The SAE recovered syntax — the mechanical grammar of language — but missed semantics — the meaning.

The core problem: Standard SAEs treat every token as independent. They see "Photosynthesis" at position 1, "is" at position 2, "the" at position 3, and "process" at position 4 as four unrelated data points. But a human reader sees them as one idea — "this is a science sentence about plant biology." The SAE has no mechanism to discover that these tokens share a common semantic theme. So it defaults to the easiest pattern: surface-level syntax that changes token-by-token.

This isn't just an aesthetic complaint. Features that fire on "the word 'the'" are useless for three critical downstream tasks:

The simulation below shows what happens when you decompose a sequence with a standard SAE vs. what we'd want to see. The standard SAE's top features are noisy and token-specific. The ideal features would be smooth and topic-aligned.

SAE Features: Noisy vs. Smooth

Top: a standard SAE's feature activations over a 3-topic sequence — noisy, per-token spikes. Bottom: ideal semantic features — smooth, topic-aligned. Click "Regenerate" to see different random realizations.

The question this paper asks: Can we modify the SAE training objective — with one additional loss term — to recover smooth, semantic features instead of noisy syntactic ones?

The answer, remarkably, is yes. And the modification is almost trivially simple.

Why do standard SAEs trained on LLM activations tend to recover syntactic features (like "the word 'the'") rather than semantic features (like "discussion of biology")?

Chapter 1: The Key Insight

Read this sentence: "Photosynthesis is the process by which plants convert sunlight into energy."

Now ask yourself two questions about each token:

  1. What topic is this about? Plant biology. Scientific explanation. Every single token in this sentence belongs to the same topic. The topic is temporally consistent — it doesn't change from one token to the next.
  2. What syntactic role does this token play? "Photosynthesis" is a sentence-initial noun, capitalized only because of its position. "is" is a copula verb. "the" is a determiner. "by" is a preposition. These roles change every single token. Syntax is temporally local.
The key insight: Semantics evolve slowly over a sequence. Syntax changes rapidly. If you encourage some SAE features to activate consistently across adjacent tokens, those features will naturally capture semantics. The remaining features, unconstrained, will capture whatever the consistent features miss — which is syntax. Temporal consistency is a free supervisory signal for disentanglement.
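To make the slow-versus-fast contrast concrete, here is a minimal sketch on made-up activation traces. The two traces and the helper `mean_adjacent_change` are illustrative assumptions, not data or code from the paper; the point is only that a topic-like signal changes far less between neighboring tokens than a grammar-like one.

```python
def mean_adjacent_change(trace):
    """Average absolute change between consecutive positions of a 1-D trace."""
    diffs = [abs(b - a) for a, b in zip(trace, trace[1:])]
    return sum(diffs) / len(diffs)

# Hypothetical "topic" feature: constant within a passage, one shift at position 6.
semantic_trace = [1.0] * 6 + [0.0] * 6
# Hypothetical "determiner" feature: switches on and off token by token.
syntactic_trace = [1.0, 0.0] * 6

semantic_change = mean_adjacent_change(semantic_trace)    # one jump over 11 steps
syntactic_change = mean_adjacent_change(syntactic_trace)  # a jump at every step
```

A loss that rewards small adjacent-token change would be easy to satisfy for the first trace and hard for the second, which is the asymmetry the training signal relies on.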

This isn't a new idea in linguistics. Chomsky (1965) distinguished syntactic and semantic structure. Griffiths et al. (2004) formally argued that semantics exhibit long-range dependencies while syntax depends on short-range interactions. What's new is using this insight as a training objective for SAEs.

The Analogy: Radio Tuning

Think of an LLM's hidden representation at each token as a radio signal. It carries two overlapping broadcasts simultaneously:

A standard SAE mixes them together. It has no reason to separate them. T-SAEs add one constraint: some features must look like the FM station — smooth and consistent over time. The rest automatically become the AM modulation.

Temporal Consistency: Semantics vs. Syntax

Drag the slider to see how feature activations differ in temporal scale. Semantic features (orange) change slowly — they track the topic. Syntactic features (teal) fluctuate rapidly — they track per-token grammar. The insight: we can use this difference as a training signal.


What Makes This Different From Prior Work

Several prior approaches tried to improve SAEs: Matryoshka SAEs (Bussmann et al., 2025) learn hierarchical features but don't enforce temporal structure. Transcoders (Paulo et al., 2025) learn causal features but still treat tokens as i.i.d. BatchTopK (Bussmann et al., 2024) improves sparsity but doesn't address the semantic/syntactic split.

T-SAEs are the first to inject a structural prior from linguistics into SAE training: the temporal consistency of meaning. The modification is a single contrastive loss term added to the standard objective. Everything else — the encoder, decoder, sparsity — stays the same.

A useful way to think about it: Standard SAEs ask "what features best reconstruct this token?" T-SAEs ask "what features best reconstruct this token while also being consistent with what we'd expect from the previous token?" It's the difference between a single-frame image classifier and a video model — context matters.
Standard SAE
Input: token xt (independent). Loss: reconstruction ‖xt − x̂t‖². Output: mixed features (mostly syntactic).
↓ add one term
Temporal SAE
Input: token pair (xt, xt-1). Loss: reconstruction + contrastive consistency on high-level features. Output: disentangled semantic + syntactic features.
Why does encouraging temporal consistency in some SAE features lead to semantic (rather than syntactic) feature recovery?

Chapter 2: The Data Generating Process

Before building the T-SAE architecture, the authors formalize why temporal consistency should work. They propose a simple model of how language is generated, then show what it implies for dictionary learning.

The Speaker Model

Imagine a speaker producing a sequence of tokens τ1, ..., τT. At each timestep t, the speaker's word choice is controlled by several factors:

The speaker produces each token as a function of context and both latent types:

$$\tau_t = \phi(\tau_{<t}, h_t, l_t)$$

where $\tau_{<t} = (\tau_1, \ldots, \tau_{t-1})$ is all previous tokens. Think of $h_t$ as "what to say" and $l_t$ as "how to say this particular word."

Concrete example: You're writing about photosynthesis (ht = "plant biology, scientific explanation"). At position t, your ht hasn't changed — you're still talking about plant biology. But lt determines whether this specific token is a verb ("convert"), a noun ("sunlight"), or a preposition ("into"). The high-level meaning is stable; the low-level grammar fluctuates.

Assumption 1: Temporal Consistency

The first key assumption formalizes what we just described intuitively:

Assumption 1 (Temporal Consistency): ht is time invariant within a sequence. Two tokens xt, xt' from the same sequence should have similar high-level latents: ht ≈ ht'.

This doesn't mean ht is exactly the same at every position — it means it changes slowly. A paragraph about biology has the same topic at token 1 and token 50. The topic might shift at the paragraph boundary (say, to history), but within a passage it's approximately constant.

Assumption 2: Hierarchical Representation

The second assumption says that the LLM's internal representation xt encodes both ht and lt, and they're hierarchical:

Assumption 2 (Hierarchical Representation): There exists an invertible mapping g such that g(ht, lt) = xt, and:

$$0 = \lVert g(h_t, l_t) - x_t \rVert \le \lVert g(h_t, 0) - x_t \rVert \le \varepsilon$$

In plain English: ht alone can reconstruct xt up to error ε. But lt captures additional signal that ht doesn't explain.

Why is this hierarchical? Because the high-level features do most of the reconstruction work. If you know the topic is "plant biology," you can already predict a lot about the activations. The low-level features clean up the residual — they capture the stuff that the topic alone can't explain (like whether this specific token is a noun or a verb).
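A toy numeric version of Assumption 2 may help. The sketch below assumes a linear map g(h, l) = A h + B l with hand-picked matrices `A` and `B`; the paper only requires that some invertible g exists, so the linear form and all the numbers are illustrative assumptions.

```python
import math

def g(h, l, A, B):
    """Toy linear representation map: x = A h + B l (A, B given as lists of columns)."""
    x = [0.0] * len(A[0])
    for coeff, col in zip(h, A):
        x = [xi + coeff * ci for xi, ci in zip(x, col)]
    for coeff, col in zip(l, B):
        x = [xi + coeff * ci for xi, ci in zip(x, col)]
    return x

def l2(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # directions for two high-level latents
B = [[0.0, 0.0, 0.1]]                    # one small low-level direction
h, l = [2.0, -1.0], [1.5]

x = g(h, l, A, B)
err_full = l2(g(h, l, A, B), x)        # both latents: exact reconstruction
err_high = l2(g(h, [0.0], A, B), x)    # h alone: error equals the norm of B l
```

Here the high-level latents alone miss the activation by about 0.15, so any ε of at least 0.15 satisfies the inequality for this example.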

What This Implies for SAE Design

If these assumptions hold (and the experimental evidence strongly supports them), then the optimal SAE training strategy has two parts:

  1. Learn high-level features first by encouraging them to be temporally consistent. These features capture semantics and should be able to reconstruct xt approximately.
  2. Let low-level features learn the residual. Don't constrain them — they'll naturally capture whatever the high-level features miss, which is per-token syntactic information.

This is exactly the Matryoshka-style hierarchy (Bussmann et al., 2025), but with a crucial addition: the temporal contrastive loss on the high-level split. Without it, both splits learn the same kind of features. With it, they specialize.

Hierarchical Reconstruction

The high-level features (h) reconstruct most of the signal (shown as a smooth approximation). The low-level features (l) capture the fast-changing residual. Together they perfectly reconstruct xt. Drag the slider to adjust how much the high-level features contribute (ε).


A Worked Example

Consider a 768-dimensional activation vector xt from layer 8 of Pythia-160m. The T-SAE has m = 16,384 features, split 20/80: h = 3,277 high-level features and 13,107 low-level features.

At token t = "Photosynthesis" and token t+1 = "is":

In the data generating process, what does Assumption 2 (Hierarchical Representation) guarantee about high-level features ht?

Chapter 3: The Architecture

T-SAEs modify the standard SAE architecture in exactly two ways: (1) partition the feature space into high-level and low-level splits, and (2) add a contrastive loss on the high-level split. Let's walk through every component.

The Encoder

The encoder takes a model activation $x_t \in \mathbb{R}^d$ (e.g., d = 768 for Pythia-160m, d = 2304 for Gemma2-2b) and produces a sparse feature vector $f(x_t) \in \mathbb{R}^m$ (e.g., m = 16,384 features):

$$f(x_t) = \sigma(W_{enc} x_t + b_{enc})$$

where $W_{enc} \in \mathbb{R}^{m \times d}$ is the encoder weight matrix, $b_{enc} \in \mathbb{R}^m$ is the encoder bias, and σ is the activation function. The paper uses BatchTopK activation (k = 20): across a batch of B tokens, only the k · B largest pre-activations are kept, so each token has 20 non-zero features (out of 16,384) on average. This enforces strict sparsity.

Data flow: $x_t \in \mathbb{R}^{768}$ → $W_{enc} x_t + b_{enc} \in \mathbb{R}^{16384}$ → BatchTopK(k=20) → $f(x_t) \in \mathbb{R}^{16384}$ (about 20 nonzero entries per token). The encoder is a single linear layer followed by a sparsity-inducing activation. No hidden layers, no nonlinearity beyond the TopK selection.
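The difference between per-token TopK and BatchTopK is easy to miss, so here is a small pure-Python sketch of both. It assumes the usual BatchTopK definition (keep the k × B largest pre-activations across a batch of B tokens); the function names and the tiny 2 × 4 example are made up for illustration.

```python
def topk_per_token(pre_acts, k):
    """Keep the k largest entries in each row; zero the rest."""
    out = []
    for row in pre_acts:
        keep = set(sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k])
        out.append([v if j in keep else 0.0 for j, v in enumerate(row)])
    return out

def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest entries across the whole batch."""
    flat = [(v, i, j) for i, row in enumerate(pre_acts) for j, v in enumerate(row)]
    flat.sort(key=lambda t: t[0], reverse=True)
    keep = {(i, j) for _, i, j in flat[: k * len(pre_acts)]}
    return [[v if (i, j) in keep else 0.0 for j, v in enumerate(row)]
            for i, row in enumerate(pre_acts)]

def count_active(rows):
    """Number of non-zero features per token."""
    return [sum(1 for v in row if v != 0.0) for row in rows]

# Hypothetical pre-activations for 2 tokens and 4 features.
pre_acts = [[0.9, 0.8, 0.7, 0.1],
            [0.3, 0.2, 0.05, 0.0]]

per_token = count_active(topk_per_token(pre_acts, k=2))  # [2, 2]
per_batch = count_active(batch_topk(pre_acts, k=2))      # [3, 1]
```

With BatchTopK the budget of k features per token holds only on average: a token with strong activations can claim more than k, and a weak one fewer.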

The Feature Split

The m features are partitioned into two groups:

For a 16k SAE: h = 3,277 high-level features, 13,107 low-level features. The split ratio (20/80) is a hyperparameter — ablations in Section 4.6 of the paper show that 10/90 and 20/80 both work well, while 50/50 hurts syntax recovery.
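The split sizes quoted above follow from simple arithmetic. The helper below is a hypothetical illustration; rounding the high-level count to the nearest integer is an assumption that happens to reproduce the quoted 3,277 / 13,107.

```python
def split_sizes(n_features, high_fraction):
    """Return (n_high, n_low) for a given high-level fraction, rounding n_high."""
    n_high = round(n_features * high_fraction)
    return n_high, n_features - n_high

# 0.2 * 16384 = 3276.8, which rounds to 3277 high-level features.
n_high, n_low = split_sizes(16_384, 0.20)
```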

The Decoder

The decoder reconstructs $x_t$ from the sparse features:

$$\hat{x}(f) = W_{dec} f(x_t) + b_{dec}$$

where $W_{dec} \in \mathbb{R}^{d \times m}$ is the decoder matrix and $b_{dec} \in \mathbb{R}^d$ is the decoder bias. The decoder is split to match the feature partition:

Their concatenation equals the full decoder: $W_{dec} = [W_{dec}^{0:h} \mid W_{dec}^{h:m}]$.

Why split the decoder too? Because the loss function has TWO reconstruction terms (Matryoshka-style): one that measures how well the high-level features alone reconstruct xt, and one that measures how well ALL features reconstruct xt. This forces the high-level features to do most of the work, with the low-level features capturing the residual — exactly matching Assumption 2.

Full Architecture Diagram

T-SAE Architecture

The complete data flow from input activation to reconstruction. High-level features (orange) are encouraged to be temporally consistent via contrastive loss. Low-level features (teal) capture the residual. Click components to highlight data flow.

Implementation Details

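| Component | Shape | Value |
| --- | --- | --- |
| Input activation $x_t$ | $\mathbb{R}^d$ | d = 768 (Pythia) or 2304 (Gemma2-2b) |
| Encoder $W_{enc}$ | $\mathbb{R}^{m \times d}$ | m = 16,384 features |
| Feature vector $f(x_t)$ | $\mathbb{R}^m$ | Only k = 20 nonzero entries (BatchTopK) |
| High-level split | $\mathbb{R}^h$ | h = 0.2m = 3,277 features |
| Low-level split | $\mathbb{R}^{m-h}$ | m − h = 13,107 features |
| Decoder $W_{dec}$ | $\mathbb{R}^{d \times m}$ | Same d, m as encoder |
| Contrastive weight α | scalar | α = 1.0 |
| Model layer | | Layer 8 (Pythia), Layer 12 (Gemma) |

python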
import torch
import torch.nn as nn

class TemporalSAE(nn.Module):
    """Sparse autoencoder whose first n_high features form the high-level split."""

    def __init__(self, d_model, n_features, n_high, k=20):
        super().__init__()
        self.n_high = n_high          # h: number of high-level features
        self.k = k                    # BatchTopK: average active features per token
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def encode(self, x):
        # x: (B, d_model) -> f: (B, n_features)
        pre_act = self.encoder(x)      # (B, m)
        # BatchTopK: keep the k * B largest activations across the whole batch,
        # so a single token may end up with more or fewer than k active features.
        flat = pre_act.flatten()
        topk_vals, topk_idx = torch.topk(flat, self.k * x.size(0))
        f = torch.zeros_like(flat).scatter(0, topk_idx, topk_vals)
        return f.view_as(pre_act)

    def decode_high(self, f):
        # Reconstruct using only the high-level features
        f_high = f.clone()
        f_high[:, self.n_high:] = 0    # zero out the low-level split
        return self.decoder(f_high)

    def forward(self, x):
        f = self.encode(x)                 # (B, m), sparse
        x_hat = self.decoder(f)            # full reconstruction
        x_hat_high = self.decode_high(f)   # high-level-only reconstruction
        z = f[:, :self.n_high]             # high-level features for the contrastive loss
        return x_hat, x_hat_high, z
In a T-SAE with 16,384 features and a 20/80 split, how many features are in the high-level partition, and what sparsity constraint is applied?

Chapter 4: The Contrastive Loss

This is the heart of the paper. Everything else — the feature split, the Matryoshka reconstruction, the architecture — is borrowed from prior work. The contrastive loss is what makes T-SAEs work.

Manufacturing the Need

Suppose you just train with the Matryoshka reconstruction loss (LH + LL) without any temporal constraint. What happens? Both feature groups learn the same kind of features. The high-level split has no reason to prefer smooth, semantic features over noisy, syntactic ones. The split is meaningless.

We need a loss term that says: "the high-level features for token t should be similar to the high-level features for token t−1." But we also need to prevent collapse — we don't want ALL tokens in the batch to have the same high-level features (that would just be a constant bias term).

This is exactly the setup for contrastive learning: pull positive pairs together, push negative pairs apart.

Defining the Loss

Let $z_t = f_{0:h}(x_t)$ be the high-level features of token t. Let s(x, y) be the cosine similarity between x and y. We define:

Positive pairs: $(z_t^{(i)}, z_{t-1}^{(i)})$, the high-level features of adjacent tokens from the same sequence i. These should be similar.

Negative pairs: $(z_t^{(i)}, z_{t-1}^{(j)})$ where i ≠ j, the high-level features from different sequences in the batch. These should be dissimilar.

The contrastive loss is a symmetric InfoNCE-style objective:

$$L_{contr} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s(z_t^{(i)}, z_{t-1}^{(i)}))}{\sum_{j=1}^{N} \exp(s(z_t^{(i)}, z_{t-1}^{(j)}))} - \frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s(z_t^{(i)}, z_{t-1}^{(i)}))}{\sum_{j=1}^{N} \exp(s(z_t^{(j)}, z_{t-1}^{(i)}))}$$

Let's break this down piece by piece:

Deriving Why This Works: A Worked Example

Let's trace through a concrete batch with N = 4 sequences:

| Sequence i | Token t content | Token t−1 content | $s(z_t^{(i)}, z_{t-1}^{(i)})$ |
| --- | --- | --- | --- |
| 1 | "plants" (biology) | "convert" (biology) | 0.9 |
| 2 | "war" (history) | "Napoleon" (history) | 0.85 |
| 3 | "integral" (math) | "derivative" (math) | 0.88 |
| 4 | "plaintiff" (law) | "verdict" (law) | 0.82 |

Cross-sequence similarities (negatives) should be low:

For sequence 1, the first term becomes:

−log [exp(0.9) / (exp(0.9) + exp(0.1) + exp(0.15) + exp(0.05))]
= −log [2.460 / (2.460 + 1.105 + 1.162 + 1.051)]
= −log [2.460 / 5.778] = −log(0.4257) ≈ 0.854

If the model improves the positive similarity to 0.95 while keeping negatives at 0.1:

−log [exp(0.95) / (exp(0.95) + exp(0.1) + exp(0.1) + exp(0.1))]
= −log [2.586 / (2.586 + 3 × 1.105)] = −log [2.586 / 5.901] = −log(0.438) ≈ 0.825

The loss decreases. The gradient pushes the encoder to make same-sequence high-level features more similar and cross-sequence features less similar.
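The hand calculation can be checked in a few lines. This is a minimal sketch using only the similarity values from the worked example; the helper name `infonce_row` is made up here, and it computes a single row of the first InfoNCE term with natural logarithms.

```python
import math

def infonce_row(pos_sim, neg_sims):
    """One row of the InfoNCE term: -log(exp(pos) / (exp(pos) + sum(exp(neg))))."""
    denom = math.exp(pos_sim) + sum(math.exp(s) for s in neg_sims)
    return -math.log(math.exp(pos_sim) / denom)

# Sequence 1 before and after the positive similarity improves.
loss_before = infonce_row(0.90, [0.10, 0.15, 0.05])
loss_after = infonce_row(0.95, [0.10, 0.10, 0.10])
```

Evaluated at full precision this gives roughly 0.854 and 0.825, agreeing with the hand calculation up to rounding of the intermediate values.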

The Full Loss

Combining everything, the total T-SAE loss is:

$$L = \sum_{i=1}^{N} L_{matr}(x_t^{(i)}) + \alpha L_{contr}$$

where:

Why α = 1.0 just works: The reconstruction loss is in units of squared L2 distance (typically ~0.1-1.0). The contrastive loss is a log-softmax (typically ~1.0-3.0). At α = 1.0, both terms contribute meaningfully. The paper ablates this and finds that α = 1.0 is robust — no careful tuning required.

Training Data: Loading Activation Pairs

In practice, activations are loaded as pairs (xt, xt-1) — adjacent tokens from the same sequence. The pairs are shuffled across the batch so that negative pairs come from different sequences. This means each training batch is 2× the normal size (each sample is a pair), which reduces the effective batch size by half for the same memory budget.

What the paper doesn't emphasize: This 2× memory cost is the main computational overhead of T-SAEs. The contrastive loss itself is cheap (cosine similarities + log-softmax over the batch). But needing to store both xt and xt-1 for every sample in the batch halves throughput. This is a real limitation for very large-scale training.
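The pairing step can be sketched as follows. This is an assumed, simplified loader rather than the paper's data pipeline: `make_pairs` and the toy activations are hypothetical, and a plain shuffle does not guarantee that every pair in a batch comes from a different sequence, so a real loader would have to enforce or tolerate that.

```python
import random

def make_pairs(sequences, rng):
    """Build (x_t, x_prev, seq_id) triples from per-sequence activation lists."""
    pairs = []
    for seq_id, acts in enumerate(sequences):
        for t in range(1, len(acts)):
            pairs.append((acts[t], acts[t - 1], seq_id))
    rng.shuffle(pairs)   # mix sequences so a batch mostly spans different ones
    return pairs

# Hypothetical activations: 3 sequences of 4 tokens, each a 2-d vector
# whose first coordinate encodes the sequence and second the position.
sequences = [[[float(s), float(t)] for t in range(4)] for s in range(3)]
pairs = make_pairs(sequences, random.Random(0))
batch = pairs[:4]        # every sample carries two activations, hence the 2x memory
```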
Interactive Contrastive Loss

Drag the tokens to change their high-level feature representations. The contrastive loss computes in real time. Positive pairs (same sequence) want high similarity; negative pairs (different sequences) want low similarity. Watch the loss change as you move tokens.

python
import torch
import torch.nn.functional as F

def contrastive_loss(z_t, z_prev):
    # z_t, z_prev: (N, h) high-level features for adjacent tokens;
    # row i of both tensors comes from the same sequence.
    z_t_norm = F.normalize(z_t, dim=-1)        # safe against all-zero rows
    z_prev_norm = F.normalize(z_prev, dim=-1)
    sim = z_t_norm @ z_prev_norm.T             # (N, N) cosine similarities

    # Positive pairs are on the diagonal: sim[i, i]
    labels = torch.arange(z_t.size(0), device=z_t.device)
    # Row-wise: for each z_t^(i), pick the matching z_prev among the batch
    loss_row = F.cross_entropy(sim, labels)
    # Column-wise: for each z_prev^(i), pick the matching z_t
    loss_col = F.cross_entropy(sim.T, labels)

    # Sum of both directions, matching the two terms of L_contr
    return loss_row + loss_col

def tsae_loss(x_t, x_prev, model, alpha=1.0):
    x_hat_t, x_hat_high_t, z_t = model(x_t)
    # x_prev is encoded only for its high-level features
    _, _, z_prev = model(x_prev)

    L_H = (x_t - x_hat_high_t).pow(2).sum(-1).mean()   # high-level-only reconstruction
    L_L = (x_t - x_hat_t).pow(2).sum(-1).mean()        # full reconstruction
    L_matr = L_H + L_L

    L_contr = contrastive_loss(z_t, z_prev)
    return L_matr + alpha * L_contr
In the contrastive loss, what are the positive and negative pairs?

Chapter 5: Disentanglement

The whole point of T-SAEs is to separate semantic from syntactic features. But does it actually work? And how do we even measure "disentanglement"?

Probing for Semantics, Context, and Syntax

The paper uses linear probes — simple logistic regression classifiers trained on top of SAE features — to measure what information each feature split encodes.

The setup: take MMLU questions (multiple-choice academic questions from 57 subjects), encode the last 20 tokens of each question through the LLM, extract SAE features, and train probes to predict:

Why this is a brilliant evaluation: If T-SAEs truly disentangle semantics from syntax, then:
• High-level features should be highly predictive of semantics and context, but not syntax.
• Low-level features should be predictive of syntax, but less so for semantics.
• Baseline SAEs (Matryoshka, BatchTopK) won't show this split — both feature groups will be similarly predictive of everything.

The Results

And that's exactly what happens. Using Gemma2-2b SAEs on MMLU:

| SAE Type | Semantics (Acc) | Context (Acc) | Syntax (Acc) |
| --- | --- | --- | --- |
| T-SAE (all features) | 0.91 | 0.95 | 0.81 |
| Matryoshka SAE | 0.82 | 0.87 | 0.83 |
| BatchTopK SAE | 0.80 | 0.85 | 0.82 |
| Baseline (model activations) | 0.84 | 0.89 | 0.79 |

The T-SAE beats the SAE baselines on semantics by 9 to 11 percentage points and on context by 8 to 10 points (and the raw model activations by 7 and 6 points). It's slightly worse than the SAE baselines on syntax (0.81 vs. 0.82-0.83), but that's expected and actually desirable, because the high-level features are now capturing semantics instead of wasting capacity on syntax.

The t-SNE Visualization

The paper's Figure 2 provides a stunning visual confirmation. When you plot T-SAE high-level feature activations in 2D (via t-SNE) and color by question category, you see clean clusters for each subject. Matryoshka SAE features, plotted the same way, show no clear clustering.

Even more telling: when you color by syntax (part-of-speech), T-SAE low-level features cluster cleanly by POS tag, while T-SAE high-level features don't. The split is working.

Feature Disentanglement Probe

Each dot represents a token's SAE features projected to 2D. Toggle between labeling schemes to see how T-SAE features cluster differently than baseline SAEs. Orange = T-SAE high-level, Teal = T-SAE low-level, Gray = Baseline SAE.

Disentanglement Within the Feature Splits

The paper goes further: it probes each split separately. For T-SAEs:

For Matryoshka SAEs, both splits perform similarly on all three tasks — no specialization. The 20/80 partition is arbitrary without the contrastive loss to drive differentiation.

What the paper doesn't say but is crucial: The low-level features can still recover some semantic information (0.78 accuracy). This makes sense — even syntactic features carry some semantic signal (certain POS patterns correlate with certain topics). The disentanglement isn't perfect, but it's dramatically better than baselines.
When probing T-SAE feature splits separately, what pattern indicates successful disentanglement?

Chapter 6: Experiments

A natural worry: does adding the contrastive loss hurt reconstruction quality? If T-SAEs disentangle features but can't reconstruct the input, they're useless. The paper evaluates five standard SAE metrics.

Standard SAE Metrics

| Metric | What it measures | Higher or lower is better? |
| --- | --- | --- |
| FVE (Fraction of Variance Explained) | 1 − Var(x − x̂) / Var(x). How much of the input's variance the SAE captures. | Higher ↑ |
| Cosine Similarity | cos(x, x̂). Are the reconstructions pointing in the right direction? | Higher ↑ |
| Fraction Alive | What fraction of the 16k features activate at least once on the test data? | Higher ↑ |
| Smoothness | Average max absolute change in active feature activations, normalized by the change in the model's activations. Lower = smoother. | Lower ↓ |
| AutoInterp Score | Can an LLM (Llama3.3-70B) generate a correct feature explanation? SAEBench evaluation. | Higher ↑ |
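The two reconstruction metrics are simple enough to sketch directly. The code below is an illustrative reading of the definitions above, not the paper's evaluation code: it assumes variance is summed over dimensions and cosine similarity is averaged over samples, and the data is a made-up case where every reconstruction is shrunk by 10%.

```python
import math

def total_variance(rows):
    """Sum over dimensions of the variance across samples."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return sum((r[j] - means[j]) ** 2 for r in rows for j in range(d)) / n

def fve(x, x_hat):
    """Fraction of variance explained: 1 - Var(x - x_hat) / Var(x)."""
    residual = [[a - b for a, b in zip(r, rh)] for r, rh in zip(x, x_hat)]
    return 1.0 - total_variance(residual) / total_variance(x)

def mean_cosine(x, x_hat):
    """Cosine similarity between each input and its reconstruction, averaged."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    return sum(cos(r, rh) for r, rh in zip(x, x_hat)) / len(x)

x = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
x_hat = [[0.9 * a for a in row] for row in x]   # hypothetical: shrunk by 10%

fve_score = fve(x, x_hat)          # about 0.99: the residual carries 1% of the variance
cos_score = mean_cosine(x, x_hat)  # about 1.0: the direction is preserved exactly
```

The example shows why both metrics are reported: a reconstruction can point in exactly the right direction and still lose variance through a scale error.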

Core Results (Table 1 from the paper)

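| Model | SAE | FVE ↑ | CosSim ↑ | Alive ↑ | Smooth (High) ↓ | Smooth (Low) ↓ | AutoInterp ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Pythia-160m | Temporal SAE | 0.94 | 0.93 | 0.87 | 0.09 | 0.17 | 0.81 ± 0.17 |
| Pythia-160m | Matryoshka SAE | 0.95 | 0.94 | 0.89 | 0.12 | 0.13 | 0.83 ± 0.16 |
| Pythia-160m | BatchTopK SAE | 0.95 | 0.94 | 0.84 | 0.13 (all features) | - | 0.85 ± 0.15 |
| Gemma2-2b | Temporal SAE | 0.75 | 0.88 | 0.78 | 0.10 | 0.15 | 0.83 ± 0.15 |
| Gemma2-2b | Matryoshka SAE | 0.75 | 0.89 | 0.76 | 0.14 | 0.12 | 0.83 ± 0.16 |
| Gemma2-2b | BatchTopK SAE | 0.76 | 0.89 | 0.66 | 0.13 (all features) | - | 0.83 ± 0.16 |

BatchTopK SAEs have no high/low split, so a single smoothness value is listed for them.

The key takeaway: T-SAEs achieve competitive FVE and cosine similarity, so reconstruction quality is essentially unchanged. But their high-level features are dramatically smoother (0.09 vs. 0.12-0.13 on Pythia). Smoothness is the quantitative signature of temporal consistency. And the AutoInterp score is comparable to baselines, meaning T-SAE features are just as interpretable to LLM judges.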

Reading the Smoothness Metric

The smoothness metric deserves careful explanation. For a sequence of length T and a set of active features, we compute:

$$\Delta_s = \frac{1}{n} \sum_{i=1}^{n} \max_t \frac{|f_i(x_t) - f_i(x_{t-1})|}{\lVert x_t - x_{t-1} \rVert}$$

This is the maximum absolute change in feature activation, normalized by how much the underlying model activation changed. A smooth feature might have Δs = 0.05 (it barely changes even when the model activation changes a lot). A noisy syntactic feature might have Δs = 0.5 (it spikes wildly).

T-SAE high-level features have the lowest smoothness score of any method. Their low-level features are appropriately less smooth, confirming that the split works as intended.
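As a rough illustration of the metric, the sketch below computes it for one sequence. It follows the formula as written above (per-feature maximum over time of the normalized change, then an average over features); the function name, the choice of which features to pass in, and the toy data are assumptions.

```python
import math

def smoothness(features, activations):
    """Mean over features of max_t |f_i(x_t) - f_i(x_{t-1})| / ||x_t - x_{t-1}||.

    features:    list of T feature vectors (only the features of interest)
    activations: list of T model activation vectors (consecutive ones must differ)
    """
    n = len(features[0])
    per_feature_max = [0.0] * n
    for t in range(1, len(features)):
        step = math.sqrt(sum((a - b) ** 2
                             for a, b in zip(activations[t], activations[t - 1])))
        for i in range(n):
            change = abs(features[t][i] - features[t - 1][i]) / step
            per_feature_max[i] = max(per_feature_max[i], change)
    return sum(per_feature_max) / n

# Hypothetical 4-token sequence; the model activation moves by 1.0 at each step.
activations = [[0.0], [1.0], [2.0], [3.0]]
features = [[1.0, 0.0], [1.0, 0.5], [0.9, 0.0], [0.9, 0.5]]  # column 0 smooth, column 1 spiky

score_smooth = smoothness([[f[0]] for f in features], activations)  # about 0.1
score_spiky = smoothness([[f[1]] for f in features], activations)   # 0.5
```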

Smoothness Comparison

Feature activations over a sequence for three SAE types. T-SAE high-level features (orange, solid) are visibly smoother than Matryoshka (gray, dashed) or BatchTopK (gray, dotted). T-SAE low-level features (teal) are appropriately less smooth. Click "New Sequence" for different realizations.

What About Fraction Alive?

An interesting detail: T-SAEs have a higher fraction of alive features than BatchTopK SAEs (0.87 vs. 0.84 on Pythia, 0.78 vs. 0.66 on Gemma2-2b), and are close to Matryoshka SAEs (0.89 on Pythia, 0.76 on Gemma2-2b). More of the 16k features activate at least once, which suggests that the contrastive loss does not come at the cost of feature capacity. Dead features (features that never activate) are wasted parameters, so having fewer of them is a benefit.

Do T-SAEs sacrifice reconstruction quality for better disentanglement?

Chapter 7: The Safety Case Study

Academic metrics are nice, but do T-SAEs actually help with real problems? The paper presents two compelling case studies: understanding alignment datasets and steering model behavior.

Case Study 1: Understanding HH-RLHF

The HH-RLHF dataset (Bai et al., 2022) is Anthropic's human preference dataset used to train safety-focused models. It contains pairs of completions — one "chosen" (preferred by human raters) and one "rejected." The paper asks: what features differentiate chosen from rejected completions?

Method: For each completion pair, compute the difference in mean T-SAE feature activations (rejected − chosen). Features with the largest average difference are the ones that distinguish unsafe from safe content.

What T-SAEs find: The top features activated more in rejected (unsafe) completions include "varied text concepts," "etiquette and social behavior guidelines," "text about personal experiences and opinions," "social issues and controversy," "crime and malicious activities," and "violent or aggressive behavior descriptions." These are exactly the kind of safety-relevant semantic features you'd want for monitoring.

What Matryoshka SAEs find for the same analysis: "specific bicycle components," "terms related to data management," "references to ecosystem dynamics and environmental conditions." These are random noise features that happen to correlate with rejected completions for spurious reasons (like response length).

The Spurious Correlation Problem

Here's a subtle but critical finding. Some T-SAE features that show high activation differences are actually spuriously correlated with response length, not with actual safety content. The paper identifies these by computing the Pearson correlation between feature activation difference and response length difference.

| Feature | Avg Diff (rejected − chosen) | Corr with length | Type |
| --- | --- | --- | --- |
| transition words and phrases | 0.063 | 0.52 | Length-related |
| legal and formal language | 0.058 | 0.38 | Length-related |
| the word "the" | 0.047 | 0.31 | Length-related |
| crime and malicious activities | 0.060 | 0.12 | Semantically relevant |
| violent or aggressive behavior | 0.044 | 0.05 | Semantically relevant |
| negative comments and insults | 0.043 | -0.08 | Semantically relevant |

The semantically relevant features have low correlation with length (between −0.08 and 0.12): they genuinely capture unsafe content, not just the fact that rejected responses tend to be longer. The length-related features (correlations of 0.31 to 0.52) are spurious confounders. T-SAEs recover both, but critically, the length correlation makes the semantically relevant ones easy to identify and separate.
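The length check is an ordinary Pearson correlation between two per-pair quantities. The sketch below uses made-up numbers for four completion pairs and two hypothetical features; it is meant to show the computation, not to reproduce the values in the table.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-pair values (rejected minus chosen) for four completion pairs.
length_diff = [10.0, 20.0, 30.0, 40.0]            # difference in token count
filler_feature_diff = [0.01, 0.02, 0.03, 0.04]    # grows with length
topic_feature_diff = [0.05, 0.01, 0.01, 0.05]     # unrelated to length

mean_diffs = {
    "filler": sum(filler_feature_diff) / len(filler_feature_diff),
    "topic": sum(topic_feature_diff) / len(topic_feature_diff),
}
corr_filler = pearson(filler_feature_diff, length_diff)  # close to 1: length artifact
corr_topic = pearson(topic_feature_diff, length_diff)    # close to 0: not explained by length
```

Both toy features have a similar mean difference, so ranking by difference alone would not separate them; the correlation with length is what flags the first one as a confounder.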

Case Study 2: Steering

Can T-SAE features be used to steer model generation? The paper intervenes on features during inference by adding α · di to the model's residual stream, where di is the decoder column for feature i and α is the intervention strength (a different quantity from the contrastive weight α used in training).
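The intervention itself is a single vector addition. Here is a minimal sketch on plain Python lists; the 4-dimensional residual vector, the decoder column, and the function name `steer` are illustrative assumptions, and a real intervention would apply this to the residual stream at the SAE's layer during generation.

```python
def steer(residual, decoder_column, strength):
    """Return residual + strength * d_i, elementwise."""
    return [r + strength * d for r, d in zip(residual, decoder_column)]

# Hypothetical 4-d residual vector and decoder column for one feature.
residual = [0.5, -1.0, 0.0, 2.0]
d_i = [0.0, 1.0, 0.0, -0.5]

steered = steer(residual, d_i, strength=2.0)   # [0.5, 1.0, 0.0, 1.0]
```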

The key finding: steering with high-level (semantic) features is dramatically more effective than steering with low-level (syntactic) features.

Practical implication: T-SAEs provide a principled way to find the "right" features for steering. Instead of trying thousands of features and hoping one works, you can restrict to the high-level partition. This partition was learned automatically from the data, not hand-selected.
Steering: Semantic vs. Syntactic Features

Simulated steering with high-level (semantic) vs. low-level (syntactic) features at varying intervention strengths. High-level steering changes the topic while maintaining coherence. Low-level steering degrades to repetition. Drag the strength slider to see the effect.

Why is steering with high-level T-SAE features more effective than steering with low-level features?

Chapter 8: Ablations

The paper ablates three key design choices in the T-SAE training pipeline. Each ablation answers a specific question about why the method works.

Ablation 1: Feature Split Ratio

How much of the feature space should be "high-level"? The paper tests 10/90, 20/80, and 50/50 splits.

| Split (High/Low) | FVE | Smoothness (High) | Semantics | Context | Syntax |
| --- | --- | --- | --- | --- | --- |
| 10/90 | 0.94 | 0.08 | 0.88 | 0.92 | 0.82 |
| 20/80 (default) | 0.94 | 0.09 | 0.91 | 0.95 | 0.81 |
| 50/50 | 0.94 | 0.11 | 0.90 | 0.94 | 0.73 |

The tradeoff: Moving from 10/90 to 20/80 raises semantic and context accuracy by 3 points each at almost no cost to syntax. Going further to 50/50 brings no additional gain (0.90 and 0.94) and drops syntax accuracy to 0.73: too many features are forced to be smooth, starving the low-level partition of capacity to capture syntax. 20/80 is the sweet spot, with enough high-level features for semantics and enough low-level features for syntax.

Ablation 2: Contrastive Window

Instead of contrasting with the immediately previous token (t−1), what if we contrast with a random token from the past context? The paper samples the contrastive partner $x_{t-r}$ uniformly from the preceding tokens, with offset r < 25.

| Contrastive Partner | Semantics | Context | Syntax |
| --- | --- | --- | --- |
| Adjacent (t-1) [default] | 0.91 | 0.95 | 0.81 |
| Random past (r < 25) | 0.89 | 1.06 | 0.71 |

Random contrasting is reported to boost context accuracy significantly (+11%) because it encourages longer-range consistency, although the 1.06 in the table exceeds the maximum possible accuracy of 1.0, so the exact context figure should be treated with caution. It hurts syntax accuracy (−10%) because the low-level features now have less capacity: the high-level features are "greedier," capturing more information.

The insight: Depending on the application, you might prefer different contrastive windows. For safety monitoring, where tracking the broader context matters most, random contrasting might be better, at a small cost in semantic accuracy (0.89 vs. 0.91). For linguistic analysis (where you need syntax too), adjacent contrasting is safer.
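The two partner-selection schemes differ by one line. The sketch below is an assumed implementation of the sampling rule described above (uniform offset r with 1 ≤ r < 25, clipped at the start of the sequence); the function name and its arguments are hypothetical.

```python
import random

def sample_partner_index(t, rng, max_offset=24, adjacent=False):
    """Pick the position of the contrastive partner for token t (requires t >= 1)."""
    if adjacent:
        return t - 1
    r = rng.randint(1, min(t, max_offset))   # uniform offset, 1 <= r < 25
    return t - r

rng = random.Random(0)
adjacent_idx = sample_partner_index(40, rng, adjacent=True)          # always 39
random_idx = [sample_partner_index(40, rng) for _ in range(1000)]    # values in 16..39
```

Switching between the two modes changes only which past activation is loaded as the positive partner; the loss itself stays the same.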

Ablation 3: No Contrastive Loss (Just Matryoshka)

What if you remove the contrastive loss entirely and just use the Matryoshka reconstruction objective with the 20/80 split?

| Configuration | Δ Semantics | Δ Context | Δ Syntax |
| --- | --- | --- | --- |
| No contrastive (α = 0) | −0.07 | −0.10 | +0.01 |
| Naive L2 smoothness (not contrastive) | −0.02 | +0.02 | +0.07 |

Without the contrastive loss, semantics drops by 7 points and context by 10 points. The Matryoshka split alone is not enough — the contrastive loss is essential for driving specialization.

The "naive L2 smoothness" ablation replaces the contrastive loss with a simple per-sample L2 penalty: $\ell = \alpha \lVert z_t - z_{t-1} \rVert_2^2$. This enforces smoothness but without the contrastive negative pairs. It performs slightly worse on semantics (−0.02) and noticeably better on syntax (+0.07), because without negatives to prevent collapse, all high-level features converge to similar values, losing discriminative power.

Why contrastive beats naive smoothness: The naive L2 loss says "be similar to the previous token" but has no mechanism to prevent all sequences from having the same high-level features. The contrastive loss says "be similar to YOUR previous token but DIFFERENT from other sequences' previous tokens." The negatives are what prevent collapse and force the features to encode meaningful, discriminative semantic information.
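The collapse argument can be checked numerically. In this sketch (toy vectors and helper names are assumptions; only the row-wise half of the contrastive loss is computed), a fully collapsed solution gets a perfect score under the L2 penalty but sits at chance level under InfoNCE.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def l2_smoothness(z_t, z_prev):
    """Mean squared distance between each z_t and its own z_prev."""
    return sum(sum((a - b) ** 2 for a, b in zip(u, v))
               for u, v in zip(z_t, z_prev)) / len(z_t)

def infonce_rows(z_t, z_prev):
    """Row-wise InfoNCE term: each z_t must pick out its own z_prev."""
    total = 0.0
    for i, u in enumerate(z_t):
        sims = [cosine(u, v) for v in z_prev]
        total += -math.log(math.exp(sims[i]) / sum(math.exp(s) for s in sims))
    return total / len(z_t)

# Collapsed solution: every token in every sequence gets the same features.
collapsed = [[1.0, 1.0]] * 4
# Discriminative solution: each sequence has its own direction.
distinct = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

l2_collapsed = l2_smoothness(collapsed, collapsed)   # 0.0: the L2 penalty is fully satisfied
nce_collapsed = infonce_rows(collapsed, collapsed)   # log(4): no better than guessing
nce_distinct = infonce_rows(distinct, distinct)      # lower: sequences are distinguishable
```

Both solutions are perfectly smooth, so only the contrastive term tells them apart.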
Ablation Explorer

Adjust the contrastive weight α and split ratio to see how probe accuracies change. The bars show semantics (orange), context (blue), and syntax (teal). Watch how removing the contrastive loss (α=0) or changing the split ratio affects specialization.

What happens when you replace the contrastive loss with a naive L2 smoothness penalty ($\ell = \alpha \lVert z_t - z_{t-1} \rVert_2^2$)?

Chapter 9: Connections

T-SAEs sit at the intersection of three research areas: mechanistic interpretability, contrastive representation learning, and computational linguistics. Let's map the connections and limitations.

Cheat Sheet: Every Key Equation

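| Symbol | Meaning | Typical Value |
| --- | --- | --- |
| $x_t$ | LLM activation at token t | $\mathbb{R}^{768}$ or $\mathbb{R}^{2304}$ |
| $f(x_t)$ | Sparse feature vector | $\mathbb{R}^{16384}$, only 20 nonzero |
| $f_{0:h}(x_t) = z_t$ | High-level (semantic) features | First 20% of features |
| $f_{h:m}(x_t)$ | Low-level (syntactic) features | Remaining 80% of features |
| $L_H$ | High-level reconstruction loss | $\lVert x - (W_{dec}^{0:h} f_{0:h} + b) \rVert^2$ |
| $L_L$ | Full reconstruction loss | $\lVert x - (W_{dec} f + b) \rVert^2$ |
| $L_{contr}$ | Symmetric contrastive loss (InfoNCE) | On $z_t$, $z_{t-1}$ across batch |
| α | Contrastive weight | 1.0 |
| $L$ | Total loss | $L_H + L_L + \alpha L_{contr}$ |
| s(x, y) | Cosine similarity | [−1, 1] |
| $\Delta_s$ | Smoothness metric | Lower = smoother features |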

How T-SAEs Relate to Other Methods

| Method | Key Idea | vs. T-SAEs |
| --- | --- | --- |
| Standard SAE | Sparse reconstruction | No temporal structure, features are noisy and syntactic |
| Matryoshka SAE | Hierarchical feature splits | Same split but no contrastive loss → no specialization |
| BatchTopK SAE | Fixed-k sparsity | Better sparsity control but still i.i.d. tokens |
| Transcoders | Causal features via MLP replacement | Different architecture; T-SAEs modify the loss, not the model |
| CPC / InfoNCE | Contrastive predictive coding | T-SAEs apply the same principle to SAE feature learning |
| Griffiths et al. (2004) | HMM + LDA for syntax/semantics | T-SAEs are the neural, unsupervised version of this idea |

Limitations

Future Directions

Related Lessons

Prerequisite: Transformers from Zero. Understand how LLMs produce the activations T-SAEs decompose.
Related: Contrastive Learning & CLIP. The InfoNCE loss that T-SAEs borrow for temporal consistency.
Application: Reward & Alignment. The RLHF pipeline whose training data T-SAEs help analyze.

"What I cannot create, I do not understand." — Richard Feynman. T-SAEs let us decompose what LLMs create into parts we can understand: meaning and grammar, separately.

What is the main limitation of the T-SAE contrastive formulation that the authors identify as future work?