Temporal Sparse Autoencoders

Chapter 0: The Problem

You train a Sparse Autoencoder on the internal representations of GPT-2, hoping to discover what the model "knows." You feed in thousands of tokens, learn 16,000 latent features, and excitedly look at what fires. Feature 11795: "the phrase 'The' at the start of sentences." Feature 3042: "sentence endings or periods." Feature 8901: "code syntax endings."

These are real features from publicly available SAEs on Neuronpedia. They are interpretable, but they tell you almost nothing about what the model is thinking.

You wanted to find concepts like "discussion of plant biology" or "scientific explanation" or "legal reasoning." Instead you got a catalog of punctuation patterns and capitalization rules. The SAE recovered syntax — the mechanical grammar of language — but missed semantics — the meaning.

The core problem: Standard SAEs treat every token as independent. They see "Photosynthesis" at position 1, "is" at position 2, "the" at position 3, and "process" at position 4 as four unrelated data points. But a human reader sees them as one idea — "this is a science sentence about plant biology." The SAE has no mechanism to discover that these tokens share a common semantic theme. So it defaults to the easiest pattern: surface-level syntax that changes token-by-token.
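To make the independence point concrete, here is a minimal sketch, assuming a random (untrained) linear encoder and decoder with ReLU features in place of a real SAE. It shows that a per-token reconstruction loss does not change when the tokens of a sequence are shuffled, so the objective cannot reward anything that depends on token order. The name `per_token_loss` and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, T = 8, 32, 10                      # toy sizes (assumptions)
W_enc = rng.normal(size=(m, d))          # random stand-in encoder
W_dec = rng.normal(size=(d, m)) / m      # random stand-in decoder

def per_token_loss(X):
    """Mean squared reconstruction error, computed one token at a time."""
    F = np.maximum(X @ W_enc.T, 0.0)     # (T, m) features, one row per token
    X_hat = F @ W_dec.T                  # (T, d) reconstructions
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=-1)))

X = rng.normal(size=(T, d))              # one "sequence" of activations
X_shuffled = X[rng.permutation(T)]       # same tokens, order destroyed

loss_ordered = per_token_loss(X)
loss_shuffled = per_token_loss(X_shuffled)
```

The two losses agree up to floating-point rounding, which is the sense in which a standard SAE "sees" a bag of tokens rather than a sequence.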

This isn't just an aesthetic complaint. Features that fire on "the word 'the'" are useless for three critical downstream tasks:

Safety monitoring: You want to detect when a model is generating harmful content. The topic of the generation (violence, deception, hate speech) is a semantic property that spans many tokens, not a per-token syntax pattern.
Steering: You want to intervene on a model's generation by amplifying or suppressing a feature. If a feature fires inconsistently across the sequence (noisy, per-token), steering with it just corrupts the output. You need features that activate smoothly over the whole generation.
Understanding: You want to know what the model "believes" about a passage. Semantic content — the topic, the intent, the audience — is the answer. Token-level syntax is noise.

The simulation below shows what happens when you decompose a sequence with a standard SAE vs. what we'd want to see. The standard SAE's top features are noisy and token-specific. The ideal features would be smooth and topic-aligned.

SAE Features: Noisy vs. Smooth

Top: a standard SAE's feature activations over a 3-topic sequence (noisy, per-token spikes). Bottom: ideal semantic features (smooth, topic-aligned).

The question this paper asks: Can we modify the SAE training objective — with one additional loss term — to recover smooth, semantic features instead of noisy syntactic ones?

The answer, remarkably, is yes. And the modification is almost trivially simple.

Why do standard SAEs trained on LLM activations tend to recover syntactic features (like "the word 'the'") rather than semantic features (like "discussion of biology")?

Because LLMs don't encode semantic information; they only learn syntax
Because SAEs treat each token independently, so they can't discover features that persist across multiple tokens in a sequence
Because SAEs don't have enough features to capture both syntax and semantics

Chapter 1: The Key Insight

Read this sentence: "Photosynthesis is the process by which plants convert sunlight into energy."

Now ask yourself two questions about each token:

What topic is this about? Plant biology. Scientific explanation. Every single token in this sentence belongs to the same topic. The topic is temporally consistent — it doesn't change from one token to the next.
What syntactic role does this token play? "Photosynthesis" is a noun (capitalized only because it starts the sentence). "is" is a copula verb. "the" is a determiner. "by" is a preposition. These roles change every single token. Syntax is temporally local.

The key insight: Semantics evolve slowly over a sequence. Syntax changes rapidly. If you encourage some SAE features to activate consistently across adjacent tokens, those features will naturally capture semantics. The remaining features, unconstrained, will capture whatever the consistent features miss — which is syntax. Temporal consistency is a free supervisory signal for disentanglement.
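A toy illustration of the two timescales, assuming synthetic signals rather than real SAE features: a "semantic" signal that holds one value per topic and a "syntactic" signal redrawn at every token. The helper `mean_adjacent_change` is a hypothetical name for the average token-to-token change.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 60
# Slow "semantic" signal: one value per topic, held for 20 tokens each.
semantic = np.repeat(rng.normal(size=3), 20)
# Fast "syntactic" signal: redrawn independently at every token.
syntactic = rng.normal(size=T)

def mean_adjacent_change(signal):
    """Average |s_t - s_{t-1}| over the sequence."""
    return float(np.mean(np.abs(np.diff(signal))))

semantic_change = mean_adjacent_change(semantic)
syntactic_change = mean_adjacent_change(syntactic)
```

The semantic signal only moves at the two topic boundaries, so its adjacent-token change is far smaller. That gap is the signal a temporal consistency objective can exploit.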

This isn't a new idea in linguistics. Chomsky (1965) distinguished syntactic and semantic structure. Griffiths et al. (2004) formally argued that semantics exhibit long-range dependencies while syntax depends on short-range interactions. What's new is using this insight as a training objective for SAEs.

The Analogy: Radio Tuning

Think of an LLM's hidden representation at each token as a radio signal. It carries two overlapping broadcasts simultaneously:

The FM station (semantics): A slow-changing carrier wave. The topic, the intent, the style. It changes when the passage shifts from biology to history, but within a passage it's roughly constant.
The AM modulation (syntax): Rapid fluctuations riding on top of the carrier. Noun, verb, determiner, punctuation. These change every token.

A standard SAE mixes them together. It has no reason to separate them. T-SAEs add one constraint: some features must look like the FM station — smooth and consistent over time. The rest automatically become the AM modulation.

Temporal Consistency: Semantics vs. Syntax

Semantic features (orange) change slowly and track the topic. Syntactic features (teal) fluctuate rapidly and track per-token grammar. The insight: we can use this difference in temporal scale as a training signal.

What Makes This Different From Prior Work

Several prior approaches tried to improve SAEs: Matryoshka SAEs (Bussmann et al., 2025) learn hierarchical features but don't enforce temporal structure. Transcoders (Paulo et al., 2025) learn causal features but still treat tokens as i.i.d. BatchTopK (Bussmann et al., 2024) improves sparsity but doesn't address the semantic/syntactic split.

T-SAEs are the first to inject a structural prior from linguistics into SAE training: the temporal consistency of meaning. The modification is a single contrastive loss term added to the standard objective. Everything else — the encoder, decoder, sparsity — stays the same.

A useful way to think about it: Standard SAEs ask "what features best reconstruct this token?" T-SAEs ask "what features best reconstruct this token while also being consistent with what we'd expect from the previous token?" It's the difference between a single-frame image classifier and a video model — context matters.

Standard SAE

Input: token x_t (independent). Loss: reconstruction ‖x_t − x̂_t‖². Output: mixed features (mostly syntactic).

↓ add one term

Temporal SAE

Input: token pair (x_t, x_t-1). Loss: reconstruction + contrastive consistency on high-level features. Output: disentangled semantic + syntactic features.

Why does encouraging temporal consistency in some SAE features lead to semantic (rather than syntactic) feature recovery?

Because semantic content evolves slowly over a sequence (the topic stays the same across adjacent tokens), while syntax changes every token, so features forced to be consistent naturally capture semantics
Because temporal consistency makes features more interpretable to humans
Because contrastive learning always produces semantic features regardless of domain

Chapter 2: The Data Generating Process

Before building the T-SAE architecture, the authors formalize why temporal consistency should work. They propose a simple model of how language is generated, then show what it implies for dictionary learning.

The Speaker Model

Imagine a speaker producing a sequence of tokens τ₁, ..., τ_T. At each timestep t, the speaker's word choice is controlled by several factors:

h_t — high-level latent variables: The topic ("plant biology"), the intent ("explaining a concept"), the audience ("undergraduate students"), the style ("textbook prose"). These are properties of the passage, not the individual token. They change slowly.
l_t — low-level latent variables: The grammatical role of this specific word (noun, verb, article), the word's position in the sentence, spelling conventions. These are properties of this token specifically. They change every timestep.

The speaker produces each token as a function of context and both latent types:

τ_t = φ(τ^t-1, h_t, l_t)

where τ^t-1 = (τ₁, ..., τ_t-1) is all previous tokens. Think of h_t as "what to say" and l_t as "how to say this particular word."

Concrete example: You're writing about photosynthesis (h_t = "plant biology, scientific explanation"). At position t, your h_t hasn't changed — you're still talking about plant biology. But l_t determines whether this specific token is a verb ("convert"), a noun ("sunlight"), or a preposition ("into"). The high-level meaning is stable; the low-level grammar fluctuates.
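The speaker model can be sketched as a tiny generator. This is only a toy under stated assumptions: the vocabulary `VOCAB`, the role list, and the lookup-table form of `phi` are all invented for illustration, and `phi` ignores the previous tokens that the real φ would condition on.

```python
import random

# Hypothetical toy vocabulary: VOCAB[topic][role] -> word
VOCAB = {
    "biology": {"noun": "plants", "verb": "convert", "prep": "into"},
    "history": {"noun": "armies", "verb": "invade", "prep": "across"},
}
ROLES = ["noun", "verb", "prep"]

def phi(prev_tokens, h_t, l_t):
    """Toy emission: the word is picked by topic (h_t) and role (l_t).
    prev_tokens mirrors the signature of phi but is ignored here."""
    return VOCAB[h_t][l_t]

def speak(h, length, seed=0):
    rng = random.Random(seed)
    tokens, roles = [], []
    for _ in range(length):
        l_t = rng.choice(ROLES)             # low-level latent: redrawn each step
        tokens.append(phi(tokens, h, l_t))  # high-level latent h is held fixed
        roles.append(l_t)
    return tokens, roles

tokens, roles = speak("biology", length=8)
```

Every emitted word stays inside the biology vocabulary while the role sequence varies from step to step, which is the "what to say" versus "how to say this word" split in miniature.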

Assumption 1: Temporal Consistency

The first key assumption formalizes what we just described intuitively:

Assumption 1 (Temporal Consistency): h_t is approximately time-invariant within a sequence. Two tokens x_t, x_t' from the same sequence should have similar high-level latents: h_t ≈ h_t'.

This doesn't mean h_t is exactly the same at every position — it means it changes slowly. A paragraph about biology has the same topic at token 1 and token 50. The topic might shift at the paragraph boundary (say, to history), but within a passage it's approximately constant.

Assumption 2: Hierarchical Representation

The second assumption says that the LLM's internal representation x_t encodes both h_t and l_t, and they're hierarchical:

Assumption 2 (Hierarchical Representation): There exists an invertible mapping g such that g(h_t, l_t) = x_t, and:

0 = ‖g(h_t, l_t) − x_t‖ ≤ ‖g(h_t, 0) − x_t‖ ≤ ε

In plain English: h_t alone can reconstruct x_t up to error ε. But l_t captures additional signal that h_t doesn't explain.

Why is this hierarchical? Because the high-level features do most of the reconstruction work. If you know the topic is "plant biology," you can already predict a lot about the activations. The low-level features clean up the residual — they capture the stuff that the topic alone can't explain (like whether this specific token is a noun or a verb).
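A numeric check of the inequality, assuming a linear stand-in for g (the assumption itself only requires that some invertible g exists; nothing here is the paper's mapping). With g(h, l) = A h + B l and a small B, dropping l leaves an error of exactly ‖B l‖, which is bounded by the spectral norm of B times ‖l‖.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_h, n_l = 16, 4, 6                   # toy dimensions (assumptions)
A = rng.normal(size=(d, n_h))            # how high-level latents enter x_t
B = 0.05 * rng.normal(size=(d, n_l))     # small low-level contribution

def g(h, l):
    """Toy linear stand-in for the mapping g(h_t, l_t) = x_t."""
    return A @ h + B @ l

h_t = rng.normal(size=n_h)
l_t = rng.normal(size=n_l)
x_t = g(h_t, l_t)

err_full = np.linalg.norm(g(h_t, l_t) - x_t)             # both latents: zero error
err_high = np.linalg.norm(g(h_t, np.zeros(n_l)) - x_t)   # h_t only: equals ||B l_t||
eps = np.linalg.norm(B, 2) * np.linalg.norm(l_t)         # one valid choice of bound
```

The ordering 0 = err_full ≤ err_high ≤ eps mirrors the chain in Assumption 2: the high-level latents carry most of the signal, and the low-level latents close the remaining gap.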

What This Implies for SAE Design

If these assumptions hold (the paper's experiments are consistent with them), then a natural SAE training strategy has two parts:

Learn high-level features first by encouraging them to be temporally consistent. These features capture semantics and should be able to reconstruct x_t approximately.
Let low-level features learn the residual. Don't constrain them — they'll naturally capture whatever the high-level features miss, which is per-token syntactic information.

This is exactly the Matryoshka-style hierarchy (Bussmann et al., 2025), but with a crucial addition: the temporal contrastive loss on the high-level split. Without it, both splits learn the same kind of features. With it, they specialize.

Hierarchical Reconstruction

The high-level features (h) reconstruct most of the signal (shown as a smooth approximation). The low-level features (l) capture the fast-changing residual. Together they perfectly reconstruct x_t. The size of the residual left to the low-level features is ε.

A Worked Example

Consider a 768-dimensional activation vector x_t from layer 8 of Pythia-160m. The T-SAE has m = 16,384 features, split 20/80: h = 3,277 high-level features and 13,107 low-level features.

At token t = "Photosynthesis" and token t+1 = "is":

The high-level features f_0:h(x_t) and f_0:h(x_t+1) should be similar — both encode "plant biology discussion." Their cosine similarity should be high (say, 0.85).
The low-level features f_h:m(x_t) and f_h:m(x_t+1) will be different — "Photosynthesis" (capitalized noun) vs. "is" (copula verb). No consistency constraint applies.
The high-level features alone reconstruct x_t with error ≤ ε. Adding the low-level features drives the error to zero.
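The split sizes in this example follow from simple arithmetic. The sketch below assumes the high-level count is obtained by rounding 0.2 · m to the nearest integer, which reproduces the 3,277 quoted above; the rounding rule itself is an assumption.

```python
m = 16_384            # total number of SAE features
split = 0.2           # fraction assigned to the high-level partition
k = 20                # active features per token

n_high = round(split * m)     # 0.2 * 16384 = 3276.8, rounded to the nearest integer
n_low = m - n_high            # everything else is low-level
active_fraction = k / m       # share of features that fire on a token
```

Only about 0.12% of the dictionary is active on any token, so the 20 active features have to be shared between the two partitions.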

In the data generating process, what does Assumption 2 (Hierarchical Representation) guarantee about high-level features h_t?

h_t can perfectly reconstruct x_t without any low-level information
h_t alone can approximately reconstruct x_t (up to ε), with l_t capturing the remaining residual
h_t and l_t contribute equally to reconstructing x_t

Chapter 3: The Architecture

T-SAEs modify the standard SAE architecture in exactly two ways: (1) partition the feature space into high-level and low-level splits, and (2) add a contrastive loss on the high-level split. Let's walk through every component.

The Encoder

The encoder takes a model activation x_t ∈ R^d (e.g., d = 768 for Pythia-160m, d = 2304 for Gemma2-2b) and produces a sparse feature vector f(x_t) ∈ R^m (e.g., m = 16,384 features):

f(x_t) = σ(W^enc x_t + b^enc)

where W^enc ∈ R^m×d is the encoder weight matrix, b^enc ∈ R^m is the encoder bias, and σ is the activation function. The paper uses BatchTopK activation (k = 20): within each training batch of B tokens, only the k·B largest pre-activations are kept, so each token has 20 non-zero features (out of 16,384) on average. This enforces strict sparsity.

Data flow: x_t ∈ R⁷⁶⁸ → W^enc x_t + b^enc ∈ R¹⁶³⁸⁴ → BatchTopK(k=20) → f(x_t) ∈ R¹⁶³⁸⁴ (about 20 nonzero entries). The encoder is a single linear layer followed by a sparsity-inducing activation. No hidden layers, no nonlinearity beyond the TopK selection.
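The difference between plain TopK and BatchTopK is easy to miss, so here is a small NumPy sketch of both. It is an illustration rather than the paper's code: `topk_per_token` and `batch_topk` are hypothetical helper names, and the sizes are toy values. Plain TopK keeps exactly k entries in every row; BatchTopK keeps the k · B largest entries across the whole batch, so k is only the per-token average.

```python
import numpy as np

def topk_per_token(pre_act, k):
    """Keep the k largest entries in each row (plain TopK)."""
    out = np.zeros_like(pre_act)
    idx = np.argpartition(pre_act, -k, axis=1)[:, -k:]
    rows = np.arange(pre_act.shape[0])[:, None]
    out[rows, idx] = pre_act[rows, idx]
    return out

def batch_topk(pre_act, k):
    """Keep the k * B largest entries across the whole batch (BatchTopK)."""
    B = pre_act.shape[0]
    flat = pre_act.ravel()
    keep = np.argpartition(flat, -k * B)[-k * B:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(pre_act.shape)

rng = np.random.default_rng(0)
pre_act = rng.normal(size=(4, 64))       # (batch, features), toy sizes
f_tok = topk_per_token(pre_act, k=5)
f_bat = batch_topk(pre_act, k=5)
per_token_counts = np.count_nonzero(f_bat, axis=1)
```

With BatchTopK, `per_token_counts` can vary from row to row while still averaging k, which lets tokens that need more features borrow budget from tokens that need fewer.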

The Feature Split

The m features are partitioned into two groups:

High-level features f_0:h(x_t) — the first h indices. These are the features that will be encouraged to be temporally consistent. In practice, h = 0.2m (20% of features).
Low-level features f_h:m(x_t) — the remaining m − h indices. These are unconstrained and learn to capture the residual (syntax, per-token patterns).

For a 16k SAE: h = 3,277 high-level features, 13,107 low-level features. The split ratio (20/80) is a hyperparameter — ablations in Section 4.6 of the paper show that 10/90 and 20/80 both work well, while 50/50 hurts syntax recovery.

The Decoder

The decoder reconstructs x_t from the sparse features:

x̂(f) = W^dec f(x_t) + b^dec

where W^dec ∈ R^d×m is the decoder matrix and b^dec ∈ R^d is the decoder bias. The decoder is split to match the feature partition:

W^dec_0:h ∈ R^d×h — decoder columns for high-level features
W^dec_h:m ∈ R^d×(m-h) — decoder columns for low-level features

Their concatenation equals the full decoder: W^dec = [W^dec_0:h | W^dec_h:m].

Why split the decoder too? Because the loss function has TWO reconstruction terms (Matryoshka-style): one that measures how well the high-level features alone reconstruct x_t, and one that measures how well ALL features reconstruct x_t. This forces the high-level features to do most of the work, with the low-level features capturing the residual — exactly matching Assumption 2.
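The two reconstruction terms can be sketched in a few lines of NumPy. Everything here is a toy under stated assumptions: random decoder weights, a random non-negative vector standing in for the sparse features, and illustrative names `L_H` (high-level only) and `L_L` (all features) for the two terms.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, h = 8, 20, 4                       # toy sizes; h = number of high-level features
W_dec = rng.normal(size=(d, m))
b_dec = rng.normal(size=d)
f = np.maximum(rng.normal(size=m), 0.0)  # stand-in for a sparse feature vector
x = rng.normal(size=d)

x_hat_high = W_dec[:, :h] @ f[:h] + b_dec      # high-level features only
x_hat_full = W_dec @ f + b_dec                 # all features

L_H = float(np.sum((x - x_hat_high) ** 2))     # high-level reconstruction term
L_L = float(np.sum((x - x_hat_full) ** 2))     # full reconstruction term
L_matr = L_H + L_L

# The low-level columns only ever add a correction on top of x_hat_high.
low_correction = W_dec[:, h:] @ f[h:]
```

Because the full reconstruction is the high-level reconstruction plus `low_correction`, penalizing both terms pushes the first h columns to explain as much as they can on their own.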

Full Architecture Diagram

T-SAE Architecture

The complete data flow from input activation to reconstruction. High-level features (orange) are encouraged to be temporally consistent via contrastive loss. Low-level features (teal) capture the residual.

Implementation Details

Component	Shape	Value
Input activation x_t	R^d	d = 768 (Pythia) or 2304 (Gemma2-2b)
Encoder W^enc	R^m×d	m = 16,384 features
Feature vector f(x_t)	R^m	About k = 20 nonzero entries per token on average (BatchTopK)
High-level split	R^h	h = 0.2m = 3,277 features
Low-level split	R^m-h	m - h = 13,107 features
Decoder W^dec	R^d×m	Same d, m as encoder
Contrastive weight α	scalar	α = 1.0
Model layer	—	Layer 8 (Pythia), Layer 12 (Gemma)

python
import torch
import torch.nn as nn

class TemporalSAE(nn.Module):
    def __init__(self, d_model, n_features, n_high, k=20):
        super().__init__()
        self.n_high = n_high          # h: number of high-level features
        self.k = k                     # BatchTopK sparsity
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def encode(self, x):
        # x: (batch, d_model) -> f: (batch, n_features)
        pre_act = self.encoder(x)      # (B, m)
        # BatchTopK: keep the k * B largest activations across the whole
        # batch, so each sample has k active features on average
        flat = pre_act.flatten()
        topk_vals, topk_idx = torch.topk(flat, self.k * x.shape[0])
        f = torch.zeros_like(flat)
        f.scatter_(0, topk_idx, topk_vals)
        return f.view_as(pre_act)

    def decode_high(self, f):
        # Reconstruct using only high-level features
        f_high = f.clone()
        f_high[:, self.n_high:] = 0   # zero out low-level
        return self.decoder(f_high)

    def forward(self, x):
        f = self.encode(x)             # (B, m), sparse
        x_hat = self.decoder(f)        # full reconstruction
        x_hat_high = self.decode_high(f)  # high-level only
        z = f[:, :self.n_high]         # high-level features for contrastive
        return x_hat, x_hat_high, z

In a T-SAE with 16,384 features and a 20/80 split, how many features are in the high-level partition, and what sparsity constraint is applied?

3,277 high-level features; BatchTopK with k=20 means about 20 features (out of all 16,384) are active per token on average
8,192 high-level features with L1 regularization
3,277 high-level features with no sparsity constraint on the high-level split

Chapter 4: The Contrastive Loss

This is the heart of the paper. Everything else — the feature split, the Matryoshka reconstruction, the architecture — is borrowed from prior work. The contrastive loss is what makes T-SAEs work.

Manufacturing the Need

Suppose you just train with the Matryoshka reconstruction loss (L_H + L_L) without any temporal constraint. What happens? Both feature groups learn the same kind of features. The high-level split has no reason to prefer smooth, semantic features over noisy, syntactic ones. The split is meaningless.

We need a loss term that says: "the high-level features for token t should be similar to the high-level features for token t−1." But we also need to prevent collapse — we don't want ALL tokens in the batch to have the same high-level features (that would just be a constant bias term).

This is exactly the setup for contrastive learning: pull positive pairs together, push negative pairs apart.

Defining the Loss

Let z_t = f_0:h(x_t) be the high-level features of token t. Let s(x, y) be the cosine similarity between x and y. We define:

Positive pairs: (z_t⁽ⁱ⁾, z_t-1⁽ⁱ⁾) — high-level features of adjacent tokens from the same sequence. These should be similar.

Negative pairs: (z_t⁽ⁱ⁾, z_t-1^(j)) where i ≠ j — high-level features from different sequences in the batch. These should be dissimilar.

The contrastive loss is a symmetric InfoNCE-style objective:

L_contr = −(1/N) ∑_i log [exp(s(z_t⁽ⁱ⁾, z_t-1⁽ⁱ⁾)) / ∑_j exp(s(z_t⁽ⁱ⁾, z_t-1^(j)))]
− (1/N) ∑_i log [exp(s(z_t⁽ⁱ⁾, z_t-1⁽ⁱ⁾)) / ∑_j exp(s(z_t^(j), z_t-1⁽ⁱ⁾))]

Let's break this down piece by piece:

First term: For each sequence i, how well can we identify the correct z_t-1⁽ⁱ⁾ among all the z_t-1^(j) in the batch? This is a "which previous token came from the same sequence?" classification problem.
Second term: The symmetric version — for each z_t-1⁽ⁱ⁾, can we identify the correct z_t⁽ⁱ⁾ among all z_t^(j)?
Symmetry prevents cheating: Without both terms, the model could collapse one side to a constant. The symmetry ensures both z_t and z_t-1 carry meaningful information.

Deriving Why This Works: A Worked Example

Let's trace through a concrete batch with N = 4 sequences:

Sequence i	Token t content	Token t-1 content	s(z_t⁽ⁱ⁾, z_t-1⁽ⁱ⁾)
1	"plants" (biology)	"convert" (biology)	0.9
2	"war" (history)	"Napoleon" (history)	0.85
3	"integral" (math)	"derivative" (math)	0.88
4	"plaintiff" (law)	"verdict" (law)	0.82

Cross-sequence similarities (negatives) should be low:

s(z_t⁽¹⁾, z_t-1⁽²⁾) = s("plants" features, "Napoleon" features) ≈ 0.1
s(z_t⁽¹⁾, z_t-1⁽³⁾) = s("plants" features, "derivative" features) ≈ 0.15
s(z_t⁽¹⁾, z_t-1⁽⁴⁾) = s("plants" features, "verdict" features) ≈ 0.05

For sequence 1, the first term becomes:

−log [exp(0.9) / (exp(0.9) + exp(0.1) + exp(0.15) + exp(0.05))]

= −log [2.460 / (2.460 + 1.105 + 1.162 + 1.051)]

= −log [2.460 / 5.778] = −log(0.426) ≈ 0.854

If the model improves the positive similarity to 0.95 while keeping negatives at 0.1:

−log [exp(0.95) / (exp(0.95) + exp(0.1) + exp(0.1) + exp(0.1))]

= −log [2.586 / (2.586 + 3 × 1.105)] = −log [2.586 / 5.901] = −log(0.438) ≈ 0.825

The loss decreases. The gradient pushes the encoder to make same-sequence high-level features more similar and cross-sequence features less similar.
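The arithmetic above can be checked with a few lines of standard-library Python. The helper `info_nce_term` is a hypothetical name for one anchor's term of the loss. It evaluates at full floating-point precision, so the last digit can differ slightly from values computed with hand-rounded intermediates.

```python
import math

def info_nce_term(pos, negs):
    """-log( exp(pos) / (exp(pos) + sum(exp(neg))) ) for one anchor."""
    denom = math.exp(pos) + sum(math.exp(n) for n in negs)
    return -math.log(math.exp(pos) / denom)

before = info_nce_term(0.9, [0.1, 0.15, 0.05])    # sequence 1 in the table
after = info_nce_term(0.95, [0.1, 0.1, 0.1])      # after the improvement
chance = info_nce_term(0.0, [0.0, 0.0, 0.0])      # all similarities equal: log(4)
```

One thing the numbers make visible: with raw cosine similarities in [−1, 1] and no temperature, a term can only fall modestly below the chance value log(N), so the loss acts as a ranking signal rather than something that can be driven to zero.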

The Full Loss

Combining everything, the total T-SAE loss is:

L = ∑_i=1^N L_matr(x_t⁽ⁱ⁾) + α L_contr

where:

L_matr = L_H + L_L (the Matryoshka reconstruction loss)
L_H = ‖x_t − (W^dec_0:h f_0:h(x_t) + b^dec)‖₂² (high-level reconstruction)
L_L = ‖x_t − (W^dec f(x_t) + b^dec)‖₂² (full reconstruction)
L_contr (the symmetric contrastive loss on z_t = f_0:h(x_t))
α = 1.0 (contrastive weight; no tuning needed!)

Why α = 1.0 just works: The reconstruction loss is in units of squared L2 distance (typically ~0.1-1.0). The contrastive loss is a log-softmax (typically ~1.0-3.0). At α = 1.0, both terms contribute meaningfully. The paper ablates this and finds that α = 1.0 is robust — no careful tuning required.

Training Data: Loading Activation Pairs

In practice, activations are loaded as pairs (x_t, x_t-1) — adjacent tokens from the same sequence. The pairs are shuffled across the batch so that negative pairs come from different sequences. This means each training batch is 2× the normal size (each sample is a pair), which reduces the effective batch size by half for the same memory budget.
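A minimal sketch of the pairing step, assuming activations arrive as one (T, d) array per sequence. The function name `make_adjacent_pairs` and the returned `seq_id` array are illustrative choices, not something the paper specifies.

```python
import numpy as np

def make_adjacent_pairs(sequences, rng):
    """Turn a list of (T_i, d) activation arrays into shuffled (x_t, x_{t-1}) pairs.

    Returns x_t, x_prev of shape (P, d) and seq_id of shape (P,), where row p of
    x_prev is the token immediately before row p of x_t in the same sequence.
    """
    x_t = np.concatenate([s[1:] for s in sequences])
    x_prev = np.concatenate([s[:-1] for s in sequences])
    seq_id = np.concatenate(
        [np.full(len(s) - 1, i) for i, s in enumerate(sequences)]
    )
    perm = rng.permutation(len(x_t))     # one shared permutation keeps pairs aligned
    return x_t[perm], x_prev[perm], seq_id[perm]

rng = np.random.default_rng(0)
sequences = [rng.normal(size=(T, 4)) for T in (5, 7, 3)]   # toy activations
x_t, x_prev, seq_id = make_adjacent_pairs(sequences, rng)
```

Returning `seq_id` lets a batch sampler check that the pairs in a batch come from different sequences, which is what the negatives in the contrastive loss assume. How strictly to enforce that is left open here.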

What the paper doesn't emphasize: This 2× memory cost is the main computational overhead of T-SAEs. The contrastive loss itself is cheap (cosine similarities + log-softmax over the batch). But needing to store both x_t and x_t-1 for every sample in the batch halves throughput. This is a real limitation for very large-scale training.

Interactive Contrastive Loss

The high-level feature representations of a small batch of tokens, with the contrastive loss computed from them. Positive pairs (same sequence) want high similarity; negative pairs (different sequences) want low similarity.

python
import torch
import torch.nn.functional as F

def contrastive_loss(z_t, z_prev):
    # z_t, z_prev: (N, h) high-level features for adjacent tokens
    # Unit-normalize rows; F.normalize clamps the norm with a small eps, so an
    # all-zero sparse feature vector does not produce NaNs
    z_t_norm = F.normalize(z_t, dim=-1)
    z_prev_norm = F.normalize(z_prev, dim=-1)
    sim = z_t_norm @ z_prev_norm.T      # (N, N) pairwise cosine similarities

    # Positive pairs are on the diagonal: sim[i, i]
    # Row-wise: for each z_t^(i), classify correct z_prev
    labels = torch.arange(z_t.size(0), device=z_t.device)
    loss_row = F.cross_entropy(sim, labels)

    # Column-wise: for each z_prev^(i), classify correct z_t
    loss_col = F.cross_entropy(sim.T, labels)

    # Sum of the two directions, matching the two terms of L_contr above
    return loss_row + loss_col

def tsae_loss(x_t, x_prev, model, alpha=1.0):
    x_hat_t, x_hat_high_t, z_t = model(x_t)
    x_hat_prev, x_hat_high_prev, z_prev = model(x_prev)

    L_H = (x_t - x_hat_high_t).pow(2).sum(-1).mean()
    L_L = (x_t - x_hat_t).pow(2).sum(-1).mean()
    L_matr = L_H + L_L

    L_contr = contrastive_loss(z_t, z_prev)
    return L_matr + alpha * L_contr

In the contrastive loss, what are the positive and negative pairs?

Positive: high-level features of adjacent tokens from the SAME sequence. Negative: high-level features from DIFFERENT sequences in the batch. The loss encourages same-sequence consistency while preventing collapse.
Positive: all features of the same token. Negative: features from different layers of the model.
Positive: high-level and low-level features within the same token. Negative: features from different tokens.

Chapter 5: Disentanglement

The whole point of T-SAEs is to separate semantic from syntactic features. But does it actually work? And how do we even measure "disentanglement"?

Probing for Semantics, Context, and Syntax

The paper uses linear probes — simple logistic regression classifiers trained on top of SAE features — to measure what information each feature split encodes.

The setup: take MMLU questions (multiple-choice academic questions from 57 subjects), encode the last 20 tokens of each question through the LLM, extract SAE features, and train probes to predict:

Semantics: The question's subject category ("High School European History," "Professional Medicine"). This is a sequence-level property — it doesn't change across the 20 tokens.
Context: The question ID (which specific question this is). This is also sequence-level but finer-grained than category.
Syntax: The part-of-speech tag of each token (noun, verb, adjective, etc.). This changes every token.

Why this is a brilliant evaluation: If T-SAEs truly disentangle semantics from syntax, then:
• High-level features should be highly predictive of semantics and context, but not syntax.
• Low-level features should be predictive of syntax, but less so for semantics.
• Baseline SAEs (Matryoshka, BatchTopK) won't show this split — both feature groups will be similarly predictive of everything.
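To show what such a probe comparison looks like mechanically, here is a self-contained sketch on synthetic data. It is not the paper's setup: the "high" split is constructed to depend on a made-up subject label and the "low" split is pure noise, and the probe is a multinomial logistic regression trained with plain gradient descent (`fit_linear_probe` is a hypothetical helper).

```python
import numpy as np

def fit_linear_probe(X, y, n_classes, lr=0.5, steps=300):
    """Multinomial logistic regression trained with plain gradient descent."""
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                         # one-hot targets
    for _ in range(steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        grad = (P - Y) / len(X)
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(W, b, X, y):
    return float(np.mean(np.argmax(X @ W + b, axis=1) == y))

rng = np.random.default_rng(0)
n, n_classes, dim = 600, 3, 6
y = rng.integers(0, n_classes, size=n)               # synthetic "subject" labels
means = 3.0 * np.eye(n_classes, dim)                 # one direction per subject
X_high = means[y] + rng.normal(size=(n, dim))        # label-dependent split
X_low = rng.normal(size=(n, dim))                    # label-independent split

train, test = slice(0, 400), slice(400, None)
acc = {}
for name, X in {"high": X_high, "low": X_low}.items():
    W, b = fit_linear_probe(X[train], y[train], n_classes)
    acc[name] = probe_accuracy(W, b, X[test], y[test])
```

The probe on the label-dependent split scores far above the one on the noise split, which sits near chance (1/3). A real evaluation would swap in actual SAE feature splits and held-out MMLU labels.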

The Results

And that's exactly what happens. Using Gemma2-2b SAEs on MMLU:

SAE Type	Semantics (Acc)	Context (Acc)	Syntax (Acc)
T-SAE (all features)	0.91	0.95	0.81
Matryoshka SAE	0.82	0.87	0.83
BatchTopK SAE	0.80	0.85	0.82
Baseline (model activations)	0.84	0.89	0.79

The T-SAE beats the two baseline SAEs on semantics by 9 to 11 percentage points and on context by 8 to 10 points; against raw model activations the margins are 7 and 6 points. It is slightly worse than the baseline SAEs on syntax (0.81 vs. 0.82 to 0.83), which is expected and arguably desirable, because the high-level features now capture semantics instead of spending capacity on syntax.

The t-SNE Visualization

The paper's Figure 2 provides a visual confirmation. When you plot T-SAE high-level feature activations in 2D (via t-SNE) and color by question category, you see clean clusters for each subject. Matryoshka SAE features, plotted the same way, show no clear clustering.

Even more telling: when you color by syntax (part-of-speech), T-SAE low-level features cluster cleanly by POS tag, while T-SAE high-level features don't. The split is working.

Feature Disentanglement Probe

Each dot represents a token's SAE features projected to 2D, showing how T-SAE features cluster differently than baseline SAEs under different labeling schemes. Orange = T-SAE high-level, Teal = T-SAE low-level, Gray = Baseline SAE.

Disentanglement Within the Feature Splits

The paper goes further: it probes each split separately. For T-SAEs:

High-level split alone: High semantics accuracy (0.89), high context (0.93), low syntax (0.72). As expected — these features capture meaning, not grammar.
Low-level split alone: Lower semantics (0.78), lower context (0.82), but higher syntax (0.83). These features specialize in per-token patterns.

For Matryoshka SAEs, both splits perform similarly on all three tasks — no specialization. The 20/80 partition is arbitrary without the contrastive loss to drive differentiation.

What the paper doesn't say but is crucial: The low-level features can still recover some semantic information (0.78 accuracy). This makes sense — even syntactic features carry some semantic signal (certain POS patterns correlate with certain topics). The disentanglement isn't perfect, but it's dramatically better than baselines.

When probing T-SAE feature splits separately, what pattern indicates successful disentanglement?

Both splits have equal accuracy on all three probe tasks
The high-level split has zero syntax accuracy
High-level features excel at semantics/context but are weaker on syntax; low-level features show the opposite pattern, so each split specializes

Chapter 6: Experiments

A natural worry: does adding the contrastive loss hurt reconstruction quality? If T-SAEs disentangle features but can't reconstruct the input, they're useless. The paper evaluates five standard SAE metrics.

Standard SAE Metrics

Metric	What it measures	Higher or lower is better?
FVE (Fraction Variance Explained)	1 − Var(x − x̂) / Var(x). How much of the input's variance the SAE captures.	Higher ↑
Cosine Similarity	cos(x, x̂). Are the reconstructions pointing in the right direction?	Higher ↑
Fraction Alive	What fraction of the 16k features activate at least once on the test data?	Higher ↑
Smoothness	Average max absolute change in active feature activations, normalized by the change in the model's activations. Lower = smoother.	Lower ↓
AutoInterp Score	Can an LLM (Llama3.3-70B) generate a correct feature explanation? SAEBench evaluation.	Higher ↑
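The first three metrics in the table are simple to compute. The sketch below is one reasonable reading of the definitions, assuming variance is summed over dimensions for FVE. The exact conventions used by the paper or by SAEBench may differ, and the function names are illustrative.

```python
import numpy as np

def fraction_variance_explained(X, X_hat):
    """1 - Var(x - x_hat) / Var(x), with variance summed over dimensions."""
    resid_var = np.var(X - X_hat, axis=0).sum()
    total_var = np.var(X, axis=0).sum()
    return float(1.0 - resid_var / total_var)

def mean_cosine_similarity(X, X_hat):
    num = np.sum(X * X_hat, axis=1)
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(X_hat, axis=1)
    return float(np.mean(num / den))

def fraction_alive(F):
    """Share of features (columns of F) that are nonzero on at least one token."""
    return float(np.mean(np.any(F != 0, axis=0)))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
X_hat = X + 0.1 * rng.normal(size=X.shape)   # a reconstruction with small noise
F = np.zeros((1000, 8))
F[:, :6] = 1.0                               # 6 of 8 toy features ever fire

fve = fraction_variance_explained(X, X_hat)
cos = mean_cosine_similarity(X, X_hat)
alive = fraction_alive(F)
```

FVE and cosine similarity can disagree: FVE penalizes errors in magnitude, while cosine similarity only looks at direction. That is why both are reported.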

Core Results (Table 1 from the paper)

Model	SAE	FVE ↑	CosSim ↑	Alive ↑	Smooth (High) ↓	Smooth (Low) ↓	AutoInterp ↑
Pythia-160m	Temporal SAE	0.94	0.93	0.87	0.09	0.17	0.81 ± 0.17
	Matryoshka SAE	0.95	0.94	0.89	0.12	0.13	0.83 ± 0.16
	BatchTopK SAE	0.95	0.94	0.84	0.13	—	0.85 ± 0.15
Gemma2-2b	Temporal SAE	0.75	0.88	0.78	0.10	0.15	0.83 ± 0.15
	Matryoshka SAE	0.75	0.89	0.76	0.14	0.12	0.83 ± 0.16
	BatchTopK SAE	0.76	0.89	0.66	0.13	—	0.83 ± 0.16

The key takeaway: T-SAEs achieve competitive FVE and cosine similarity, within 0.01 of the baselines on both models. Their high-level features are noticeably smoother (0.09 vs. 0.12-0.13 on Pythia, 0.10 vs. 0.13-0.14 on Gemma2-2b). Smoothness is the quantitative signature of temporal consistency. The AutoInterp scores are within one standard deviation of the baselines, meaning T-SAE features are about as interpretable to LLM judges.

Reading the Smoothness Metric

The smoothness metric deserves careful explanation. For a sequence of length T and a set of active features, we compute:

Δ_s = (1/n) ∑_i=1ⁿ max_t |f_i(x_t) − f_i(x_t-1)| / ‖x_t − x_t-1‖

This is the maximum absolute change in feature activation, normalized by how much the underlying model activation changed. A smooth feature might have Δ_s = 0.05 (it barely changes even when the model activation changes a lot). A noisy syntactic feature might have Δ_s = 0.5 (it spikes wildly).
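A sketch of Δ_s for a single sequence, under one reading of the formula: the normalized change is computed at each step, the maximum is taken over time for each feature, and the result is averaged over features. Which features count as "active" is left to the caller, and the small `eps` guarding against division by zero is an added assumption.

```python
import numpy as np

def smoothness(F, X, eps=1e-8):
    """Delta_s for one sequence.

    F: (T, n) activations of the n features being scored.
    X: (T, d) model activations for the same tokens.
    """
    feat_change = np.abs(np.diff(F, axis=0))                  # (T-1, n)
    act_change = np.linalg.norm(np.diff(X, axis=0), axis=1)   # (T-1,)
    ratio = feat_change / (act_change[:, None] + eps)
    return float(np.mean(np.max(ratio, axis=0)))              # max over t, mean over i

rng = np.random.default_rng(0)
T, d = 40, 16
X = rng.normal(size=(T, d))
F_smooth = np.ones((T, 3)) + 0.01 * rng.normal(size=(T, 3))   # nearly constant
F_noisy = rng.normal(size=(T, 3))                             # redrawn every token

delta_smooth = smoothness(F_smooth, X)
delta_noisy = smoothness(F_noisy, X)
```

Because the metric takes a maximum over time, a single spike is enough to make a feature look rough. That makes it a strict test for features that are supposed to track a topic.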

T-SAE high-level features have the lowest smoothness score of any method. Their low-level features are appropriately less smooth, confirming that the split works as intended.

Smoothness Comparison

Feature activations over a sequence for three SAE types. T-SAE high-level features (orange, solid) are visibly smoother than Matryoshka (gray, dashed) or BatchTopK (gray, dotted). T-SAE low-level features (teal) are appropriately less smooth.

What About Fraction Alive?

An interesting detail: T-SAEs have a higher fraction of alive features than BatchTopK SAEs (0.87 vs. 0.84 on Pythia, 0.78 vs. 0.66 on Gemma2-2b), and are roughly on par with Matryoshka SAEs (0.89 on Pythia, 0.76 on Gemma2-2b). More of the 16k features activate at least once, which suggests that the hierarchical objective shared with Matryoshka SAEs helps the SAE use its feature capacity. Dead features (features that never activate) are wasted parameters, so having fewer of them is a benefit.

Do T-SAEs sacrifice reconstruction quality for better disentanglement?

Yes, FVE drops significantly compared to baselines
No: FVE and cosine similarity are essentially identical to baselines, while gaining much smoother high-level features and better semantic probe accuracy
The reconstruction is better but interpretability is worse

Chapter 7: The Safety Case Study

Academic metrics are nice, but do T-SAEs actually help with real problems? The paper presents two compelling case studies: understanding alignment datasets and steering model behavior.

Case Study 1: Understanding HH-RLHF

The HH-RLHF dataset (Bai et al., 2022) is Anthropic's human preference dataset used to train safety-focused models. It contains pairs of completions — one "chosen" (preferred by human raters) and one "rejected." The paper asks: what features differentiate chosen from rejected completions?

Method: For each completion pair, compute the difference in mean T-SAE feature activations (rejected − chosen). Features with the largest average difference are the ones that best distinguish rejected from chosen completions, and so are candidates for separating unsafe from safe content.
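The method can be sketched with NumPy on synthetic data. The arrays below are random stand-ins for SAE feature activations, with one feature deliberately planted to fire more on "rejected" completions; `mean_activation_diff` is a hypothetical helper name.

```python
import numpy as np

def mean_activation_diff(F_rejected, F_chosen):
    """Per-pair (rejected - chosen) difference of mean feature activations.

    F_rejected, F_chosen: lists of (T_i, m) arrays, one per completion.
    Returns an array of shape (n_pairs, m).
    """
    return np.stack([
        r.mean(axis=0) - c.mean(axis=0)
        for r, c in zip(F_rejected, F_chosen)
    ])

rng = np.random.default_rng(0)
m = 6
F_chosen = [np.abs(rng.normal(size=(T, m))) for T in (10, 12, 8)]
F_rejected = [np.abs(rng.normal(size=(T, m))) for T in (15, 9, 11)]
for r in F_rejected:
    r[:, 2] += 2.0                      # plant a feature that fires more when rejected

diffs = mean_activation_diff(F_rejected, F_chosen)   # (3, m)
ranking = np.argsort(-diffs.mean(axis=0))            # features by average difference
```

Averaging over tokens first makes completions of different lengths comparable, although length can still leak into the ranking in other ways.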

What T-SAEs find: The top features activated more in rejected (unsafe) completions include "varied text concepts," "etiquette and social behavior guidelines," "text about personal experiences and opinions," "social issues and controversy," "crime and malicious activities," and "violent or aggressive behavior descriptions." These are exactly the kind of safety-relevant semantic features you'd want for monitoring.

What Matryoshka SAEs find for the same analysis: "specific bicycle components," "terms related to data management," "references to ecosystem dynamics and environmental conditions." These are random noise features that happen to correlate with rejected completions for spurious reasons (like response length).

The Spurious Correlation Problem

Here's a subtle but critical finding. Some T-SAE features that show high activation differences are actually spuriously correlated with response length, not with actual safety content. The paper identifies these by computing the Pearson correlation between feature activation difference and response length difference.

Feature	Avg Diff (rejected − chosen)	Corr with length	Type
transition words and phrases	0.063	0.52	Length-related
legal and formal language	0.058	0.38	Length-related
the word "the"	0.047	0.31	Length-related
crime and malicious activities	0.060	0.12	Semantically relevant
violent or aggressive behavior	0.044	0.05	Semantically relevant
negative comments and insults	0.043	-0.08	Semantically relevant

The semantically relevant features (the last three rows) have low correlation with length: they genuinely capture unsafe content, not just the fact that rejected responses tend to be longer. The length-related features (the first three rows) are spurious confounders. T-SAEs recover both, but critically, the semantically relevant ones are clearly identified and separable.
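Here is a hedged sketch of the length check: a per-feature Pearson correlation between the activation difference and the length difference across pairs. The function name and the 0.3 cutoff are assumptions for illustration (the cutoff merely happens to separate the two groups in the table above); the paper may use a different criterion.

```python
import numpy as np

def length_correlation(act_diffs, len_diffs):
    """act_diffs: (P, F) per-pair activation difference (rejected - chosen).
    len_diffs: (P,) per-pair response-length difference (rejected - chosen).
    Returns an (F,) vector of Pearson correlations (NaN for constant inputs)."""
    act_diffs = np.asarray(act_diffs, dtype=float)
    len_diffs = np.asarray(len_diffs, dtype=float)
    a = act_diffs - act_diffs.mean(axis=0)
    l = len_diffs - len_diffs.mean()
    denom = np.sqrt((a ** 2).sum(axis=0) * (l ** 2).sum())
    return (a * l[:, None]).sum(axis=0) / np.where(denom == 0, np.nan, denom)

# Toy data: feature 0 scales with length, feature 1 is unrelated to it.
act_diffs = np.array([[2.0, 1.0], [4.0, -1.0], [6.0, -1.0], [8.0, 1.0]])
len_diffs = np.array([1.0, 2.0, 3.0, 4.0])
r = length_correlation(act_diffs, len_diffs)
length_related = np.abs(r) > 0.3   # hypothetical cutoff
print(r, length_related)
```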

Case Study 2: Steering

Can T-SAE features be used to steer model generation? The paper intervenes on features during inference by adding α · d_i to the model's residual stream (where d_i is the decoder column for feature i and α is the intervention strength).
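In code, the intervention is a single broadcasted addition. The sketch below assumes the decoder is stored as a (d × F) matrix with one column per feature and that the same shift is applied at every position; both are illustrative assumptions, and hooking this into an actual model forward pass is left out.

```python
import numpy as np

def steer(resid, W_dec, feature_idx, alpha):
    """resid: (T, d) residual-stream activations for T positions.
    W_dec: (d, F) decoder matrix, one column d_i per feature.
    Returns a new array with alpha * d_i added at every position."""
    d_i = W_dec[:, feature_idx]      # decoder direction for feature i, shape (d,)
    return resid + alpha * d_i       # broadcasts over the T positions

# Toy example: 3 positions, d = 2, F = 3.
resid = np.zeros((3, 2))
W_dec = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 3.0]])
print(steer(resid, W_dec, feature_idx=2, alpha=0.5))
```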

The key finding: steering with high-level (semantic) features is dramatically more effective than steering with low-level (syntactic) features.

High-level steering: Changes the semantics of the generation. The model talks about different topics, with coherent text. Coherence is preserved even at high intervention strengths.
Low-level steering: Causes token repetition and degenerate text. Because syntactic features are per-token, amplifying them just repeats specific tokens or corrupts grammar.

Practical implication: T-SAEs provide a principled way to find the "right" features for steering. Instead of trying thousands of features and hoping one works, you can restrict to the high-level partition. This partition was learned automatically from the data, not hand-selected.

Steering: Semantic vs. Syntactic Features

Simulated steering with high-level (semantic) vs. low-level (syntactic) features at varying intervention strengths. High-level steering changes the topic while maintaining coherence. Low-level steering degrades to repetition.

Why is steering with high-level T-SAE features more effective than steering with low-level features?

High-level features capture sequence-level semantics (topic, intent), so amplifying them coherently changes what the model talks about. Low-level features are per-token syntax, so amplifying them causes degenerate token repetition.
High-level features have larger decoder norms, so they have more impact on the output
Low-level features are not learned properly during training

Chapter 8: Ablations

The paper ablates three key design choices in the T-SAE training pipeline. Each ablation answers a specific question about why the method works.

Ablation 1: Feature Split Ratio

How much of the feature space should be "high-level"? The paper tests 10/90, 20/80, and 50/50 splits.

Split (High/Low)	FVE	Smoothness (High)	Semantics	Context	Syntax
10/90	0.94	0.08	0.88	0.92	0.82
20/80 (default)	0.94	0.09	0.91	0.95	0.81
50/50	0.94	0.11	0.90	0.94	0.73

The tradeoff: Moving from 10/90 to 20/80 improves semantic and context accuracy (0.88 → 0.91 and 0.92 → 0.95) at almost no cost to syntax (0.82 → 0.81). Going further to 50/50 brings no additional gain on semantics or context (0.90 and 0.94), while syntax accuracy drops to 0.73 and high-level smoothness worsens slightly (0.09 → 0.11): too many features are forced to be smooth, starving the low-level partition of the capacity it needs to capture syntax. 20/80 is the sweet spot, with enough high-level features for semantics and enough low-level features for syntax.

Ablation 2: Contrastive Window

Instead of contrasting with the immediately previous token (t−1), what if we contrast with a random token from the past context? The paper samples the contrastive partner x_{t−r} uniformly from the preceding tokens x_1, ..., x_{t−1}, with the offset restricted to r < 25.

Contrastive Partner	Semantics	Context	Syntax
Adjacent (t-1) [default]	0.91	0.95	0.81
Random past (r < 25)	0.89	1.06	0.71

Random contrasting boosts context accuracy significantly (+11 points) because it encourages longer-range consistency. But it hurts syntax accuracy (−10 points) because the low-level features now have less capacity: the high-level features are "greedier," capturing more information. Semantic accuracy also dips slightly (0.91 → 0.89).

The insight: Depending on the application, you might prefer different contrastive windows. For safety monitoring (where long-range context matters most), random contrasting might be better. For linguistic analysis (where you need syntax too), adjacent contrasting is safer.
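The two partner-selection schemes compared in this ablation can be sketched as index samplers. The function names are hypothetical, and the handling of positions near the start of the sequence (where fewer than 24 past tokens exist) is an assumption.

```python
import numpy as np

def adjacent_partner(t):
    """Default scheme: contrast token t with token t - 1."""
    return t - 1

def random_past_partner(t, rng, max_offset=25):
    """Sample an offset r uniformly with 1 <= r < max_offset, clipped so that
    t - r never falls before the start of the sequence. Requires t >= 1."""
    r_max = min(max_offset - 1, t)
    r = int(rng.integers(1, r_max + 1))   # upper bound is exclusive
    return t - r

rng = np.random.default_rng(0)
print(adjacent_partner(40), random_past_partner(40, rng))
```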

Ablation 3: No Contrastive Loss (Just Matryoshka)

What if you remove the contrastive loss entirely and just use the Matryoshka reconstruction objective with the 20/80 split?

Configuration	Δ Semantics	Δ Context	Δ Syntax
No contrastive (α = 0)	−0.07	−0.10	+0.01
Naive L2 smoothness (not contrastive)	−0.02	+0.02	+0.07

Without the contrastive loss, semantics drops by 7 points and context by 10 points. The Matryoshka split alone is not enough — the contrastive loss is essential for driving specialization.

The "naive L2 smoothness" ablation replaces the contrastive loss with a simple per-sample L2 penalty: ℓ = α‖z_t − z_{t−1}‖₂². This enforces smoothness but without the contrastive negative pairs. It performs slightly worse on semantics (−0.02) even though syntax improves (+0.07): without negatives to prevent collapse, all high-level features converge to similar values and lose discriminative power.

Why contrastive beats naive smoothness: The naive L2 loss says "be similar to the previous token" but has no mechanism to prevent all sequences from having the same high-level features. The contrastive loss says "be similar to YOUR previous token but DIFFERENT from other sequences' previous tokens." The negatives are what prevent collapse and force the features to encode meaningful, discriminative semantic information.
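A small numerical sketch makes the collapse argument tangible. It compares a naive L2 penalty with a symmetric InfoNCE loss on cosine similarities; the temperature of 1.0 and all function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def _log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def cosine_sim_matrix(a, b, eps=1e-8):
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T                                   # entry (i, j) = s(a_i, b_j)

def symmetric_info_nce(z_t, z_prev, temperature=1.0):
    """Row i of z_t and row i of z_prev form the positive pair; every other
    row in the batch is a negative. Averaged over both matching directions."""
    s = cosine_sim_matrix(z_t, z_prev) / temperature
    idx = np.arange(s.shape[0])
    loss_rows = -_log_softmax(s, axis=1)[idx, idx].mean()
    loss_cols = -_log_softmax(s, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_rows + loss_cols))

def naive_l2(z_t, z_prev):
    return float(((z_t - z_prev) ** 2).sum(axis=1).mean())

collapsed = np.ones((4, 3))   # every sequence has identical high-level features
distinct = np.eye(4)          # every sequence has its own feature

print(naive_l2(collapsed, collapsed), naive_l2(distinct, distinct))
print(symmetric_info_nce(collapsed, collapsed), symmetric_info_nce(distinct, distinct))
```

Both configurations are perfectly smooth, so the L2 penalty is zero for each and cannot tell them apart. The contrastive loss penalizes the collapsed batch (it sits at log N, the chance level for N sequences) and rewards the batch whose sequences are distinguishable.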

Ablation Explorer

Probe accuracies as a function of the contrastive weight α and the split ratio. The bars show semantics (orange), context (blue), and syntax (teal). Removing the contrastive loss (α=0) or changing the split ratio changes how well the two partitions specialize.

What happens when you replace the contrastive loss with a naive L2 smoothness penalty (ℓ = α‖z_t − z_{t−1}‖²)?

Semantics drops because without contrastive negatives, all sequences' high-level features collapse to similar values; the features become smooth but not discriminative
Performance is identical because both losses enforce smoothness
The model fails to converge

Chapter 9: Connections

T-SAEs sit at the intersection of three research areas: mechanistic interpretability, contrastive representation learning, and computational linguistics. Let's map the connections and limitations.

Cheat Sheet: Every Key Equation

Symbol	Meaning	Typical Value or Definition
x_t	LLM activation at token t	R⁷⁶⁸ or R²³⁰⁴
f(x_t)	Sparse feature vector	R¹⁶³⁸⁴, only 20 nonzero
f_{0:h}(x_t) = z_t	High-level (semantic) features	First 20% of features
f_{h:m}(x_t)	Low-level (syntactic) features	Remaining 80% of features
L_H	High-level reconstruction loss	‖x_t − (W^dec_{0:h} f_{0:h}(x_t) + b)‖²
L_L	Full reconstruction loss	‖x_t − (W^dec f(x_t) + b)‖²
L_contr	Symmetric contrastive loss (InfoNCE)	On z_t, z_{t−1} across batch
α	Contrastive weight	1.0
L	Total loss	L_H + L_L + αL_contr
s(x, y)	Cosine similarity	[−1, 1]
Δ_s	Smoothness metric	Lower = smoother features
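To see how the pieces of the table fit together, here is a minimal single-token sketch of the two reconstruction losses and the total. It assumes the decoder is a (d × m) matrix whose first h columns belong to the high-level partition and that the bias b is shared by both reconstructions; the function names are illustrative.

```python
import numpy as np

def reconstruction_losses(x, f, W_dec, b, h):
    """x: (d,) activation, f: (m,) sparse features, W_dec: (d, m), b: (d,).
    h: number of high-level features (the first h entries of f)."""
    recon_high = W_dec[:, :h] @ f[:h] + b    # decode from high-level features only
    recon_full = W_dec @ f + b               # decode from all features
    L_H = float(np.sum((x - recon_high) ** 2))
    L_L = float(np.sum((x - recon_full) ** 2))
    return L_H, L_L

def total_loss(L_H, L_L, L_contr, alpha=1.0):
    return L_H + L_L + alpha * L_contr

x = np.array([1.0, 2.0])
W_dec = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
b = np.zeros(2)
f = np.array([1.0, 1.0, 0.5])

L_H, L_L = reconstruction_losses(x, f, W_dec, b, h=2)
print(L_H, L_L, total_loss(L_H, L_L, L_contr=0.5))  # 1.0 1.25 2.75
```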

How T-SAEs Relate to Other Methods

Method	Key Idea	vs. T-SAEs
Standard SAE	Sparse reconstruction	No temporal structure, features are noisy and syntactic
Matryoshka SAE	Hierarchical feature splits	Same split but no contrastive loss → no specialization
BatchTopK SAE	Top-k sparsity applied across the batch	Better sparsity control but still treats tokens as i.i.d.
Transcoders	Causal features via MLP replacement	Different architecture; T-SAEs modify the loss, not the model
CPC / InfoNCE	Contrastive predictive coding	T-SAEs apply the same principle to SAE feature learning
Griffiths et al. (2004)	HMM + LDA for syntax/semantics	T-SAEs are the neural, unsupervised version of this idea

Limitations

Memory cost: Each batch requires 2x activations (both x_t and x_t-1), halving the effective batch size for the same memory budget.
Binary split: Only two levels (high/low). Language has more than two timescales — book-level, paragraph-level, sentence-level, token-level. Multiple temporal hierarchies (multi-scale splits) are an obvious extension.
Contrastive formulation: SAE features are sparse and non-negative. Cosine similarity is designed for dense vectors in R^d. Alternative similarity measures for sparse non-negative spaces might work better.
Assumes adjacent tokens share semantics: This assumption breaks at topic boundaries (e.g., a paragraph about biology followed by one about history). The paper mitigates this by shuffling pairs, but the assumption is imperfect.
Only trained at one layer per model: The paper trains on layer 8 (Pythia) and layer 12 (Gemma). Other layers may benefit from different split ratios or contrastive strategies.

Future Directions

Multi-scale temporal hierarchy: Instead of 2 splits, use 4+ splits corresponding to book, paragraph, sentence, and token timescales. The contrastive window would differ for each level.
State tracking: Smooth, temporally consistent features could serve as "state trackers" for monitoring model behavior over long generations, detecting when the model shifts topic or intent.
Linearized contrastive loss: The paper derives (in Appendix A) a linear approximation to the contrastive loss that exploits the sparse, non-negative geometry of SAE features: −(1/N)∑s_ii + (1/N²)∑(1 + s_ij). This could be faster to optimize and more geometrically appropriate.
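The linearized form quoted above is simple enough to sketch directly. The code assumes the first sum runs over the N positive pairs (the diagonal s_ii) and the second over all N² pairs (i, j), including the diagonal; that reading of the indices is an assumption, as are the function names.

```python
import numpy as np

def cosine_sim_matrix(a, b, eps=1e-8):
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T

def linearized_contrastive(z_t, z_prev):
    """-(1/N) * sum_i s_ii + (1/N^2) * sum_{i,j} (1 + s_ij)."""
    s = cosine_sim_matrix(z_t, z_prev)
    n = s.shape[0]
    return float(-np.trace(s) / n + (1.0 + s).sum() / n ** 2)

distinct = np.eye(4)          # each sequence has its own feature
collapsed = np.ones((4, 3))   # all sequences share the same features
print(linearized_contrastive(distinct, distinct))    # about 0.25
print(linearized_contrastive(collapsed, collapsed))  # about 1.0
```

As with InfoNCE, the collapsed batch is penalized relative to the distinguishable one, but no exponentials or log-sum-exp terms are needed.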

Related Lessons

Prerequisite

Transformers from Zero — understand how LLMs produce the activations T-SAEs decompose.

↓

Contrastive Learning & CLIP — the InfoNCE loss that T-SAEs borrow for temporal consistency.

↓

Application

Reward & Alignment — the RLHF pipeline whose training data T-SAEs help analyze.

"What I cannot create, I do not understand." — Richard Feynman. T-SAEs let us decompose what LLMs create into parts we can understand: meaning and grammar, separately.

What is the main limitation of the T-SAE contrastive formulation that the authors identify as future work?

T-SAEs require too many features to work properly
The contrastive loss is too expensive to compute
SAE features are sparse and non-negative, but the contrastive loss uses cosine similarity designed for dense vectors in R^d; alternative formulations for sparse non-negative spaces may be more geometrically appropriate