Energy-Based Transformers

Chapter 0: The Problem

You're a student taking a math exam. You write down an answer to a problem — but before moving on, you pause. You check your work. You substitute your answer back into the original equation. Does it satisfy the constraints? If not, you revise. You keep checking until you're confident.

This is System 2 Thinking: the slow, deliberate, effortful reasoning that psychologist Daniel Kahneman popularized in contrast to System 1 (fast, automatic, intuitive). System 1 is catching a ball. System 2 is computing 17 × 24 in your head.

Modern AI has gotten remarkably good at System 1. A GPT-class model sees "The dog caught the ___" and instantly outputs "ball" in a single feed-forward pass. No checking. No revision. Just pattern matching at light speed.

But what happens when the problem is hard? When the model encounters something it's never seen during training? When the correct answer requires multi-step reasoning?

The fundamental mismatch: Current feed-forward transformers allocate the SAME amount of computation to every prediction — whether it's predicting "the" after "at" (trivial) or solving a novel math problem (hard). Humans allocate MORE effort to harder problems. Feed-forward transformers cannot.

Recent attempts at System 2 Thinking in AI (o1, DeepSeek-R1, Chain of Thought) have shown exciting results, but they share critical limitations:

Approach	Dynamic Compute?	Self-Verification?	Modality	Requires Supervision?
Chain of Thought	Partial (more tokens)	No	Text only	No, but unreliable
o1 / DeepSeek-R1	Yes (more tokens)	Implicit	Text only	Yes (RL + verifiers)
Best-of-N Sampling	Yes	Requires external verifier	Text only	Yes (trained verifier)
Diffusion Models	Yes (more steps)	No explicit verification	Continuous only	No
EBTs (this paper)	Yes (per prediction)	Yes (energy scalar)	Both discrete & continuous	No — unsupervised

Notice the bottom row. EBTs are the only approach that has all four properties: dynamic computation per prediction, built-in self-verification, works across modalities, and requires no additional supervision beyond standard pretraining.

That's an extraordinary claim. To understand why it works, we need to step back and ask: what does it mean for a model to "verify" its own predictions?

Two facets of human thinking

The paper identifies two cognitive facets that current models mostly lack:

Facet 1: Dynamic Allocation of Computation. Deciding whether to change careers takes more thought than deciding what to eat for lunch. Humans naturally allocate varying amounts of effort depending on difficulty. Feed-forward transformers use the same depth (same FLOPs) for every single prediction. RNNs and diffusion models can vary computation, but lack explicit verification.

Facet 2: Verification of Predictions. Humans don't just generate answers — they check them. A student doesn't just write "x = 5" and move on; they substitute back into the equation to verify. Standard transformers have no mechanism to assess the quality of their own outputs. They predict in the output space, not in a space where the prediction can be compared against the input.

The canvas above shows the key architectural difference. A standard autoregressive transformer maps context to a prediction in one pass. An EBT takes a candidate prediction, feeds it alongside the context, and outputs an energy scalar — a single number indicating how compatible this prediction is with the context. Lower energy = better match.

Now prediction becomes optimization: start with a random guess, compute the energy, take the gradient of energy with respect to the prediction, update the prediction to lower the energy, and repeat. Each iteration is a "thinking step." More steps = more thinking = better prediction.

The core question of this paper: "Can we rely entirely on unsupervised learning to develop System 2 Thinking?" The answer is yes — by learning a verifier (the energy function) instead of a generator, and then generating by optimizing with respect to the verifier.

Why can't standard feed-forward transformers perform System 2 Thinking?

They allocate the same computation to every prediction and have no mechanism to verify or refine their outputs
They are too small to handle complex reasoning
They can only process text, not images

Chapter 1: The Key Insight

Here's a fundamental asymmetry that has been hiding in plain sight for decades: verification is easier than generation.

Think about it. Given a completed Sudoku puzzle, you can verify it in seconds: check each row, column, and 3×3 block. But solving a Sudoku from scratch? That takes minutes of careful reasoning. Given a proof, a mathematician can verify each step far more easily than they could have discovered the proof in the first place. Given a candidate solution to an NP-complete problem, you can check it in polynomial time, but no polynomial-time algorithm for finding one is known.

This asymmetry is well-known in complexity theory (P vs NP), cryptography (verifying a signature vs. forging one), and everyday life (proofreading vs. writing). Yet current AI models do the HARD thing — they generate directly in one shot — and never attempt the EASY thing: verification.

The paradigm shift: Instead of training a model to GENERATE predictions (hard), train it to VERIFY compatibility between inputs and candidate predictions (easier). Then generate by optimizing candidates to minimize the verifier's energy. The verifier is the model; generation is just optimization.

This is the core idea of Energy-Based Models (EBMs). An EBM is a function E_θ(x, y) that takes an input x (context) and a candidate prediction y, and outputs a single scalar: the energy. Low energy means "y is a good prediction given x." High energy means "y is incompatible with x."

The verification-generation duality

Let's make this concrete with next-token prediction. In a standard transformer:

Input

Context tokens: "The dog caught the"

↓

Forward pass (one shot)

Softmax over vocabulary → probability for each token

↓

Output

Most likely token: "ball" (p = 0.43)

In an EBT:

Input

Context: "The dog caught the" + Candidate: [random token distribution]

↓

Forward pass

Energy scalar: E = 8.3 (high → bad match)

↓

Gradient ∇_y E

Direction to update the candidate to lower energy

↓

Update candidate

y ← y − α ∇_y E (gradient descent on prediction)

↓ repeat N times

Converged

Energy: E = 0.7 (low → good match). Final prediction: "ball"

The EBT did something remarkable: it ran gradient descent on its own predictions. Not on the model weights — those are frozen at inference time. On the predictions themselves. Each iteration is a "thinking step" where the model refines its guess to better match the context.
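The loop above can be sketched with a toy energy in plain Python. Everything here is an assumption made for illustration: the quadratic `energy` function stands in for a trained EBT, `target` plays the role of whatever the context makes compatible, and `alpha` and the step count are arbitrary choices.

```python
# Toy stand-in for an EBT: energy is the squared distance between the
# candidate y and a context-dependent target (hypothetical, not the paper's model).
def energy(target, y):
    return sum((yi - ti) ** 2 for yi, ti in zip(y, target))


def grad_energy(target, y):
    # Analytic gradient of the toy energy with respect to the candidate y
    return [2.0 * (yi - ti) for yi, ti in zip(y, target)]


def think(target, y0, alpha=0.25, steps=3):
    """Run `steps` gradient-descent updates on the prediction, not on any weights."""
    y = list(y0)
    trace = [energy(target, y)]
    for _ in range(steps):
        g = grad_energy(target, y)
        y = [yi - alpha * gi for yi, gi in zip(y, g)]
        trace.append(energy(target, y))
    return y, trace


target = [1.0, -2.0, 0.5]
y_final, energies = think(target, y0=[0.0, 0.0, 0.0])
print(energies)  # energy after 0, 1, 2 and 3 thinking steps
```

Only the candidate `y` changes inside `think`; the energy function itself is fixed, which mirrors the frozen weights at inference time.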

Why this enables System 2 Thinking

This formulation naturally gives us both cognitive facets:

Dynamic computation (Facet 1): Easy predictions converge in 2-3 steps. Hard predictions need 10+ steps. The model automatically allocates more computation to harder problems, because harder problems have more complex energy landscapes that take longer to optimize.

Self-verification (Facet 2): The energy value itself IS the verification. After thinking, the model can check: "Is this energy low enough? Should I keep thinking?" It can also generate multiple candidates and pick the one with the lowest energy — a form of best-of-N sampling with a built-in verifier, requiring no external model.

The simulation above shows the verification-generation asymmetry in action. On the left, a standard transformer generates in one pass — fast but can't improve. On the right, an EBT starts from a random prediction and iteratively refines it by minimizing energy. Watch how the EBT's prediction progressively shifts toward the correct distribution over multiple steps.

Analogy: Think of the EBT as a student who doesn't just blurt out answers (System 1) but has an internal "confidence meter" (energy) and keeps revising until the meter reads "high confidence" (low energy). The more confused the student is initially (high energy), the more revision steps they take. The revision process IS the thinking.
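A minimal sketch of the "keep revising until the meter reads confident" idea, assuming a toy quadratic energy. The `curvature` values, the `threshold`, and the function names are hypothetical; a flat direction in the toy landscape is used here as a crude proxy for a hard prediction, which is not how difficulty arises in a real EBT.

```python
def energy(y, target, curvature):
    # Weighted squared distance; small curvature means a flat, slow direction
    return sum(c * (yi - ti) ** 2 for yi, ti, c in zip(y, target, curvature))


def grad(y, target, curvature):
    return [2.0 * c * (yi - ti) for yi, ti, c in zip(y, target, curvature)]


def think_until_confident(y0, target, curvature, alpha=0.4,
                          threshold=1e-3, max_steps=100):
    """Take thinking steps until the energy drops below `threshold`."""
    y, steps = list(y0), 0
    while energy(y, target, curvature) > threshold and steps < max_steps:
        g = grad(y, target, curvature)
        y = [yi - alpha * gi for yi, gi in zip(y, g)]
        steps += 1
    return y, steps


target = [1.0, 1.0]
_, easy_steps = think_until_confident([0.0, 0.0], target, curvature=[1.0, 1.0])
_, hard_steps = think_until_confident([0.0, 0.0], target, curvature=[1.0, 0.05])
print(easy_steps, hard_steps)
```

The same stopping rule spends a handful of steps on the well-conditioned case and many more on the flat one, so the compute budget follows the energy reading instead of being fixed in advance.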

A concrete numerical example

Let's trace what happens when an EBT predicts the next token for the context "The dog caught the ___". The vocabulary has 50,277 tokens. The prediction y is a distribution over these tokens (a vector of 50,277 logits).

Step 0 (initialization): y₀ ~ N(0, I). All 50,277 logits drawn from a standard Gaussian. This is pure noise: after a softmax, a diffuse distribution with no preference tied to the context.

Step 0 energy: E_θ(x, y₀) = 14.2. Very high. The random prediction is incompatible with the context.

Step 1: Compute ∇_y E_θ(x, y₀). This 50,277-dimensional gradient tells us how to adjust each logit to reduce energy. Update: y₁ = y₀ − α ∇_y E. Energy drops to 8.7.

Step 2: Repeat. y₂ = y₁ − α ∇_y E_θ(x, y₁). Energy drops to 3.1. The distribution is starting to concentrate on plausible tokens: "ball", "frisbee", "stick".

Step 3: y₃ = y₂ − α ∇_y E_θ(x, y₂). Energy drops to 0.9. Strong peak at "ball".

Each step required a full forward AND backward pass through the transformer to compute E and ∇_y E. Three thinking steps = 6 passes (3 forward + 3 backward). That is roughly 6× the compute of a standard transformer's single forward pass, if a backward pass with respect to the prediction costs about as much as a forward pass. But the model gets to THINK, and the result is better.

In an EBT, what does gradient descent optimize at inference time?

The model's parameters (weights)
The prediction itself: the candidate output is iteratively refined to minimize the energy function
The learning rate of the optimizer

Chapter 2: Energy-Based Models Background

Before we dive into the architecture, we need solid footing on Energy-Based Models. The concept has been around for decades — Hopfield networks (1982), Boltzmann machines (1985), contrastive learning — but making them work at scale has been an open problem until now.

What is an energy function?

An energy function E_θ(x, y) is just a neural network that takes two inputs — some context x and a candidate prediction y — and outputs a single scalar. That's it. No softmax. No probability distribution. Just a number.

The interpretation is physical: think of a ball on a hilly landscape. The ball naturally rolls to the lowest point — the minimum energy state. High hills = high energy = unlikely configurations. Low valleys = low energy = likely configurations.

E_θ : X × Y → R

For a given context x, the energy function defines a landscape over all possible predictions y. The ground truth prediction y* should sit at the bottom of a valley — a local (ideally global) minimum of E_θ(x, ·).

From energy to probability (and why we avoid it)

You CAN convert an energy function into a probability distribution using the Boltzmann distribution:

p_θ(y | x) = e^{−E_θ(x, y)} / Z(θ)

where Z(θ) = ∫ e^{−E_θ(x, y)} dy is the partition function — an integral over ALL possible predictions y.

Here's the problem: computing Z(θ) requires integrating over every possible y. For a vocabulary of 50,277 tokens, that's a sum over 50,277 terms (doable). But for continuous predictions like images (256 × 256 × 3 real-valued pixels), that integral is over a space of dimension 196,608. Completely intractable.
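For the discrete case the normalization really is just a finite sum, as the short sketch below shows. The four-token vocabulary and its energy values are made up for illustration.

```python
import math


def boltzmann(energies):
    """Turn a list of energies into probabilities p_i = exp(-E_i) / Z."""
    # Subtracting the minimum energy improves numerical stability and
    # cancels in the ratio, so the probabilities are unchanged.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min)) for e in energies]
    z = sum(weights)  # partition function, up to the factor exp(-e_min)
    return [w / z for w in weights]


# Made-up energies for a four-token vocabulary
energies = {"ball": 0.7, "frisbee": 1.9, "stick": 2.4, "piano": 8.3}
probs = dict(zip(energies, boltzmann(list(energies.values()))))
print(probs)

# Dimension of the continuous image space mentioned in the text
image_dims = 256 * 256 * 3
print(image_dims)
```

The sum in `boltzmann` has one term per token, which is cheap. There is no analogous finite sum for the image case, where the integral runs over every point of a 196,608-dimensional real space.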

Key decision: EBTs use unnormalized EBMs. They never compute the partition function Z. They don't model probabilities at all — just relative energies. The training objective shapes the landscape to have low energy near true data and high energy everywhere else, WITHOUT needing to know the normalization constant. This is what makes EBMs scalable.

Two approaches to training EBMs

Historically, there have been two main approaches, and understanding why one fails at scale is crucial to understanding the EBT paper's contribution:

Approach 1: Contrastive methods. Push down the energy of "positive" (real) samples while pushing up the energy of "negative" (fake) samples. The catch: to push up negative energy everywhere in a high-dimensional space, you need an exponentially growing number of negatives. This is the curse of dimensionality — contrastive methods don't scale.

Approach 2: Optimization-based training. Train the EBM so that gradient descent FROM A RANDOM STARTING POINT converges to the ground truth y*. This implicitly shapes the landscape to have a valley at y* without needing negative samples. This is what EBTs use.

Optimization-based training in detail

Let's derive the training procedure step by step. We have:

An EBM E_θ(x, y) with trainable parameters θ
A training example (x, y*) — context and ground truth prediction
Step size α and number of optimization steps M

Forward pass (energy minimization on the prediction):

Start with a random prediction: ŷ₀ ~ N(0, I)

For i = 0, 1, ..., M − 1:

ŷ_i+1 = ŷ_i − α ∇_y E_θ(x, ŷ_i)

This is gradient descent on the prediction. At each step, we compute the gradient of the energy WITH RESPECT TO y (not θ), and update y to reduce the energy. After M steps, we have a refined prediction ŷ_M.

Backward pass (update model parameters):

Compute a loss comparing ŷ_M to the ground truth y*:

L = J(ŷ_M, y*)

where J can be any standard loss (cross-entropy for discrete, MSE for continuous). Now backpropagate THROUGH THE ENTIRE OPTIMIZATION PROCESS to update θ.

The critical insight: The loss is backpropagated THROUGH all M gradient descent steps. This means the model learns parameters θ such that gradient descent on its energy landscape will lead from random initialization to the correct prediction. The model is not just learning to assign low energy to the right answer — it's learning an energy landscape whose GRADIENT FLOW leads to the right answer.

Why this avoids the curse of dimensionality

Contrastive methods must explicitly push up energy in an exponentially large space. The optimization approach does something more subtle: by training the model so that gradient descent converges to y*, it implicitly regularizes the landscape to have a single local minimum near y*. The landscape is shaped by the optimization dynamics, not by explicit negative sampling.

Think of it like sculpting. Contrastive methods try to chip away at every piece of marble that isn't the statue (exponential work in high dimensions). The optimization approach shapes the landscape so that water (gradient descent) naturally flows to the statue (the answer). You only need to shape the channels, not remove all the marble.

Second-order gradients: the hidden cost

Here's a subtlety that's easy to miss. During the forward pass, we compute ∇_y E_θ — a gradient of E with respect to y. During the backward pass, we differentiate the LOSS (which depends on these gradients) with respect to θ. This is a gradient of a gradient — a second-order derivative.

Computing full Hessians is O(n²) in the number of parameters. But EBTs use Hessian-vector products (HVPs), which can be computed in O(n) time — the same cost as a standard backward pass. In PyTorch, this is implemented via torch.autograd.grad with create_graph=True, which keeps the computation graph alive for a second differentiation.
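To see where the second-order term comes from, here is a hand-worked scalar case, assuming the toy energy E = 0.5 (y − θx)² and a single thinking step. The energy, the numbers, and the function names are all illustrative assumptions; the analytic derivative is checked against a finite difference.

```python
def grad_y_energy(theta, x, y):
    # dE/dy for the toy energy E = 0.5 * (y - theta * x) ** 2
    return y - theta * x


def unrolled_loss(theta, x, y0, y_star, alpha):
    y1 = y0 - alpha * grad_y_energy(theta, x, y0)  # one thinking step
    return (y1 - y_star) ** 2


def analytic_dloss_dtheta(theta, x, y0, y_star, alpha):
    y1 = y0 - alpha * grad_y_energy(theta, x, y0)
    mixed = -x                   # d/dtheta of dE/dy: the second-order term
    dy1_dtheta = -alpha * mixed  # chain rule through the update step
    return 2.0 * (y1 - y_star) * dy1_dtheta


theta, x, y0, y_star, alpha = 0.5, 2.0, 0.0, 3.0, 0.1
eps = 1e-6
numeric = (unrolled_loss(theta + eps, x, y0, y_star, alpha)
           - unrolled_loss(theta - eps, x, y0, y_star, alpha)) / (2 * eps)
analytic = analytic_dloss_dtheta(theta, x, y0, y_star, alpha)
print(numeric, analytic)
```

The loss depends on θ only through ∇_y E, so its derivative needs the mixed derivative of E with respect to θ and y. Automatic differentiation computes the same quantity without writing it out by hand.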

The total cost per training step with M optimization steps is approximately:

FLOPs_EBT ≈ M × (Forward + Backward + Backward) × 2 ≈ M × 10N × 2

where N is the number of non-embedding parameters. Each optimization step therefore costs about 20N FLOPs, roughly 3.33× a standard Transformer++ step (which costs 6N FLOPs). With M = 2 (the paper's default), the total is about 40N, or roughly 6.67×.
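Evaluating the formula as written for a few values of M is a quick arithmetic check. The function names are made up for this sketch, and the parameter count is just the Small model from the size table used as an example input.

```python
def standard_flops(n_params):
    # The 6N figure quoted in the text for a standard training step
    return 6 * n_params


def ebt_flops(n_params, num_steps):
    # M x 10N x 2, read directly off the formula in the text
    return num_steps * 10 * n_params * 2


def cost_ratio(num_steps, n_params=48_800_000):
    return ebt_flops(n_params, num_steps) / standard_flops(n_params)


for m in (1, 2, 3):
    print(m, round(cost_ratio(m), 2))
```

N cancels in the ratio, so the relative cost depends only on the number of optimization steps.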

The simulation above lets you explore an energy landscape. The blue dot is the current prediction, and the red star is the ground truth. Click "Step" to perform one gradient descent step on the prediction. Watch how the prediction moves downhill toward the energy minimum. The landscape is shaped so that the minimum coincides with the ground truth — this is what training achieves.

Why do EBTs use optimization-based training instead of contrastive training?

Contrastive methods require exponentially many negative samples in high-dimensional spaces (curse of dimensionality), while optimization-based training implicitly shapes the landscape through gradient flow, scaling gracefully
Contrastive methods are slower to train
Contrastive methods require supervised labels

Chapter 3: The EBT Architecture

Now that we understand the energy-based paradigm, let's see how to actually build a Transformer that functions as an energy model. This is where EBTs differ from standard transformers in subtle but critical ways.

Two variants: autoregressive and bidirectional

The paper introduces two EBT architectures:

Autoregressive EBT (AR-EBT): A GPT-style decoder-only transformer for next-token prediction. Uses causal attention. This is the primary variant for language modeling experiments.

Bidirectional EBT (Bi-EBT): A BERT-style transformer with full bidirectional attention. Used for image denoising and classification experiments. Built on the DiT (Diffusion Transformer) architecture.

The key difference from standard transformers

In a standard transformer, the model takes context x and produces a prediction in the OUTPUT space — it generates y directly. In an EBT, the model takes BOTH x and a candidate y and produces an energy scalar. This means predictions must live in the INPUT space, not the output space.

Standard Transformer

Input: x = [x₁, ..., x_n] → Output: logits for x_n+1 (prediction in output space)

AR-EBT

Input: [x₁, ..., x_n, ŷ_n+1] → Output: E_θ(x, ŷ) scalar (candidate prediction in input space)

This difference has profound consequences for the attention mechanism.

The attention challenge in autoregressive EBTs

In a standard causal transformer with n tokens, the attention mask is lower-triangular: token i can attend to tokens 1 through i. After the causal mask, the n × n attention scores matrix looks like:

scores = [α_i,j] where α_i,j = 0 for j > i

For an EBT, we have n past (context) tokens z₁, ..., z_n and we're predicting future tokens ŷ_n+1, ..., ŷ_n+k. The attention matrix must be (n+k) × (n+1) and satisfy special constraints:

Rule 1: Each past token z_i attends to all previous past tokens z₁, ..., z_i (standard causal attention).

Rule 2: Each predicted future token ŷ_j attends to ALL past tokens z₁, ..., z_n (it can see the full context).

Rule 3: Each predicted future token ŷ_j attends to ITSELF (to incorporate its own representation), but NOT to other predicted tokens.

Rule 3 is the tricky part. In standard attention, the diagonal comes for free from the Q K^T computation. But here, each predicted token ŷ_j is a DIFFERENT prediction (different gradient descent trajectory), so the "self-attention" on the diagonal can't be computed with a single matrix multiply.

Why this matters for parallelism: During training, you want to predict ALL next tokens in parallel (like standard language model training). But each predicted token ŷ_j has its own evolving representation that can't share information with other predicted tokens. The paper solves this with a clever masking scheme that separates the attention into two computations: one for past tokens (standard) and one for predicted tokens (requires extracting and replacing the superdiagonal).
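The three rules can be written down as a boolean mask. This sketch lays the mask out as a square (n+k) × (n+k) grid for readability, which is a choice made here and not the paper's parallel-training layout; `ebt_attention_mask` is a hypothetical helper.

```python
def ebt_attention_mask(n_context, k_predicted):
    """Boolean mask: mask[i][j] is True when position i may attend to position j.

    Positions 0..n-1 are context tokens, positions n..n+k-1 are predicted tokens.
    """
    size = n_context + k_predicted
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n_context:
                # Rule 1: context tokens use ordinary causal attention
                mask[i][j] = j <= i
            else:
                # Rule 2: predicted tokens see every context token
                # Rule 3: and themselves, but no other predicted token
                mask[i][j] = j < n_context or j == i
    return mask


mask = ebt_attention_mask(n_context=3, k_predicted=2)
rows = ["".join("x" if allowed else "." for allowed in row) for row in mask]
print("\n".join(rows))
```

In the printed grid the top-left block is the familiar lower triangle, and each predicted row has the full context plus a single entry on its own diagonal position.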

The efficient attention implementation

The paper splits the sequence into two groups:

zⁿ₁ — the first n − 1 elements = original context tokens
z_p — the last n − 1 elements = predicted future states

For the context tokens, attention is computed normally:

Attention(Q_o, K_o, V_o)_zⁿ₁ = softmax(Q_o K_o^T / √d_k) V_o

For the predicted tokens, the paper computes separate Q_p, K_p, V_p matrices and constructs the attention in four steps:

Compute unnormalized scores: Q_p K_o^T / √d_k (predicted queries attending to context keys)
Extract the superdiagonal (each predicted token's self-attention score) using: self_attention = sum(Q_p * K_p, dim=head_dim) — a Hadamard product, not a matrix multiply
Replace the superdiagonal in the scores matrix with these self-attention values
Apply softmax and multiply by [V_o ; V_p] to get updated representations

This scheme effectively doubles the sequence length (from n to 2n) but thanks to the efficient masking, the total FLOPs are approximately 2× a standard transformer, not 4×.
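The second of the four steps relies on a small identity: the row-wise sum of an elementwise product gives exactly the diagonal of Q_p K_p^T, without forming the full matrix. A plain-Python check, with made-up 2-dimensional vectors and no heads or √d_k scaling:

```python
def matmul_transposed(q, k):
    """Return Q K^T for two lists of row vectors."""
    return [[sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
            for q_row in q]


def rowwise_dot(q, k):
    """Elementwise product summed over the feature axis: one score per row."""
    return [sum(a * b for a, b in zip(q_row, k_row))
            for q_row, k_row in zip(q, k)]


q_p = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
k_p = [[0.0, 1.0], [2.0, 2.0], [1.0, 4.0]]

full = matmul_transposed(q_p, k_p)    # 3 x 3 scores, most of them unused
self_scores = rowwise_dot(q_p, k_p)   # only the 3 scores that are needed
diagonal = [full[i][i] for i in range(len(full))]
print(self_scores, diagonal)
```

For k predicted tokens the row-wise version does k dot products instead of k², which is why the self-attention scores can be computed cheaply and then written into the scores matrix.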

The energy head

After the transformer blocks, the EBT needs to produce a single energy scalar. For autoregressive EBTs:

Last hidden state of each predicted token

h_n+1, ..., h_n+k ∈ R^d

↓

Linear projection

e_j = W_energy h_j + b ∈ R (one scalar per predicted token)

↓

Sum energies

E_θ(x, ŷ) = ∑_j e_j

The energy is the sum of per-token energy scalars. This means each token contributes independently to the total energy, allowing the model to identify WHICH tokens are problematic (high individual energy) — this is what enables the uncertainty visualizations in the paper's results.
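A minimal sketch of the energy head as described: one linear projection per predicted token, then a sum. The hidden states and weights are made-up numbers, and `energy_head` is a hypothetical name.

```python
def energy_head(hidden_states, w_energy, bias=0.0):
    """Project each predicted token's hidden state to a scalar, then sum."""
    per_token = [sum(w * h for w, h in zip(w_energy, h_j)) + bias
                 for h_j in hidden_states]
    return per_token, sum(per_token)


# Three predicted tokens with 3-dimensional hidden states (illustrative values)
hidden = [[0.2, -0.1, 0.4], [1.5, 0.3, -0.2], [0.0, 0.1, 0.1]]
w = [1.0, 2.0, -1.0]

per_token, total = energy_head(hidden, w)
worst = max(range(len(per_token)), key=per_token.__getitem__)
print(per_token, total, worst)
```

Keeping the per-token scalars around, and not only their sum, is what lets the highest-energy position be singled out.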

Full architecture diagram

The diagram above shows the complete data flow of an autoregressive EBT. Context tokens (green) flow through normal causal attention. Predicted tokens (orange) are initialized randomly and attend to all context tokens plus themselves. Each transformer block updates both representations. The final energy head produces a scalar per predicted token, which are summed into a total energy.

Model sizes

Size	Non-Embed Params	# Layers	Embed Dim	# Heads
XXS	6.18M	6	384	6
XS	12.4M	12	384	6
Small	48.8M	12	768	12
Medium	176M	24	1024	16
Large	396M	24	1536	16
XL	708M	24	2048	32

The architecture follows the Llama 2 / Transformer++ recipe: RMSNorm, RoPE position embeddings, SwiGLU activation, no bias terms. The only additions are the energy head and the modified attention scheme for predicted tokens.

Why can't each predicted token attend to other predicted tokens in an autoregressive EBT?

Each predicted token has its own evolving representation from a separate gradient descent trajectory; sharing information between predicted tokens would break the independence needed for parallel training
It would make the model too slow
The attention mask is already too large

Chapter 4: Training EBTs

We've seen the architecture. Now the hard part: actually training these models to produce well-shaped energy landscapes. This chapter covers the training algorithms in detail, including the GAN-like duality that makes EBTs work.

Algorithm 1: Training

Let's walk through a single training step, line by line. We have a training example (x, y*), the EBM E_θ, step size α, number of optimization steps M, and a loss function J.

# Algorithm 1: EBT Training
import torch

def train_step(x, y_star, E_theta, optimizer, alpha, M, loss_fn):
    # Step 1: Initialize prediction from random noise (y_star must be a float tensor)
    y_hat = torch.randn_like(y_star).requires_grad_(True)  # y_0 ~ N(0, I)

    # Step 2: Run M gradient descent steps on the prediction
    for i in range(M):
        # Sum so the energy is a single scalar even for a batch
        energy = E_theta(x, y_hat).sum()
        # Gradient of energy w.r.t. prediction (NOT params);
        # create_graph=True keeps this step differentiable
        grad_y = torch.autograd.grad(
            energy, y_hat, create_graph=True
        )[0]
        y_hat = y_hat - alpha * grad_y

    # Step 3: Compute loss against ground truth
    loss = loss_fn(y_hat, y_star)

    # Step 4: Backpropagate through EVERYTHING
    # (including all M gradient steps) to update theta
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

The crucial detail is create_graph=True in the torch.autograd.grad call. Without this flag, PyTorch does not record the gradient computation itself, so grad_y is treated as a constant and the loss cannot be backpropagated through the optimization steps. With it, the entire M-step optimization trajectory is part of the computation graph.

The GAN analogy: During the FORWARD pass (steps 1-2), the EBM acts as a GAN discriminator — it evaluates the compatibility of the candidate prediction with the context. During the BACKWARD pass (steps 3-4), the optimization process acts as a GAN generator — the gradients flow through the optimization steps to update θ so that gradient descent produces predictions closer to y*. Unlike GANs, the verifier and generator are the SAME model, avoiding adversarial instability.

Worked example: one training step

Let's trace through with concrete numbers. Suppose we're training on a tiny vocabulary of 5 tokens: {cat, dog, the, runs, fast}. The context is "the dog ___" and the ground truth next token is "runs" (one-hot: [0, 0, 0, 1, 0]).

Step 0: ŷ₀ = [0.3, −0.7, 1.2, −0.4, 0.8] (random Gaussian). Energy: E = 12.5.

Gradient: ∇_y E = [0.1, −0.3, 0.6, −0.9, 0.5]. The model "knows" it should push probability toward index 3 (runs) — that dimension has the most negative gradient.

Update (alpha=0.5): ŷ₁ = [0.3 − 0.05, −0.7 + 0.15, 1.2 − 0.3, −0.4 + 0.45, 0.8 − 0.25] = [0.25, −0.55, 0.9, 0.05, 0.55]

Step 1 energy: E = 6.8. Better, but still high.

Step 2: Another gradient step. ŷ₂ = [0.1, −0.3, 0.4, 0.8, 0.2]. Energy: 2.1. Now "runs" has the highest logit.

Loss: Cross-entropy between softmax(ŷ₂) ≈ [0.16, 0.11, 0.22, 0.33, 0.18] and one-hot [0, 0, 0, 1, 0]. Loss = −log(0.33) ≈ 1.11.

Backprop: This loss gradient flows backward THROUGH both gradient descent steps AND the energy function to update θ. The model learns to shape E_θ so that future gradient descent starting from random noise will converge more quickly to "runs".
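The arithmetic in this trace can be recomputed with a few lines of standard-library Python. The vectors are copied from the steps above; ŷ₂ is taken as given and not derived, since the second gradient is not listed.

```python
import math


def softmax(logits):
    m = max(logits)  # shift by the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# First update: y1 = y0 - alpha * grad
y0 = [0.3, -0.7, 1.2, -0.4, 0.8]
grad = [0.1, -0.3, 0.6, -0.9, 0.5]
alpha = 0.5
y1 = [y - alpha * g for y, g in zip(y0, grad)]
print([round(v, 2) for v in y1])

# Cross-entropy after the second step
y2 = [0.1, -0.3, 0.4, 0.8, 0.2]  # taken from the text, not recomputed
probs = softmax(y2)
loss = -math.log(probs[3])       # index 3 is "runs"
print([round(p, 2) for p in probs], round(loss, 2))
```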

The S1 vs S2 training modes

The paper discovers an important distinction between two training configurations:

S1 (System 1) models: Optimized for stability and learning convergence. The gradients of predictions are DETACHED between optimization steps — backpropagation doesn't flow through the entire chain. Simpler, more stable, but weaker thinking capabilities.

S2 (System 2) models: Full backpropagation through all optimization steps (no detaching). Include all energy landscape regularization techniques (next chapter). More expensive, but enable genuine System 2 Thinking.

Property	S1 Models	S2 Models
Gradient flow	Detached between steps	Full backprop through all steps
Replay buffer	No	Yes
Langevin dynamics	No	Yes
Random step size	No	Yes
Training stability	Higher	Lower (more careful tuning)
Learning scaling rate	Baseline	~3.3% faster
Thinking ability	Limited	Strong (up to 29% improvement)

The paper finds that S1 and S2 models have similar scaling rates: the S2 scaling curve is shifted up (higher initial loss but a similar slope). This means you can choose S1 for pure pretraining and switch to S2 when you need thinking capabilities, without losing scaling behavior.

The simulation above lets you watch EBT training in action. The energy landscape (blue curve) starts flat and gradually develops a minimum at the ground truth location (red star). Each training iteration updates θ so that the landscape's gradient flow points more strongly toward the target. Toggle between S1 and S2 modes to see the difference in landscape regularization.

Implementation detail: learnable step size

The optimization step size α has a major impact on training. Too large: predictions overshoot and oscillate. Too small: convergence is slow and the model wastes optimization steps. The paper makes α learnable — it's a trainable parameter that the model optimizes alongside its weights. For S2 models, the step size is multiplied by 1500 relative to S1, which the paper finds is necessary for the full backpropagation to produce useful gradients.

What does create_graph=True enable in EBT training?

It preserves the computation graph when computing gradients, allowing backpropagation THROUGH the gradient descent steps to update model parameters θ
It enables GPU acceleration
It allows the model to use mixed precision training

Chapter 5: Shaping the Energy Landscape

Training an EBM to have well-shaped energy landscapes in high-dimensional space is like trying to sculpt a mountain range where every valley leads to the correct answer. It's hard. The paper identifies three regularization techniques that are critical for enabling System 2 Thinking. Each one addresses a specific failure mode.

Problem 1: The landscape has dead zones

Without regularization, the energy landscape tends to be well-shaped only in a narrow region around the ground truth. Start gradient descent from far away, and you hit flat regions where gradients vanish — the optimization gets stuck. This means thinking for more steps doesn't help.

Solution: Replay Buffer. Instead of always initializing predictions from random noise N(0, I), maintain a buffer of partially-optimized predictions from previous training steps. Occasionally initialize FROM THESE rather than from scratch. This forces the model to learn landscapes that are well-shaped even far from the optimum, because the training loss depends on successfully navigating from these diverse starting points.

# Replay buffer implementation
import random
import torch

replay_buffer = []

def get_initial_prediction(y_star, buffer, p_replay=0.5):
    if buffer and random.random() < p_replay:
        # Start from a previous partial optimization
        # (assumes stored predictions have the same shape as y_star)
        y_0 = buffer.pop(random.randrange(len(buffer)))
    else:
        # Start from random noise
        y_0 = torch.randn_like(y_star)
    return y_0

def store_prediction(y_hat_M, buffer):
    # After optimization, store the result, detached from the graph
    buffer.append(y_hat_M.detach())

Problem 2: Optimization gets trapped in local minima

Even with a replay buffer, gradient descent is deterministic — from a given starting point, it always follows the same path. If there's a local minimum between the starting point and the global minimum, the optimization gets trapped.

Solution: Langevin Dynamics. Add noise to each gradient step:

ŷ_i+1 = ŷ_i − α ∇_y E_θ(x, ŷ_i) + η_i, η_i ~ N(0, σ)

This is similar in spirit to simulated annealing or the noise in stochastic gradient descent: the noise lets the optimization escape local minima by occasionally going "uphill." Noise does not guarantee that the global minimum is found, but with a suitable noise level the optimization is less likely to stay trapped in a shallow local minimum.

Connection to physics: Langevin dynamics was originally developed to describe the motion of particles in a fluid — a particle drifts due to forces (the gradient term) and is buffeted by random molecular collisions (the noise term). In EBTs, the "particle" is the prediction, the "force" is the energy gradient, and the "collisions" are exploration noise. Higher temperature (σ) = more exploration. Lower temperature = more exploitation.
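The noisy update can be tried on a one-dimensional toy landscape. This is a sketch under stated assumptions: the double-well energy, `alpha`, `sigma`, and the seeds are all invented for illustration, and whether a particular noisy run crosses the barrier depends on those choices.

```python
import random


def grad_energy(y):
    # Gradient of the toy double-well energy E(y) = y**4 - 2*y**2 + 0.3*y,
    # which has a shallow well near y = 0.96 and a deeper one near y = -1.04
    return 4.0 * y ** 3 - 4.0 * y + 0.3


def langevin_descent(y0, alpha=0.02, sigma=0.0, steps=500, seed=0):
    """Gradient descent on the prediction with optional Gaussian noise per step."""
    rng = random.Random(seed)
    y = y0
    for _ in range(steps):
        noise = rng.gauss(0.0, sigma) if sigma > 0 else 0.0
        y = y - alpha * grad_energy(y) + noise
    return y


plain = langevin_descent(1.5)  # sigma = 0: ordinary gradient descent
noisy = [langevin_descent(1.5, sigma=0.15, seed=s) for s in range(5)]
print(round(plain, 3), [round(v, 3) for v in noisy])
```

Without noise, the run started at 1.5 settles in the shallower well and stays there no matter how many steps it takes. With noise, each seed follows a different trajectory, which is the source of diversity the text describes.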

Problem 3: Optimization follows the same path every time

With a fixed step size α, gradient descent always takes the same-size steps. This means different training examples always explore the landscape at the same resolution, potentially missing important features.

Solution: Random Step Size and Random Number of Steps. Instead of fixed α and M, randomize both during training:

α ~ Uniform(α_min, α_max), M ~ Uniform(1, M_max)

This creates diversity in the optimization trajectories during training. Some trajectories take many small steps (fine exploration). Others take few large steps (coarse exploration). The model must learn a landscape that works well for ALL these trajectories, making it more robust at inference time.
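Sampling the two quantities per training step takes only a couple of lines. The bounds below are placeholder values, not the paper's settings, and `sample_trajectory_config` is a hypothetical helper.

```python
import random


def sample_trajectory_config(alpha_min, alpha_max, m_max, rng=random):
    """Draw one (alpha, M) pair for a single training step."""
    alpha = rng.uniform(alpha_min, alpha_max)  # continuous step size
    num_steps = rng.randint(1, m_max)          # integer count, inclusive on both ends
    return alpha, num_steps


rng = random.Random(0)  # seeded only to make the sketch reproducible
configs = [sample_trajectory_config(0.1, 1.0, m_max=4, rng=rng)
           for _ in range(1000)]
print(configs[:3])
```

Each training step would then run its inner optimization loop with the sampled pair in place of a fixed α and M.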

Ablation: every technique matters

The paper provides an ablation study on the BigBench Dyck Languages benchmark (an out-of-distribution reasoning task). Results with "Thinking Longer" (more optimization steps) and "Self-Verification" (best-of-N with energy):

Configuration	Thinking Longer (%↑)	Thinking + Self-Verification (%↑)
No Random Step Size	−1.47	0.19
No Random Num Steps	0.00	9.65
No Langevin Dynamics	17.2	17.0
No Replay Buffer	14.8	17.8
Full S2 Config	7.19	18.7

The surprising finding: Removing Langevin Dynamics IMPROVES thinking-longer performance (17.2 vs 7.19) but HURTS self-verification (17.0 vs 18.7). Without noise, the landscape has sharper minima — great for single-path optimization but bad for exploration. With noise, the landscape is smoother — worse for single-path but better for generating diverse candidates for verification. The full S2 config trades single-path performance for better self-verification, which is the stronger thinking mode.

Note the first row: without random step size, thinking longer actually HURTS performance (−1.47%). This means the model learned a landscape that only works for a specific step size. Randomization during training is critical for robust thinking at inference time.

[Interactive simulation] The energy landscape is shown with three regularization controls: Replay Buffer (toggles starting points), Langevin Dynamics (adds noise to steps), and Random Step Size (varies step magnitude). With no regularization, gradient descent gets stuck in a local minimum. Enabling each technique changes the optimization behavior, and with all three enabled the prediction reliably reaches the global minimum.

Why does removing Langevin Dynamics improve "thinking longer" but hurt "self-verification"?

Without noise, the landscape has sharper minima that benefit single-path optimization (thinking longer) but harm multi-candidate exploration (self-verification); noise creates smoother landscapes that support diverse candidate generation
Langevin Dynamics makes the model slower
Noise corrupts the energy values used for verification

Chapter 6: System 2 Thinking

We've built the architecture. We've trained it. Now let's see how EBTs actually "think" at inference time. This is where the payoff lives.

Two thinking strategies

The paper explores two complementary approaches to inference-time thinking:

Strategy 1: Thinking Longer. Run more optimization steps on a single prediction. Each step refines the prediction by following the energy gradient. More steps = more refinement = better prediction. This is like a student spending more time checking and revising a single answer.

Strategy 2: Self-Verification (Best-of-N). Generate N independent predictions (each starting from different random initializations), optimize each for M steps, then select the prediction with the LOWEST energy. This is like a student writing N different answers and submitting the one they're most confident about.

Algorithm 2: Inference with self-verification

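# Algorithm 2: Inference with Verification
import torch

def infer(x, E_theta, alpha, M, N, prediction_dim):
    # E_theta(x, y) is assumed to return a scalar energy tensor.
    best_y, best_energy = None, float('inf')

    for j in range(N):  # N independent samples
        # Fresh random start; track gradients with respect to the prediction
        y_hat = torch.randn(prediction_dim, requires_grad=True)

        for i in range(M):  # M optimization steps
            energy = E_theta(x, y_hat)
            grad_y = torch.autograd.grad(energy, y_hat)[0]
            # Gradient step on the prediction; detach so each step starts a new graph
            y_hat = (y_hat - alpha * grad_y).detach().requires_grad_(True)

        # Self-verification: keep the lowest energy prediction
        with torch.no_grad():
            final_energy = E_theta(x, y_hat).item()
        if final_energy < best_energy:
            best_energy = final_energy
            best_y = y_hat.detach()

    return best_y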

The key insight: the energy scalar serves DOUBLE DUTY. During optimization, it provides the gradient signal. After optimization, it provides the verification signal. No external verifier needed. No reward model. No fine-tuning. The model is its own critic.

Thinking Longer: results on language

The paper tests thinking longer on four Out-of-Distribution (OOD) benchmarks. Standard Transformer++ models cannot benefit from more forward passes — they make the same prediction every time (deterministic feed-forward). EBTs show up to 29% improvement with more thinking steps:

The visualization above shows the "Thinking Longer" experiment. The orange line (Transformer++) is flat — no improvement from additional forward passes, because the model can't revise its predictions. The blue curve (EBT) improves steadily as the number of forward passes increases. The X-axis is the number of forward passes (each requiring a full forward + backward through the transformer). The Y-axis is perplexity decrease on OOD tasks (lower is better).
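The effect of extra steps can be reproduced on a toy problem. The sketch below assumes a hand-written quadratic energy whose minimum sits at y = 2x, standing in for a learned energy function; the names and numbers are illustrative only.

```python
def energy(x, y):
    # Toy stand-in for E_theta: lowest when the prediction y equals 2 * x.
    return (y - 2.0 * x) ** 2

def energy_grad_y(x, y):
    # Analytic gradient of the toy energy with respect to y.
    return 2.0 * (y - 2.0 * x)

def think(x, y_init, alpha, num_steps):
    # Refine one prediction and record the energy after every step.
    y = y_init
    energies = [energy(x, y)]
    for _ in range(num_steps):
        y = y - alpha * energy_grad_y(x, y)
        energies.append(energy(x, y))
    return y, energies

y_short, e_short = think(x=1.5, y_init=0.0, alpha=0.1, num_steps=2)
y_long, e_long = think(x=1.5, y_init=0.0, alpha=0.1, num_steps=20)
print(e_short[-1], e_long[-1])  # the longer run ends at lower energy
```

A feed-forward model has no analogue of num_steps: calling it again returns the same output, which is why its curve stays flat.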

Self-Verification: built-in Best-of-N

Traditional Best-of-N (BoN) sampling with language models requires a SEPARATE verifier or reward model. EBTs have one built in:

Generate N = 5 candidates

ŷ₁, ..., ŷ₅ (each optimized from random init for M steps)

↓

Score each with energy

E(x, ŷ₁) = 2.3, E(x, ŷ₂) = 0.8, E(x, ŷ₃) = 3.1, E(x, ŷ₄) = 1.5, E(x, ŷ₅) = 1.1

↓

Select minimum

y* = ŷ₂ (energy 0.8 — most compatible with context)
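The selection step is just an argmin over the candidates' energies. A small sketch using the five example energies above, with the candidate labels as plain strings:

```python
# Energies of the five optimized candidates from the example above.
energies = {"y1": 2.3, "y2": 0.8, "y3": 3.1, "y4": 1.5, "y5": 1.1}

# Self-verification: pick the candidate with the lowest energy.
best = min(energies, key=energies.get)
print(best, energies[best])  # y2 0.8
```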

This scales with training data: at small scale (5B tokens trained), BoN-10 barely improves over BoN-2 (and sometimes hurts due to adversarial low-energy samples). At larger scale (30B tokens), BoN-10 gives significant gains, because the energy landscape becomes smoother and more reliable.

Uncertainty estimation for free

An unexpected benefit: per-token energy values reveal which tokens the model is uncertain about. Easy tokens (e.g., "the", "is", "a") converge to low energy quickly. Hard tokens (e.g., "brown", "research", "problem") maintain higher energy across steps.

The paper visualizes this with energy heatmaps across tokens and thinking steps. Common, predictable tokens show green (low energy) after 1-2 steps. Rare, context-dependent tokens remain yellow-red (high energy) even after 10 steps. This happens without any explicit uncertainty training — it emerges naturally from the energy formulation.

Why OOD performance improves MORE: The paper finds a striking linear trend: as data becomes more out-of-distribution, thinking helps MORE. On in-distribution data, thinking gives modest gains. On highly OOD data, thinking gives large gains. This mirrors human cognition — System 2 Thinking is most valuable precisely when problems are unfamiliar.

The thinking-generalization connection

Perhaps the paper's most profound finding: EBTs generalize better than Transformer++ even WITHOUT thinking at inference time. Despite having slightly worse pretraining perplexity (33.43 vs 31.36), EBTs achieve lower downstream perplexity on 3 out of 4 benchmarks (EBT vs Transformer++, lower is better: GSM8K 43.3 vs 49.6, SQuAD 53.1 vs 52.3, BB Math QA 72.6 vs 79.8, BB Dyck 125.3 vs 131.5); SQuAD is the exception. With thinking, the gap widens further.

Why? The paper hypothesizes it's because verification generalizes better than generation. A verifier trained on in-distribution data can still correctly assess predictions on OOD data, because checking correctness is often independent of how the data was generated.

Why does System 2 Thinking help MORE on out-of-distribution data?

On familiar (in-distribution) data, System 1 (single forward pass) already works well; on unfamiliar data, iterative refinement via energy minimization enables the model to explore and verify, compensating for lack of training-time exposure
OOD data requires more parameters
The model uses a different architecture for OOD data

Chapter 7: Scaling Laws

Scaling laws are among the most useful indicators of an architecture's future potential. If a model shows good scaling behavior at small size, it is more likely to be competitive at large size. The EBT paper's most compelling quantitative contribution is demonstrating that EBTs scale FASTER than Transformer++ across ALL measured axes.

What is a scaling rate?

Following the Chinchilla framework, the paper models loss as a power law:

L(C) = β C^−α + E

where L is the loss, C is the compute budget (FLOPs), β is a constant, α is the scaling exponent (higher = faster improvement), and E is the irreducible entropy of the data.

In log-log space, subtracting E, this becomes a line:

log(L − E) ≈ −α log(C) + log(β)

The slope α is what matters. A steeper slope means the model extracts more performance from each additional unit of compute. The paper compares the slopes of EBTs vs. Transformer++ across six independent axes.
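To make the log-log line concrete, here is a minimal sketch that recovers α and β by ordinary least squares. It assumes the irreducible term E is already known and uses synthetic, noise-free points generated from made-up constants; none of the numbers come from the paper.

```python
import math

def fit_power_law(compute, loss, irreducible):
    # Least-squares line through (log C, log(L - E)); returns (alpha, beta).
    xs = [math.log(c) for c in compute]
    ys = [math.log(l - irreducible) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)

# Synthetic points generated from hypothetical constants.
true_alpha, true_beta, true_e = 0.3, 50.0, 1.7
compute = [1e15, 1e16, 1e17, 1e18]
loss = [true_beta * c ** (-true_alpha) + true_e for c in compute]
alpha, beta = fit_power_law(compute, loss, true_e)
print(round(alpha, 3), round(beta, 1))
```

On this noise-free data the fit should return the constants used to generate it. With real measurements E is unknown too, so it would have to be fitted jointly rather than assumed.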

Learning scalability (6 axes)

The paper conducts scaling experiments on RedPajama V2 text (66M training, 33K validation samples) with the GPT-NeoX tokenizer (50,277 vocab). Results across all six axes:

Scaling Axis	EBT Faster By	How Measured
Data (# tokens)	35.98%	Validation perplexity vs. training tokens (1B-30B)
Batch size	28.66%	Val PPL vs. batch size (4K-48K tokens)
Depth (# layers)	3.29%	Val PPL vs. transformer depth (2-14 blocks)
Parameters (non-embed)	8.97%	Val PPL vs. total non-embedding parameters (6M-396M)
FLOPs	8.97%	Val PPL vs. training FLOPs
Width (embed dim)	0.62%	Val PPL vs. embedding dimension (384-2048)

The standout: 35.98% faster scaling on DATA is enormous. It means EBTs extract more learning from each additional token of training data. This matters because high-quality training data is becoming scarce, and a more data-efficient model needs less of it. The implication: if the trend holds, EBTs could have a significant advantage at the 15T-token scale of modern foundation models.

Note that the width scaling advantage is only 0.62%, nearly identical to Transformer++. This makes sense: width primarily affects the model's representational capacity, which is orthogonal to the energy-based training paradigm. In the table, the big wins are in data and batch size scaling; the depth advantage is a modest 3.29%.

Understanding the scaling curves

[Interactive chart] Scaling laws for EBTs (blue) vs. Transformer++ (orange) across the six axes. Both curves follow power laws in log-log space, but the EBT line has a consistently steeper slope. The gap between the lines grows with scale, so larger models show bigger advantages for EBTs.

Thinking scalability

Beyond learning scalability, the paper also measures thinking scalability — how much improvement thinking gives as a function of model scale.

The key metric is System Two Thinking (STT):

STT(x, θ, F) = E_x[ P(x, θ, F) / P(x, θ, F₀) − 1 ]

where P(x, θ, F) is the performance with F function evaluations (forward passes), and F₀ is the minimum number of evaluations. STT measures the percentage improvement from thinking.
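A minimal sketch of the metric, assuming P is a higher-is-better score measured per example and the expectation is a plain average. The per-example numbers are made up for illustration.

```python
def stt(perf_with_thinking, perf_baseline):
    # Mean relative improvement over the baseline, across examples.
    ratios = [p / p0 - 1.0 for p, p0 in zip(perf_with_thinking, perf_baseline)]
    return sum(ratios) / len(ratios)

# Hypothetical per-example scores where higher means better.
baseline = [0.50, 0.40, 0.80]   # P(x, theta, F0)
thinking = [0.55, 0.50, 0.80]   # P(x, theta, F)
print(f"{stt(thinking, baseline):.1%}")  # 11.7%
```

For a lower-is-better measure such as perplexity, the ratio would need to be inverted (or the sign flipped) for a positive STT to mean improvement.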

Results show:

Self-verification capabilities improve with scale (from 4% to 14% improvement as training data increases from 5B to 30B tokens)
Projected to Llama 3 scale (15T tokens), self-verification could yield 100-1000% improvement — though this extrapolation is speculative
EBTs become LESS adversarial with scale — at small scale, BoN-10 sometimes hurts vs. BoN-2; at large scale, BoN-10 consistently helps

Video scaling

The paper also tests on continuous modalities using Something-Something V2 (video prediction). Scaling results:

Width (embedding dimension): EBTs scale 33.66% faster
Parameters (non-embedding): EBTs scale 34.28% faster

These large advantages in continuous modalities may be because EBTs naturally model continuous distributions through their energy landscape, while standard transformers must discretize continuous data (e.g., via Vector Quantization) or use proxy objectives (e.g., MSE loss), losing information in the process.

Data efficiency + generalization: EBTs' better data scaling is linked to their better generalization. The energy landscape paradigm provides a more efficient inductive bias: instead of memorizing input-output mappings, the model learns a COMPATIBILITY function that generalizes to new input-output pairs. This is analogous to how a spell-checker (verifier) works on any text, while a text-generator needs exposure to each specific writing style.

Which scaling axis shows the largest advantage for EBTs over Transformer++?

Data scaling (35.98% faster): EBTs extract more learning from each additional token, making them significantly more data-efficient
Width scaling (0.62% faster)
Depth scaling (3.29% faster)

Chapter 8: Results

The paper presents results across three domains: autoregressive language modeling, bidirectional image denoising, and data-constrained Sudoku reasoning. Let's examine each with the critical eye of a researcher.

Language modeling results

All language models are pretrained on RedPajama V2 using the GPT-NeoX tokenizer. Downstream evaluation uses four benchmarks of increasing difficulty. All values are perplexities, so lower is better:

Benchmark	Task	Transformer++ (PPL↓)	EBT (PPL↓)	Winner
Pretraining	Language modeling (RedPajama V2)	31.36	33.43	Transformer++
GSM8K	Math word problems	49.6	43.3	EBT
SQuAD	Reading comprehension	52.3	53.1	Transformer++
BB Math QA	Math reasoning	79.8	72.6	EBT
BB Dyck	Bracket matching (OOD)	131.5	125.3	EBT

The generalization paradox: EBTs have WORSE pretraining perplexity (33.43 vs 31.36) but BETTER downstream performance on most tasks. This means EBTs are not simply better at memorizing the training distribution — they learn more generalizable representations. The paper attributes this to the verification paradigm: a verifier trained on in-distribution data can assess OOD predictions, while a generator trained on in-distribution data can only produce in-distribution outputs.
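Because every number here is a perplexity, the winner on each benchmark is simply the model with the lower value. A quick tally over the four downstream benchmarks, using the figures quoted in this guide:

```python
# (Transformer++ perplexity, EBT perplexity) as quoted in the text.
results = {
    "GSM8K": (49.6, 43.3),
    "SQuAD": (52.3, 53.1),
    "BB Math QA": (79.8, 72.6),
    "BB Dyck": (131.5, 125.3),
}

# Lower perplexity wins.
winners = {
    name: "EBT" if ebt < tpp else "Transformer++"
    for name, (tpp, ebt) in results.items()
}
ebt_wins = sum(1 for w in winners.values() if w == "EBT")
print(winners)
print(ebt_wins, "of", len(results))  # 3 of 4
```

SQuAD is the one benchmark where Transformer++ comes out ahead, which matches the "3 out of 4" count.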

Image denoising results

For continuous modalities, the paper trains bidirectional EBTs on COCO 2014 (128×128 images) using the SD-XL VAE for latent encoding. The comparison is against Diffusion Transformers (DiTs):

Metric	DiT	EBT	EBT Advantage
In-Dist PSNR ↑	26.58	27.25	+0.67 dB
In-Dist MSE Pixel ↓	142.98	122.55	−14.3%
OOD PSNR ↑	19.56	23.29	+3.73 dB
OOD MSE Pixel ↓	718.7	305.2	−57.5%
ImageNet Top-1 Acc ↑	0.31%	5.32%	~17×
ImageNet Top-5 Acc ↑	1.36%	13.2%	~10×
Forward passes needed	100-300	1-3	99% fewer
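The "EBT Advantage" column can be re-derived from the two value columns. A short check, using only the numbers in the table above:

```python
def percent_change(old, new):
    # Signed relative change from old to new, in percent.
    return (new - old) / old * 100.0

# (DiT, EBT) values copied from the table above.
in_dist_psnr = (26.58, 27.25)
in_dist_mse = (142.98, 122.55)
ood_psnr = (19.56, 23.29)
ood_mse = (718.7, 305.2)
top1 = (0.31, 5.32)
top5 = (1.36, 13.2)

print(round(in_dist_psnr[1] - in_dist_psnr[0], 2))  # PSNR gain in dB
print(round(percent_change(*in_dist_mse), 1))       # MSE change in percent
print(round(ood_psnr[1] - ood_psnr[0], 2))          # OOD PSNR gain in dB
print(round(percent_change(*ood_mse), 1))           # OOD MSE change in percent
print(round(top1[1] / top1[0], 1))                  # Top-1 accuracy ratio
print(round(top5[1] / top5[0], 1))                  # Top-5 accuracy ratio
```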

The image results are striking in three ways:

1. Better denoising with 99% fewer passes. EBTs need 1-3 forward passes for comparable-or-better results vs. DiTs' 100-300 denoising steps. This is because EBTs directly minimize energy (a scalar), while DiTs predict noise at each step (a high-dimensional output).

2. Massively better OOD denoising. The OOD PSNR gap (+3.73 dB) is huge in image quality terms. This confirms the verification generalization hypothesis from the language experiments.

3. 10-17× better image classification. Linear probe classification accuracy on ImageNet-1K shows EBTs learn dramatically more useful representations. DiTs learn to predict noise; EBTs learn to understand images.

Why such better representations? DiTs are trained to predict noise — a randomly sampled Gaussian — added to the image. Their representations are optimized for noise estimation, not image understanding. EBTs are trained to verify compatibility between noisy and clean images in the INPUT SPACE. Their representations must capture what makes an image "correct" — which requires deeper understanding of image structure.

Sudoku: data-constrained reasoning

To test generalization in data-limited settings, the paper follows Du et al. (2024) in training models on Sudoku from the SAT-Net / RRN datasets. Given a partially filled board (some cells already contain digits 1-9), predict the complete solution.

Architecture	Test Accuracy
Feed-Forward Transformer	0.03%
RNN	17.7%
EBT	29.7%

The feed-forward transformer essentially fails (0.03%). It can't do multi-step constraint reasoning in a single pass. The RNN does better by iterating its hidden state, but still struggles. The EBT, by optimizing its prediction to satisfy all Sudoku constraints simultaneously (each constraint contributing to the energy), achieves 29.7%, about 68% better than the RNN in relative terms and 990× better than the transformer.
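The relative gaps quoted above follow directly from the accuracy table. A short check:

```python
# Test accuracies in percent, copied from the table above.
transformer, rnn, ebt = 0.03, 17.7, 29.7

relative_gain_over_rnn = (ebt / rnn - 1.0) * 100.0
ratio_to_transformer = ebt / transformer

print(round(relative_gain_over_rnn, 1))  # relative gain over the RNN, in percent
print(round(ratio_to_transformer))       # how many times the transformer's accuracy
```

The first figure is a relative improvement; in absolute terms the EBT is 12 percentage points ahead of the RNN.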

What EBTs CANNOT do (limitations)

The paper is commendably honest about limitations:

Multimodal distributions: On COCO image generation (many modes per caption), EBTs produce blurry images — the energy landscape merges multiple modes into one average. This is a fundamental limitation of single-minimum optimization.
Computational cost: Each optimization step requires a full forward+backward pass. With M=2 steps, EBTs are ~3.33× more expensive per token during training.
Scale: The largest model is 708M parameters (800M total). Modern foundation models are 10-100× larger. The scaling trends are promising but unverified at large scale.
Training stability: S2 models with >3 optimization steps became unstable during training. Future work may need better second-order optimization to extend thinking depth.

[Interactive simulation] EBT vs. DiT on image denoising. A DiT progressively removes noise over many steps, while an EBT minimizes energy in just a few steps and converges to a cleaner result. The gap widens at higher noise levels (more OOD), where EBTs degrade more gracefully.

Why do EBTs learn better image representations than Diffusion Transformers despite both operating in the same latent space?

DiTs learn to predict random noise (a low-level task), while EBTs learn to verify image-level compatibility (requiring higher-level understanding of image structure)
EBTs use a larger model
EBTs are trained on more data

Chapter 9: Connections

Cheat sheet: every key equation

Equation	What It Does	Symbols
E_θ(x, y) → R	Energy function: assigns compatibility score	x = context, y = candidate prediction, θ = model params
ŷ_{i+1} = ŷ_i − α ∇_y E_θ(x, ŷ_i)	Gradient descent on predictions (thinking step)	α = step size, ∇_y = gradient w.r.t. prediction
ŷ_{i+1} = ŷ_i − α ∇_y E_θ(x, ŷ_i) + η, η ~ N(0, σ)	Langevin dynamics (thinking with exploration)	σ = noise magnitude, η = exploration noise
p_θ(y|x) = e^{−E_θ(x,y)} / Z(θ)	Boltzmann distribution (theoretical, not computed)	Z(θ) = partition function (intractable)
L(C) = β C^−α + E	Scaling law: loss vs. compute	α = scaling exponent, C = FLOPs, E = irreducible entropy
STT(x, θ, F) = E_x[P(x,θ,F)/P(x,θ,F₀) − 1]	System 2 Thinking metric	F = forward passes, F₀ = minimum forward passes
FLOPs_EBT ≈ M × 10N × 2	Per-step compute cost (M optimization steps)	N = non-embedding params, M = optimization steps

EBTs vs. Diffusion Models: a detailed comparison

EBTs and diffusion models are closely related — the paper argues that diffusion models are a special case of (implicit) EBMs. The key differences:

Property	Diffusion Models	EBTs
Supervision	At every timestep (noise prediction)	Only at the end (final prediction vs. target)
Update rule	Predict noise, follow denoising schedule	Gradient descent on energy (free-form)
Verification	Implicit (no energy scalar)	Explicit (energy scalar)
# Steps	Fixed schedule (100-1000)	Dynamic (1-N, any number)
Self-verification	Requires external model	Built-in (compare energies)
Discrete data	Requires discretization tricks	Works natively
Uncertainty	Not directly modeled	Energy = uncertainty

The connection is deepened by the paper's observation that both diffusion models and EBMs predict the gradient of the data density. Diffusion models learn ∇_x log p(x_t | x₀) (score function). EBMs learn ∇_y E_θ(x, y) (energy gradient). Both use these gradients to iteratively refine predictions.
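The link between the two gradients can be made explicit using the Boltzmann form from the cheat sheet. A short derivation, assuming the partition function does not depend on y:

```latex
\log p_\theta(y \mid x) = -E_\theta(x, y) - \log Z(\theta)
\quad\Longrightarrow\quad
\nabla_y \log p_\theta(y \mid x) = -\nabla_y E_\theta(x, y)
```

So the score of the modeled distribution is the negative energy gradient, and a gradient descent step on the energy is a gradient ascent step on the log-density.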

The Reversal Curse connection

The paper makes a fascinating connection to the "Reversal Curse" in LLMs — the phenomenon where models trained on "A is B" fail to learn "B is A." In standard transformers, only A's tokens receive gradient updates during the prediction of B (because B is in the output space). In EBTs, BOTH A and B are in the input space, so both receive gradient updates. This could fundamentally resolve the asymmetric learning problem.

Related work in context

Work	Relation to EBTs
Hopfield Networks (1982)	Energy-based model for associative memory; EBTs scale this concept with transformers
GANs (2014)	Separate generator+discriminator; EBTs unify them in one model
Du & Mordatch (2019)	Pioneered optimization-based EBM training; EBTs scale it to transformers
DiT (2023)	Transformer backbone for diffusion; EBT bidirectional variant builds on DiT
o1 / DeepSeek-R1	System 2 via RL + verifiers; EBTs achieve it via energy minimization alone
LeCun's JEPA (2022)	Energy-based architecture for autonomous intelligence; EBTs are a concrete realization
Scaling Laws (Kaplan 2020, Chinchilla 2022)	EBTs follow and improve upon established scaling law frameworks

Open questions and future directions

Scale: Will the 35% faster scaling hold at 100B+ parameters? The paper's largest model is 708M. The scaling TRENDS suggest yes, but this is unverified.
Multimodal EBTs: A single energy scalar could represent compatibility across text, images, and actions simultaneously — a natural unified objective for multimodal learning.
World models: EBTs trained on state-action pairs could serve as implicit world models, with energy minimization generating action sequences that reach desired states.
Recurrent EBTs: Combining EBTs with state-space models (Mamba) for latency-sensitive applications where Transformers' O(n²) attention is prohibitive.
Training stability at depth: The paper could only train with 2-3 optimization steps. Extending to 10+ steps (deeper thinking during training) requires advances in second-order optimization.
FLOP efficiency: At inference, EBTs lag feed-forward transformers by 3-6× in FLOPs per token. Hardware-aware optimizations (fused kernels for HVPs) could narrow this gap.

The big picture

EBTs represent a fundamental rethinking of how neural networks should work. Instead of the feedforward paradigm (input → one pass → output), EBTs propose an optimization paradigm (input + guess → iterate until convergence → verified output). This aligns with how humans actually think: we don't arrive at answers in one step. We hypothesize, check, revise, and iterate.

The paper demonstrates that this paradigm is not just theoretically appealing but practically competitive — scaling faster than the dominant Transformer++ approach while adding capabilities (verification, dynamic computation, uncertainty estimation) that current models fundamentally lack.

The deepest implication: If verification truly is easier than generation, then the entire field may be doing things the hard way. Training models to generate directly (softmax over vocabulary) is asking them to solve a harder problem than necessary. Training them to verify (scalar energy score) and generate by optimization may be the more natural, and more scalable, approach. EBTs are early concrete evidence for this, though only at the scales tested so far.

What is the fundamental difference between how EBTs and diffusion models receive supervision during training?

Diffusion models receive supervision at every timestep (predicting noise at each denoising step), while EBTs receive supervision only at the END of the optimization process (comparing the final prediction to the target)
Diffusion models use unsupervised learning while EBTs use supervised learning
There is no difference: both use the same loss function

Energy-Based Transformers are Scalable Learners and Thinkers