Gladstone, Nanduru, Islam et al. — UVA / UIUC / Harvard, ICLR 2026 Oral

Energy-Based Transformers are Scalable Learners and Thinkers

What if your model didn't just predict — but verified its predictions by minimizing energy? EBTs assign a scalar energy to every input-prediction pair and think by gradient descent on that landscape, enabling System 2 reasoning from unsupervised learning alone.

Prerequisites: Transformers + Energy-Based Models + Gradient Descent + Scaling Laws
10 Chapters · 8+ Simulations

Chapter 0: The Problem

You're a student taking a math exam. You write down an answer to a problem — but before moving on, you pause. You check your work. You substitute your answer back into the original equation. Does it satisfy the constraints? If not, you revise. You keep checking until you're confident.

This is System 2 Thinking: the slow, deliberate, effortful reasoning that psychologist Daniel Kahneman popularized in contrast to System 1 (fast, automatic, intuitive). System 1 is catching a ball. System 2 is computing 17 × 24 in your head.

Modern AI has gotten remarkably good at System 1. A GPT-class model sees "The dog caught the ___" and instantly outputs "ball" in a single feed-forward pass. No checking. No revision. Just pattern matching at light speed.

But what happens when the problem is hard? When the model encounters something it's never seen during training? When the correct answer requires multi-step reasoning?

The fundamental mismatch: Current feed-forward transformers allocate the SAME amount of computation to every prediction — whether it's predicting "the" after "at" (trivial) or solving a novel math problem (hard). Humans allocate MORE effort to harder problems. Feed-forward transformers cannot.

Recent attempts at System 2 Thinking in AI (o1, DeepSeek-R1, Chain of Thought) have shown exciting results, but they share critical limitations:

| Approach | Dynamic Compute? | Self-Verification? | Modality | Requires Supervision? |
| --- | --- | --- | --- | --- |
| Chain of Thought | Partial (more tokens) | No | Text only | No, but unreliable |
| o1 / DeepSeek-R1 | Yes (more tokens) | Implicit | Text only | Yes (RL + verifiers) |
| Best-of-N Sampling | Yes | Requires external verifier | Text only | Yes (trained verifier) |
| Diffusion Models | Yes (more steps) | No explicit verification | Continuous only | No |
| EBTs (this paper) | Yes (per prediction) | Yes (energy scalar) | Both discrete & continuous | No (unsupervised) |

Notice the bottom row. EBTs are the only approach that has all four properties: dynamic computation per prediction, built-in self-verification, works across modalities, and requires no additional supervision beyond standard pretraining.

That's an extraordinary claim. To understand why it works, we need to step back and ask: what does it mean for a model to "verify" its own predictions?

Two facets of human thinking

The paper identifies two cognitive facets that current models mostly lack:

Facet 1: Dynamic Allocation of Computation. Deciding whether to change careers takes more thought than deciding what to eat for lunch. Humans naturally allocate varying amounts of effort depending on difficulty. Feed-forward transformers use the same depth (same FLOPs) for every single prediction. RNNs and diffusion models can vary computation, but lack explicit verification.

Facet 2: Verification of Predictions. Humans don't just generate answers — they check them. A student doesn't just write "x = 5" and move on; they substitute back into the equation to verify. Standard transformers have no mechanism to assess the quality of their own outputs. They predict in the output space, not in a space where the prediction can be compared against the input.

The canvas above shows the key architectural difference. A standard autoregressive transformer maps context to a prediction in one pass. An EBT takes a candidate prediction, feeds it alongside the context, and outputs an energy scalar — a single number indicating how compatible this prediction is with the context. Lower energy = better match.

Now prediction becomes optimization: start with a random guess, compute the energy, take the gradient of energy with respect to the prediction, update the prediction to lower the energy, and repeat. Each iteration is a "thinking step." More steps = more thinking = better prediction.
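The loop just described can be sketched on a toy problem. The quadratic energy below is a hypothetical stand-in for a trained EBT (the "correct" prediction is simply 2x), and the names `toy_energy`, `toy_energy_grad`, and `think` are illustrative, not taken from the paper's code.

```python
def toy_energy(x, y):
    """Hypothetical energy: lowest when the prediction y equals 2 * x."""
    return (y - 2.0 * x) ** 2


def toy_energy_grad(x, y):
    """Analytic gradient of toy_energy with respect to the prediction y."""
    return 2.0 * (y - 2.0 * x)


def think(x, y_init, alpha=0.25, steps=5):
    """Refine a guess by gradient descent on the prediction, not on weights."""
    y = y_init
    energies = [toy_energy(x, y)]
    for _ in range(steps):
        y = y - alpha * toy_energy_grad(x, y)  # one "thinking step"
        energies.append(toy_energy(x, y))
    return y, energies


y_final, energies = think(x=3.0, y_init=-1.0)
print(y_final, energies)
```

With this step size the distance to the minimum halves at every step, so the recorded energies fall monotonically. A real EBT replaces the analytic gradient with automatic differentiation through the transformer.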

The core question of this paper: "Can we rely entirely on unsupervised learning to develop System 2 Thinking?" The answer is yes — by learning a verifier (the energy function) instead of a generator, and then generating by optimizing with respect to the verifier.
Why can't standard feed-forward transformers perform System 2 Thinking?

Chapter 1: The Key Insight

Here's a fundamental asymmetry that has been hiding in plain sight for decades: verification is easier than generation.

Think about it. Given a completed Sudoku puzzle, you can verify it in seconds: check each row, column, and 3×3 block. But solving a Sudoku from scratch? That takes minutes of careful reasoning. Given a proof, a mathematician can verify each step far more easily than they could have discovered the proof in the first place. Given a candidate solution to an NP-complete problem, you can check it in polynomial time, but no polynomial-time algorithm for finding one is known.
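The asymmetry is easy to see in code. The sketch below uses subset sum, a classic NP-complete problem, purely as an illustration: checking a proposed subset takes a single pass, while the naive search may try every subset. The function names are hypothetical.

```python
from itertools import combinations


def verify_subset(numbers, subset, target):
    """Cheap check: is `subset` drawn from `numbers`, and does it sum to target?"""
    remaining = list(numbers)
    for value in subset:
        if value not in remaining:
            return False
        remaining.remove(value)
    return sum(subset) == target


def find_subset(numbers, target):
    """Brute-force search: may try up to 2**len(numbers) subsets."""
    for size in range(len(numbers) + 1):
        for candidate in combinations(numbers, size):
            if sum(candidate) == target:
                return list(candidate)
    return None


numbers = [3, 34, 4, 12, 5, 2]
solution = find_subset(numbers, 9)
print(solution, verify_subset(numbers, solution, 9))
```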

This asymmetry is well-known in complexity theory (P vs NP), cryptography (verifying a signature vs. forging one), and everyday life (proofreading vs. writing). Yet current AI models do the HARD thing — they generate directly in one shot — and never attempt the EASY thing: verification.

The paradigm shift: Instead of training a model to GENERATE predictions (hard), train it to VERIFY compatibility between inputs and candidate predictions (easier). Then generate by optimizing candidates to minimize the verifier's energy. The verifier is the model; generation is just optimization.

This is the core idea of Energy-Based Models (EBMs). An EBM is a function Eθ(x, y) that takes an input x (context) and a candidate prediction y, and outputs a single scalar: the energy. Low energy means "y is a good prediction given x." High energy means "y is incompatible with x."

The verification-generation duality

Let's make this concrete with next-token prediction. In a standard transformer:

Input
Context tokens: "The dog caught the"
Forward pass (one shot)
Softmax over vocabulary → probability for each token
Output
Most likely token: "ball" (p = 0.43)

In an EBT:

Input
Context: "The dog caught the" + Candidate: [random token distribution]
Forward pass
Energy scalar: E = 8.3 (high → bad match)
Gradient ∇y E
Direction to update the candidate to lower energy
Update candidate
y ← y − α ∇y E  (gradient descent on prediction)
↓ repeat N times
Converged
Energy: E = 0.7 (low → good match). Final prediction: "ball"

The EBT did something remarkable: it ran gradient descent on its own predictions. Not on the model weights — those are frozen at inference time. On the predictions themselves. Each iteration is a "thinking step" where the model refines its guess to better match the context.

Why this enables System 2 Thinking

This formulation naturally gives us both cognitive facets:

Dynamic computation (Facet 1): Easy predictions converge in 2-3 steps. Hard predictions need 10+ steps. The model automatically allocates more computation to harder problems, because harder problems have more complex energy landscapes that take longer to optimize.

Self-verification (Facet 2): The energy value itself IS the verification. After thinking, the model can check: "Is this energy low enough? Should I keep thinking?" It can also generate multiple candidates and pick the one with the lowest energy — a form of best-of-N sampling with a built-in verifier, requiring no external model.
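Both facets can be pictured together as a stopping rule. The sketch below assumes a hypothetical rule (stop once the energy falls below a threshold) and a toy quadratic energy whose curvature stands in for problem difficulty. The paper does not prescribe this particular rule.

```python
def think_until_confident(target, curvature, y_init=0.0, alpha=0.5,
                          energy_threshold=1e-3, max_steps=100):
    """Run gradient descent on a toy energy until it drops below a threshold.

    Energy: E(y) = 0.5 * curvature * (y - target)**2. A small curvature gives
    a flat, slowly converging landscape, standing in for a "hard" prediction.
    Returns the final prediction and the number of update steps used.
    """
    y = y_init
    for step in range(max_steps):
        energy = 0.5 * curvature * (y - target) ** 2
        if energy < energy_threshold:
            return y, step  # confident enough: stop thinking
        y = y - alpha * curvature * (y - target)
    return y, max_steps


_, easy_steps = think_until_confident(target=1.0, curvature=1.0)
_, hard_steps = think_until_confident(target=1.0, curvature=0.1)
print(easy_steps, hard_steps)
```

The flat landscape needs noticeably more steps to reach the same energy threshold, which is the toy analogue of spending more computation on harder predictions.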

The simulation above shows the verification-generation asymmetry in action. On the left, a standard transformer generates in one pass — fast but can't improve. On the right, an EBT starts from a random prediction and iteratively refines it by minimizing energy. Watch how the EBT's prediction progressively shifts toward the correct distribution over multiple steps.

Analogy: Think of the EBT as a student who doesn't just blurt out answers (System 1) but has an internal "confidence meter" (energy) and keeps revising until the meter reads "high confidence" (low energy). The more confused the student is initially (high energy), the more revision steps they take. The revision process IS the thinking.

A concrete numerical example

Let's trace what happens when an EBT predicts the next token for the context "The dog caught the ___". The vocabulary has 50,277 tokens. The prediction y is a distribution over these tokens (a vector of 50,277 logits).

Step 0 (initialization): y0 ~ N(0, I). All 50,277 logits drawn from a standard Gaussian. This is pure noise — essentially uniform over the vocabulary.

Step 0 energy: Eθ(x, y0) = 14.2. Very high. The random prediction is incompatible with the context.

Step 1: Compute ∇y Eθ(x, y0). This 50,277-dimensional gradient tells us how to adjust each logit to reduce energy. Update: y1 = y0 − α ∇y E. Energy drops to 8.7.

Step 2: Repeat. y2 = y1 − α ∇y Eθ(x, y1). Energy drops to 3.1. The distribution is starting to concentrate on plausible tokens: "ball", "frisbee", "stick".

Step 3: y3 = y2 − α ∇y Eθ(x, y2). Energy drops to 0.9. Strong peak at "ball".

Each step required a full forward AND backward pass through the transformer to compute E and ∇y E. Three thinking steps = 6 passes (3 forward + 3 backward), compared with a standard transformer's single forward pass. Because a backward pass typically costs about twice as much as a forward pass, the compute is closer to 9× than 6×. But the model gets to THINK, and the result is better.

In an EBT, what does gradient descent optimize at inference time?

Chapter 2: Energy-Based Models Background

Before we dive into the architecture, we need solid footing on Energy-Based Models. The concept has been around for decades — Hopfield networks (1982), Boltzmann machines (1985), contrastive learning — but making them work at scale has been an open problem until now.

What is an energy function?

An energy function Eθ(x, y) is just a neural network that takes two inputs — some context x and a candidate prediction y — and outputs a single scalar. That's it. No softmax. No probability distribution. Just a number.

The interpretation is physical: think of a ball on a hilly landscape. The ball naturally rolls to the lowest point — the minimum energy state. High hills = high energy = unlikely configurations. Low valleys = low energy = likely configurations.

Eθ : X × Y → R

For a given context x, the energy function defines a landscape over all possible predictions y. The ground truth prediction y* should sit at the bottom of a valley — a local (ideally global) minimum of Eθ(x, ·).

From energy to probability (and why we avoid it)

You CAN convert an energy function into a probability distribution using the Boltzmann distribution:

pθ(y | x) = exp(−Eθ(x, y)) / Z(θ)

where Z(θ) = ∫ exp(−Eθ(x, y)) dy is the partition function, an integral over ALL possible predictions y.

Here's the problem: computing Z(θ) requires integrating over every possible y. For a vocabulary of 50,277 tokens, that's a sum over 50,277 terms (doable). But for continuous predictions like images (256 × 256 × 3 real-valued pixels), that integral is over a space of dimension 196,608. Completely intractable.
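For a small discrete set the normalization really is just a sum. The sketch below applies the Boltzmann formula to made-up energies for four candidate tokens (the energies and the helper name `boltzmann` are illustrative), then prints the dimensionality that makes the continuous case intractable.

```python
import math


def boltzmann(energies):
    """Turn a list of energies into probabilities p_i = exp(-E_i) / Z."""
    lowest = min(energies)
    # Subtracting the minimum keeps exp() well-behaved; it cancels in the ratio.
    weights = [math.exp(-(e - lowest)) for e in energies]
    z = sum(weights)  # partition function, up to the cancelled constant
    return [w / z for w in weights]


# Hypothetical energies for four candidate tokens (lower = more compatible).
candidate_energies = {"ball": 0.7, "frisbee": 1.5, "stick": 2.0, "piano": 9.0}
probs = dict(zip(candidate_energies, boltzmann(list(candidate_energies.values()))))
print(probs)

# For a 256 x 256 RGB image the sum becomes an integral over this many dimensions.
image_dims = 256 * 256 * 3
print(image_dims)
```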

Key decision: EBTs use unnormalized EBMs. They never compute the partition function Z. They don't model probabilities at all — just relative energies. The training objective shapes the landscape to have low energy near true data and high energy everywhere else, WITHOUT needing to know the normalization constant. This is what makes EBMs scalable.

Two approaches to training EBMs

Historically, there have been two main approaches, and understanding why one fails at scale is crucial to understanding the EBT paper's contribution:

Approach 1: Contrastive methods. Push down the energy of "positive" (real) samples while pushing up the energy of "negative" (fake) samples. The catch: to push up negative energy everywhere in a high-dimensional space, you need an exponentially growing number of negatives. This is the curse of dimensionality — contrastive methods don't scale.

Approach 2: Optimization-based training. Train the EBM so that gradient descent FROM A RANDOM STARTING POINT converges to the ground truth y*. This implicitly shapes the landscape to have a valley at y* without needing negative samples. This is what EBTs use.

Optimization-based training in detail

Let's derive the training procedure step by step. We have:

Forward pass (energy minimization on the prediction):

Start with a random prediction: ŷ0 ~ N(0, I)

For i = 0, 1, ..., M − 1:

ŷi+1 = ŷi − α ∇y Eθ(x, ŷi)

This is gradient descent on the prediction. At each step, we compute the gradient of the energy WITH RESPECT TO y (not θ), and update y to reduce the energy. After M steps, we have a refined prediction ŷM.

Backward pass (update model parameters):

Compute a loss comparing ŷM to the ground truth y*:

L = J(ŷM, y*)

where J can be any standard loss (cross-entropy for discrete, MSE for continuous). Now backpropagate THROUGH THE ENTIRE OPTIMIZATION PROCESS to update θ.

The critical insight: The loss is backpropagated THROUGH all M gradient descent steps. This means the model learns parameters θ such that gradient descent on its energy landscape will lead from random initialization to the correct prediction. The model is not just learning to assign low energy to the right answer — it's learning an energy landscape whose GRADIENT FLOW leads to the right answer.

Why this avoids the curse of dimensionality

Contrastive methods must explicitly push up energy in an exponentially large space. The optimization approach does something more subtle: by training the model so that gradient descent converges to y*, it implicitly regularizes the landscape to have a single local minimum near y*. The landscape is shaped by the optimization dynamics, not by explicit negative sampling.

Think of it like sculpting. Contrastive methods try to chip away at every piece of marble that isn't the statue (exponential work in high dimensions). The optimization approach shapes the landscape so that water (gradient descent) naturally flows to the statue (the answer). You only need to shape the channels, not remove all the marble.

Second-order gradients: the hidden cost

Here's a subtlety that's easy to miss. During the forward pass, we compute ∇y Eθ — a gradient of E with respect to y. During the backward pass, we differentiate the LOSS (which depends on these gradients) with respect to θ. This is a gradient of a gradient — a second-order derivative.

Computing full Hessians is O(n²) in the number of parameters. But EBTs use Hessian-vector products (HVPs), which can be computed in O(n) time — the same cost as a standard backward pass. In PyTorch, this is implemented via torch.autograd.grad with create_graph=True, which keeps the computation graph alive for a second differentiation.
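To see where the second-order term comes from, consider the simplest case of a single optimization step (M = 1), with ŷ0 sampled independently of θ and α treated as a constant. This is a worked sketch from the definitions in this chapter, not a derivation taken from the paper.

```latex
\begin{aligned}
\hat{y}_1 &= \hat{y}_0 - \alpha \,\nabla_y E_\theta(x, \hat{y}_0),
\qquad L = J(\hat{y}_1, y^*), \\[4pt]
\nabla_\theta L
 &= \left(\frac{\partial \hat{y}_1}{\partial \theta}\right)^{\!\top} v
 = -\alpha \left(\frac{\partial}{\partial \theta}\,\nabla_y E_\theta(x, \hat{y}_0)\right)^{\!\top} v,
\qquad v = \nabla_{\hat{y}_1} J(\hat{y}_1, y^*), \\[4pt]
 &= -\alpha \,\nabla_\theta \!\left[\, v^{\top} \nabla_y E_\theta(x, \hat{y}_0) \right]
 \quad \text{(with } v \text{ held constant)}.
\end{aligned}
```

The last line is a mixed second derivative of E contracted with the vector v, so it can be evaluated with one extra backward pass and never requires the full matrix of second derivatives. For M > 1 the earlier predictions also depend on θ, so products with the Hessian of E with respect to y appear as well.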

The total cost per training step with M optimization steps is approximately:

FLOPs_EBT ≈ M × (Forward + Backward + Backward) × 2 ≈ M × 10N × 2

where N is the number of non-embedding parameters. Each optimization step therefore costs about 20N FLOPs, roughly 3.33× a standard Transformer++ training step (6N FLOPs). With M = 2 (the paper's default), the total is about 40N, or roughly 6.67×.
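Plugging numbers into the estimate is a quick sanity check. The helper names below are hypothetical, and the split into 2N FLOPs for a forward pass and 4N for each backward pass is the standard approximation assumed to underlie the 10N figure.

```python
def transformer_train_flops(n_params):
    """Standard estimate: forward 2N + backward 4N = 6N FLOPs."""
    return 6 * n_params


def ebt_train_flops(n_params, num_steps):
    """Evaluate M x (Forward + Backward + Backward) x 2 with 2N / 4N / 4N costs."""
    forward, backward = 2 * n_params, 4 * n_params
    per_step = (forward + backward + backward) * 2  # trailing x2 from the formula
    return num_steps * per_step


n = 176_000_000  # example parameter count, chosen only for illustration
ratios = {}
for m in (1, 2, 3):
    ratios[m] = ebt_train_flops(n, m) / transformer_train_flops(n)
    print(m, round(ratios[m], 2))
```

The ratio is independent of N, so the same multipliers apply at every model size under this estimate.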

The simulation above lets you explore an energy landscape. The blue dot is the current prediction, and the red star is the ground truth. Click "Step" to perform one gradient descent step on the prediction. Watch how the prediction moves downhill toward the energy minimum. The landscape is shaped so that the minimum coincides with the ground truth — this is what training achieves.

Why do EBTs use optimization-based training instead of contrastive training?

Chapter 3: The EBT Architecture

Now that we understand the energy-based paradigm, let's see how to actually build a Transformer that functions as an energy model. This is where EBTs differ from standard transformers in subtle but critical ways.

Two variants: autoregressive and bidirectional

The paper introduces two EBT architectures:

Autoregressive EBT (AR-EBT): A GPT-style decoder-only transformer for next-token prediction. Uses causal attention. This is the primary variant for language modeling experiments.

Bidirectional EBT (Bi-EBT): A BERT-style transformer with full bidirectional attention. Used for image denoising and classification experiments. Built on the DiT (Diffusion Transformer) architecture.

The key difference from standard transformers

In a standard transformer, the model takes context x and produces a prediction in the OUTPUT space — it generates y directly. In an EBT, the model takes BOTH x and a candidate y and produces an energy scalar. This means predictions must live in the INPUT space, not the output space.

Standard Transformer
Input: x = [x1, ..., xn] → Output: logits for xn+1 (prediction in output space)
vs
AR-EBT
Input: [x1, ..., xn, ŷn+1] → Output: Eθ(x, ŷ) scalar (candidate prediction in input space)

This difference has profound consequences for the attention mechanism.

The attention challenge in autoregressive EBTs

In a standard causal transformer with n tokens, the attention mask is lower-triangular: token i can attend to tokens 1 through i. After the causal mask, the n × n attention scores matrix looks like:

scores = [αi,j] where αi,j = 0 for j > i

For an EBT, we have n past (context) tokens z1, ..., zn and we're predicting future tokens ŷn+1, ..., ŷn+k. The attention matrix must be (n+k) × (n+1) and satisfy special constraints:

Rule 1: Each past token zi attends to all previous past tokens z1, ..., zi (standard causal attention).

Rule 2: Each predicted future token ŷj attends to ALL past tokens z1, ..., zn (it can see the full context).

Rule 3: Each predicted future token ŷj attends to ITSELF (to incorporate its own representation), but NOT to other predicted tokens.

Rule 3 is the tricky part. In standard attention, the diagonal comes for free from the Q Kᵀ computation. But here, each predicted token ŷj is a DIFFERENT prediction (different gradient descent trajectory), so the "self-attention" on the diagonal can't be computed with a single matrix multiply.

Why this matters for parallelism: During training, you want to predict ALL next tokens in parallel (like standard language model training). But each predicted token ŷj has its own evolving representation that can't share information with other predicted tokens. The paper solves this with a clever masking scheme that separates the attention into two computations: one for past tokens (standard) and one for predicted tokens (requires extracting and replacing the superdiagonal).
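The three rules can be written down as a boolean mask. The sketch below follows the rules exactly as stated in this section, using a square layout over the combined sequence [z1..zn, ŷ1..ŷk] for readability. The paper's actual tensor layout may differ, and `ebt_attention_mask` is a hypothetical helper.

```python
def ebt_attention_mask(n_context, n_predicted):
    """Boolean mask over the combined sequence [z_1..z_n, y_1..y_k].

    mask[i][j] is True when position i may attend to position j.
    """
    total = n_context + n_predicted
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        if i < n_context:
            # Rule 1: a context token attends causally to context tokens 0..i.
            for j in range(i + 1):
                mask[i][j] = True
        else:
            # Rule 2: a predicted token sees the whole context...
            for j in range(n_context):
                mask[i][j] = True
            # Rule 3: ...plus itself, but no other predicted token.
            mask[i][i] = True
    return mask


mask = ebt_attention_mask(n_context=3, n_predicted=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid the top-left block is the familiar lower triangle, while the bottom-right block is purely diagonal: predicted tokens never see each other.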

The efficient attention implementation

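The paper splits the sequence into two groups: the context tokens, with projections Qo, Ko, Vo, and the predicted tokens, with projections Qp, Kp, Vp.

For the context tokens z1, ..., zn, attention is computed normally:

Attention(Qo, Ko, Vo) = softmax(Qo Koᵀ / √dk) Vo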

For the predicted tokens, the paper computes separate Qp, Kp, Vp matrices and constructs the attention in four steps:

  1. Compute unnormalized scores: Qp Koᵀ / √dk (predicted queries attending to context keys)
  2. Extract the superdiagonal (each predicted token's self-attention score) using: self_attention = sum(Qp * Kp, dim=head_dim) — a Hadamard product, not a matrix multiply
  3. Replace the superdiagonal in the scores matrix with these self-attention values
  4. Apply softmax and multiply by [Vo ; Vp] to get updated representations
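Step 2 works because only the diagonal entries of Qp Kpᵀ are needed, and each of those is a single dot product between a token's own query and key. The sketch below checks that equivalence on made-up values (scaling by √dk is omitted, and the helper names are illustrative).

```python
def matmul_transpose(q, k):
    """Full score matrix Q K^T for lists of row vectors."""
    return [[sum(a * b for a, b in zip(q_row, k_row)) for k_row in k] for q_row in q]


def rowwise_dot(q, k):
    """Self-scores only: sum(Q * K, dim=-1), one dot product per token."""
    return [sum(a * b for a, b in zip(q_row, k_row)) for q_row, k_row in zip(q, k)]


# Hypothetical queries/keys for 3 predicted tokens with head dimension 2.
q_p = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
k_p = [[0.0, 1.0], [2.0, 2.0], [-1.0, 4.0]]

full = matmul_transpose(q_p, k_p)
diagonal = [full[i][i] for i in range(len(full))]
self_scores = rowwise_dot(q_p, k_p)
print(diagonal, self_scores)
```

For k predicted tokens the elementwise version computes k dot products instead of k², which is where the saving comes from.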

This scheme effectively doubles the sequence length (from n to 2n) but thanks to the efficient masking, the total FLOPs are approximately 2× a standard transformer, not 4×.

The energy head

After the transformer blocks, the EBT needs to produce a single energy scalar. For autoregressive EBTs:

Last hidden state of each predicted token
hn+1, ..., hn+k ∈ Rd
Linear projection
ej = Wenergy hj + b ∈ R (one scalar per predicted token)
Sum energies
Eθ(x, ŷ) = ∑j ej

The energy is the sum of per-token energy scalars. This means each token contributes independently to the total energy, allowing the model to identify WHICH tokens are problematic (high individual energy) — this is what enables the uncertainty visualizations in the paper's results.
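A minimal sketch of this head, using plain lists and made-up numbers: each predicted token's hidden state is projected to a scalar, the scalars are summed, and the largest one points at the token the model is least happy with. The weights and hidden states are hypothetical.

```python
def energy_head(hidden_states, weight, bias=0.0):
    """Project each predicted token's hidden state to a scalar, then sum."""
    per_token = [
        sum(w * h for w, h in zip(weight, hidden)) + bias
        for hidden in hidden_states
    ]
    return per_token, sum(per_token)


# Hypothetical final hidden states for 3 predicted tokens (d = 4).
hidden_states = [
    [0.2, -0.1, 0.4, 0.0],
    [1.5, 0.3, -0.2, 0.8],
    [0.0, 0.1, 0.1, -0.3],
]
weight = [1.0, 0.5, -1.0, 2.0]

per_token, total = energy_head(hidden_states, weight)
worst = max(range(len(per_token)), key=per_token.__getitem__)
print(per_token, total, worst)
```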

Full architecture diagram

The diagram above shows the complete data flow of an autoregressive EBT. Context tokens (green) flow through normal causal attention. Predicted tokens (orange) are initialized randomly and attend to all context tokens plus themselves. Each transformer block updates both representations. The final energy head produces a scalar per predicted token, which are summed into a total energy.

Model sizes

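| Size | Non-Embed Params | # Layers | Embed Dim | # Heads |
| --- | --- | --- | --- | --- |
| XXS | 6.18M | 6 | 384 | 6 |
| XS | 12.4M | 12 | 384 | 6 |
| Small | 48.8M | 12 | 768 | 12 |
| Medium | 176M | 24 | 1024 | 16 |
| Large | 396M | 24 | 1536 | 16 |
| XL | 708M | 24 | 2048 | 32 |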

The architecture follows the Llama 2 / Transformer++ recipe: RMSNorm, RoPE position embeddings, SwiGLU activation, no bias terms. The only additions are the energy head and the modified attention scheme for predicted tokens.

Why can't each predicted token attend to other predicted tokens in an autoregressive EBT?

Chapter 4: Training EBTs

We've seen the architecture. Now the hard part: actually training these models to produce well-shaped energy landscapes. This chapter covers the training algorithms in detail, including the GAN-like duality that makes EBTs work.

Algorithm 1: Training

Let's walk through a single training step, line by line. We have a training example (x, y*), the EBM Eθ, step size α, number of optimization steps M, and a loss function J.

# Algorithm 1: EBT Training
import torch

def train_step(x, y_star, E_theta, alpha, M, loss_fn, optimizer):
    optimizer.zero_grad()  # clear gradients left over from the previous step

    # Step 1: Initialize prediction from random noise
    y_hat = torch.randn_like(y_star)  # y_0 ~ N(0, I)
    y_hat.requires_grad_(True)

    # Step 2: Run M gradient descent steps on the prediction
    for i in range(M):
        energy = E_theta(x, y_hat).sum()  # reduce to a scalar for autograd.grad
        # Gradient of energy w.r.t. prediction (NOT params)
        grad_y = torch.autograd.grad(
            energy, y_hat, create_graph=True
        )[0]
        y_hat = y_hat - alpha * grad_y

    # Step 3: Compute loss against ground truth
    loss = loss_fn(y_hat, y_star)

    # Step 4: Backpropagate through EVERYTHING
    # (including all M gradient steps) to update theta
    loss.backward()
    optimizer.step()
    return loss.item()

The crucial detail is create_graph=True in the torch.autograd.grad call. Without this flag, PyTorch discards the computation graph after computing the gradient, making it impossible to backpropagate through the optimization steps. With it, the entire M-step optimization trajectory is part of the computation graph.

The GAN analogy: During the FORWARD pass (steps 1-2), the EBM acts as a GAN discriminator — it evaluates the compatibility of the candidate prediction with the context. During the BACKWARD pass (steps 3-4), the optimization process acts as a GAN generator — the gradients flow through the optimization steps to update θ so that gradient descent produces predictions closer to y*. Unlike GANs, the verifier and generator are the SAME model, avoiding adversarial instability.

Worked example: one training step

Let's trace through with concrete numbers. Suppose we're training on a tiny vocabulary of 5 tokens: {cat, dog, the, runs, fast}. The context is "the dog ___" and the ground truth next token is "runs" (one-hot: [0, 0, 0, 1, 0]).

Step 0: ŷ0 = [0.3, −0.7, 1.2, −0.4, 0.8] (random Gaussian). Energy: E = 12.5.

Gradient: ∇y E = [0.1, −0.3, 0.6, −0.9, 0.5]. The model "knows" it should push probability toward index 3 (runs), since that dimension has the most negative gradient.

Update (α = 0.5): ŷ1 = [0.3 − 0.05, −0.7 + 0.15, 1.2 − 0.3, −0.4 + 0.45, 0.8 − 0.25] = [0.25, −0.55, 0.9, 0.05, 0.55]

Step 1 energy: E = 6.8. Better, but still high.

Step 2: Another gradient step. ŷ2 = [0.1, −0.3, 0.4, 0.8, 0.2]. Energy: 2.1. Now "runs" has the highest logit.

Loss: Cross-entropy between softmax(ŷ2) ≈ [0.16, 0.11, 0.22, 0.33, 0.18] and one-hot [0, 0, 0, 1, 0]. Loss = −log(0.328) ≈ 1.11.

Backprop: This loss gradient flows backward THROUGH both gradient descent steps AND the energy function to update θ. The model learns to shape Eθ so that future gradient descent starting from random noise will converge more quickly to "runs".
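The arithmetic in this worked example can be reproduced in a few lines. The sketch below recomputes the first update from the stated ŷ0, gradient, and α, then evaluates the softmax and cross-entropy for the stated ŷ2 (which is taken from the text rather than recomputed, since the second gradient is not given).

```python
import math


def gradient_step(y, grad, alpha):
    """One update on the prediction: y <- y - alpha * grad."""
    return [yi - alpha * gi for yi, gi in zip(y, grad)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    return -math.log(softmax(logits)[target_index])


y0 = [0.3, -0.7, 1.2, -0.4, 0.8]
grad0 = [0.1, -0.3, 0.6, -0.9, 0.5]
y1 = gradient_step(y0, grad0, alpha=0.5)
print([round(v, 2) for v in y1])  # [0.25, -0.55, 0.9, 0.05, 0.55]

y2 = [0.1, -0.3, 0.4, 0.8, 0.2]  # taken from the text, not recomputed
print([round(p, 2) for p in softmax(y2)], round(cross_entropy(y2, 3), 2))
```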

The S1 vs S2 training modes

The paper discovers an important distinction between two training configurations:

S1 (System 1) models: Optimized for stability and learning convergence. The gradients of predictions are DETACHED between optimization steps — backpropagation doesn't flow through the entire chain. Simpler, more stable, but weaker thinking capabilities.

S2 (System 2) models: Full backpropagation through all optimization steps (no detaching). Include all energy landscape regularization techniques (next chapter). More expensive, but enable genuine System 2 Thinking.

| Property | S1 Models | S2 Models |
| --- | --- | --- |
| Gradient flow | Detached between steps | Full backprop through all steps |
| Replay buffer | No | Yes |
| Langevin dynamics | No | Yes |
| Random step size | No | Yes |
| Training stability | Higher | Lower (more careful tuning) |
| Learning scaling rate | Baseline | ~3.3% faster |
| Thinking ability | Limited | Strong (up to 29% improvement) |

The paper finds that S1 and S2 models have similar scaling rates — the S2 scaling curve is just shifted up (higher initial loss but same slope). This means you can choose S1 for pure pretraining and switch to S2 when you need thinking capabilities, without losing scaling behavior.

The simulation above lets you watch EBT training in action. The energy landscape (blue curve) starts flat and gradually develops a minimum at the ground truth location (red star). Each training iteration updates θ so that the landscape's gradient flow points more strongly toward the target. Toggle between S1 and S2 modes to see the difference in landscape regularization.

Implementation detail: learnable step size

The optimization step size α has a major impact on training. Too large: predictions overshoot and oscillate. Too small: convergence is slow and the model wastes optimization steps. The paper makes α learnable — it's a trainable parameter that the model optimizes alongside its weights. For S2 models, the step size is multiplied by 1500 relative to S1, which the paper finds is necessary for the full backpropagation to produce useful gradients.

What does create_graph=True enable in EBT training?

Chapter 5: Shaping the Energy Landscape

Training an EBM to have well-shaped energy landscapes in high-dimensional space is like trying to sculpt a mountain range where every valley leads to the correct answer. It's hard. The paper identifies three regularization techniques that are critical for enabling System 2 Thinking. Each one addresses a specific failure mode.

Problem 1: The landscape has dead zones

Without regularization, the energy landscape tends to be well-shaped only in a narrow region around the ground truth. Start gradient descent from far away, and you hit flat regions where gradients vanish — the optimization gets stuck. This means thinking for more steps doesn't help.

Solution: Replay Buffer. Instead of always initializing predictions from random noise N(0, I), maintain a buffer of partially-optimized predictions from previous training steps. Occasionally initialize FROM THESE rather than from scratch. This forces the model to learn landscapes that are well-shaped even far from the optimum, because the training loss depends on successfully navigating from these diverse starting points.

# Replay buffer implementation
import random
import torch

replay_buffer = []

def get_initial_prediction(y_star, buffer, p_replay=0.5):
    if buffer and random.random() < p_replay:
        # Start from a previous partial optimization (removed from the buffer)
        y_0 = buffer.pop(random.randrange(len(buffer)))
    else:
        # Start from random noise
        y_0 = torch.randn_like(y_star)
    return y_0

# After optimization, store the result
# (y_hat_M is the prediction after M steps, detached from the graph)
replay_buffer.append(y_hat_M.detach())

Problem 2: Optimization gets trapped in local minima

Even with a replay buffer, gradient descent is deterministic — from a given starting point, it always follows the same path. If there's a local minimum between the starting point and the global minimum, the optimization gets trapped.

Solution: Langevin Dynamics. Add noise to each gradient step:

ŷi+1 = ŷi − α ∇y Eθ(x, ŷi) + ηi,   ηi ~ N(0, σ)

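This is the same idea as simulated annealing or stochastic gradient descent: the noise allows the optimization to escape local minima by occasionally going "uphill." In theory, Langevin dynamics with a suitably annealed noise level concentrates on the lowest-energy regions of the landscape. In practice, with a finite number of steps, it improves the odds of escaping poor local minima rather than guaranteeing the global minimum.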

Connection to physics: Langevin dynamics was originally developed to describe the motion of particles in a fluid — a particle drifts due to forces (the gradient term) and is buffeted by random molecular collisions (the noise term). In EBTs, the "particle" is the prediction, the "force" is the energy gradient, and the "collisions" are exploration noise. Higher temperature (σ) = more exploration. Lower temperature = more exploitation.
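A minimal sketch of the noisy update on a one-dimensional toy landscape follows. The double-well energy, the default values, and the reading of σ as a standard deviation are all assumptions made for illustration.

```python
import random


def double_well_grad(y):
    """Gradient of the toy energy E(y) = (y**2 - 1)**2 + 0.3 * y."""
    return 4.0 * y * (y * y - 1.0) + 0.3


def langevin_step(y, alpha, sigma, rng):
    """One update: y <- y - alpha * dE/dy + eta, with eta ~ N(0, sigma)."""
    noise = rng.gauss(0.0, sigma) if sigma > 0 else 0.0
    return y - alpha * double_well_grad(y) + noise


def run(y_init, steps, alpha=0.05, sigma=0.0, seed=0):
    rng = random.Random(seed)  # seeded so that runs are reproducible
    y = y_init
    for _ in range(steps):
        y = langevin_step(y, alpha, sigma, rng)
    return y


print(run(0.5, steps=200))             # no noise: plain gradient descent
print(run(0.5, steps=200, sigma=0.2))  # with noise: the path can leave its well
```

Without noise, a start at 0.5 always settles in the shallower well near +1. With noise, the trajectory jitters and can occasionally cross into the deeper well near −1.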

Problem 3: Optimization follows the same path every time

With a fixed step size α, gradient descent always takes the same-size steps. This means different training examples always explore the landscape at the same resolution, potentially missing important features.

Solution: Random Step Size and Random Number of Steps. Instead of fixed α and M, randomize both during training:

α ~ Uniform(αmin, αmax),   M ~ Uniform(1, Mmax)

This creates diversity in the optimization trajectories during training. Some trajectories take many small steps (fine exploration). Others take few large steps (coarse exploration). The model must learn a landscape that works well for ALL these trajectories, making it more robust at inference time.
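Sampling the schedule is a one-liner per quantity. The sketch below draws a step size and a step count for each training example. The ranges are placeholder values, not the paper's hyperparameters, and `sample_optimization_schedule` is a hypothetical helper.

```python
import random


def sample_optimization_schedule(alpha_min, alpha_max, max_steps, rng=random):
    """Draw a step size and a step count for one training example."""
    alpha = rng.uniform(alpha_min, alpha_max)
    num_steps = rng.randint(1, max_steps)  # inclusive on both ends
    return alpha, num_steps


rng = random.Random(0)
schedules = [sample_optimization_schedule(0.05, 0.5, 4, rng) for _ in range(5)]
for alpha, num_steps in schedules:
    print(round(alpha, 3), num_steps)
```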

Ablation: every technique matters

The paper provides an ablation study on the BigBench Dyck Languages benchmark (an out-of-distribution reasoning task). Results with "Thinking Longer" (more optimization steps) and "Self-Verification" (best-of-N with energy):

| Configuration | Thinking Longer (%↑) | Thinking + Self-Verification (%↑) |
| --- | --- | --- |
| No Random Step Size | −1.47 | 0.19 |
| No Random Num Steps | 0.00 | 9.65 |
| No Langevin Dynamics | 17.2 | 17.0 |
| No Replay Buffer | 14.8 | 17.8 |
| Full S2 Config | 7.19 | 18.7 |

The surprising finding: Removing Langevin Dynamics IMPROVES thinking-longer performance (17.2 vs 7.19) but HURTS self-verification (17.0 vs 18.7). Without noise, the landscape has sharper minima: great for single-path optimization but bad for exploration. With noise, the landscape is smoother: worse for single-path but better for generating diverse candidates for verification. The full S2 config trades single-path performance for better self-verification, which is the stronger thinking mode.

Note the first row: without random step size, thinking longer actually HURTS performance (−1.47%). This means the model learned a landscape that only works for a specific step size. Randomization during training is critical for robust thinking at inference time.

This is the SHOWCASE simulation. The energy landscape is shown with three regularization controls: Replay Buffer (toggles starting points), Langevin Dynamics (adds noise to steps), and Random Step Size (varies step magnitude). Start by running gradient descent with no regularization — watch how the prediction gets stuck in a local minimum. Then enable each technique and see how the optimization behavior changes. With all three enabled, the prediction reliably reaches the global minimum.

Why does removing Langevin Dynamics improve "thinking longer" but hurt "self-verification"?

Chapter 6: System 2 Thinking

We've built the architecture. We've trained it. Now let's see how EBTs actually "think" at inference time. This is where the payoff lives.

Two thinking strategies

The paper explores two complementary approaches to inference-time thinking:

Strategy 1: Thinking Longer. Run more optimization steps on a single prediction. Each step refines the prediction by following the energy gradient. More steps = more refinement = better prediction. This is like a student spending more time checking and revising a single answer.

Strategy 2: Self-Verification (Best-of-N). Generate N independent predictions (each starting from different random initializations), optimize each for M steps, then select the prediction with the LOWEST energy. This is like a student writing N different answers and submitting the one they're most confident about.

Algorithm 2: Inference with self-verification

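# Algorithm 2: Inference with Verification
import torch

def infer(x, E_theta, alpha, M, N, prediction_dim):
    # E_theta(x, y_hat) must return a scalar energy tensor
    best_y, best_energy = None, float('inf')

    for j in range(N):  # N independent samples
        # fresh random start; requires_grad so we can differentiate w.r.t. y_hat
        y_hat = torch.randn(prediction_dim, requires_grad=True)

        for i in range(M):  # M optimization steps
            energy = E_theta(x, y_hat)
            grad_y = torch.autograd.grad(energy, y_hat)[0]
            # gradient step on the prediction; detach so each step builds a fresh graph
            y_hat = (y_hat - alpha * grad_y).detach().requires_grad_(True)

        # Self-verification: keep the lowest energy prediction
        final_energy = E_theta(x, y_hat).item()
        if final_energy < best_energy:
            best_energy = final_energy
            best_y = y_hat.detach()

    return best_y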

The key insight: the energy scalar serves DOUBLE DUTY. During optimization, it provides the gradient signal. After optimization, it provides the verification signal. No external verifier needed. No reward model. No fine-tuning. The model is its own critic.
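
To make the double-duty idea concrete, here is a dependency-free toy version of the same loop. It is a minimal sketch under loud assumptions: the 1-D energy `toy_energy`, its analytic gradient, the helper name `best_of_n`, and all constants are invented for illustration and stand in for a learned E_theta.

```python
import random


def toy_energy(y):
    """Hypothetical double-well energy: global minimum near y = -1.04."""
    return (y * y - 1.0) ** 2 + 0.3 * y


def toy_energy_grad(y):
    """Analytic gradient of toy_energy."""
    return 4.0 * y ** 3 - 4.0 * y + 0.3


def best_of_n(n_candidates, n_steps, alpha, rng):
    """Optimize several random starts, then keep the lowest-energy result."""
    best_y, best_energy = None, float("inf")
    for _ in range(n_candidates):
        y = rng.uniform(-2.0, 2.0)           # fresh random start
        for _ in range(n_steps):
            y -= alpha * toy_energy_grad(y)  # energy used as gradient signal
        e = toy_energy(y)                    # same energy used as verifier
        if e < best_energy:
            best_y, best_energy = y, e
    return best_y, best_energy


best_y, best_e = best_of_n(n_candidates=8, n_steps=200, alpha=0.05,
                           rng=random.Random(0))
```

Starts that land in the right-hand basin get stuck in the shallower well; comparing final energies is what lets the loop discard them without any external scorer.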

Thinking Longer: results on language

The paper tests thinking longer on four Out-of-Distribution (OOD) benchmarks. Standard Transformer++ models cannot benefit from more forward passes — they make the same prediction every time (deterministic feed-forward). EBTs show up to 29% improvement with more thinking steps:

The visualization above shows the "Thinking Longer" experiment. The orange line (Transformer++) is flat — no improvement from additional forward passes, because the model can't revise its predictions. The blue curve (EBT) improves steadily as the number of forward passes increases. The X-axis is the number of forward passes (each requiring a full forward + backward through the transformer). The Y-axis is perplexity decrease on OOD tasks (lower is better).

Self-Verification: built-in Best-of-N

Traditional Best-of-N (BoN) sampling with language models requires a SEPARATE verifier or reward model. EBTs have one built in:

1. Generate N = 5 candidates: ŷ1, ..., ŷ5 (each optimized from a random initialization for M steps).
2. Score each with its energy: E(x, ŷ1) = 2.3, E(x, ŷ2) = 0.8, E(x, ŷ3) = 3.1, E(x, ŷ4) = 1.5, E(x, ŷ5) = 1.1.
3. Select the minimum: y* = ŷ2 (energy 0.8, the candidate most compatible with the context).

This scales with training data: at small scale (5B tokens trained), BoN-10 barely improves over BoN-2 (and sometimes hurts due to adversarial low-energy samples). At larger scale (30B tokens), BoN-10 gives significant gains, because the energy landscape becomes smoother and more reliable.

Uncertainty estimation for free

An unexpected benefit: per-token energy values reveal which tokens the model is uncertain about. Easy tokens (e.g., "the", "is", "a") converge to low energy quickly. Hard tokens (e.g., "brown", "research", "problem") maintain higher energy across steps.

The paper visualizes this with energy heatmaps across tokens and thinking steps. Common, predictable tokens show green (low energy) after 1-2 steps. Rare, context-dependent tokens remain yellow-red (high energy) even after 10 steps. This happens without any explicit uncertainty training — it emerges naturally from the energy formulation.

Why OOD performance improves MORE: The paper finds a striking linear trend: as data becomes more out-of-distribution, thinking helps MORE. On in-distribution data, thinking gives modest gains. On highly OOD data, thinking gives large gains. This mirrors human cognition — System 2 Thinking is most valuable precisely when problems are unfamiliar.

The thinking-generalization connection

Perhaps the paper's most profound finding: EBTs generalize better than Transformer++ even WITHOUT thinking at inference time. Despite slightly worse pretraining perplexity (33.43 vs. 31.36 for Transformer++), EBTs achieve lower (better) downstream perplexity on 3 out of 4 benchmarks (EBT vs. Transformer++: GSM8K 43.3 vs. 49.6, BB Math QA 72.6 vs. 79.8, BB Dyck 125.3 vs. 131.5). SQuAD is the exception (53.1 vs. 52.3). With thinking, the gap widens further.

Why? The paper hypothesizes it's because verification generalizes better than generation. A verifier trained on in-distribution data can still correctly assess predictions on OOD data, because checking correctness is often independent of how the data was generated.

Why does System 2 Thinking help MORE on out-of-distribution data?

Chapter 7: Scaling Laws

Scaling behavior is one of the most predictive indicators of an architecture's future potential: a model that scales well at small size is likely to remain competitive at large size. The EBT paper's most compelling quantitative contribution is demonstrating that EBTs scale FASTER than Transformer++ across ALL measured axes.

What is a scaling rate?

Following the Chinchilla framework, the paper models loss as a power law:

L(C) = β · C^(−α) + E

where L is the loss, C is the compute budget (FLOPs), β is a constant, α is the scaling exponent (higher = faster improvement), and E is the irreducible entropy of the data.

In log-log space, subtracting E, this becomes a line:

log(L − E) ≈ −α log(C) + log(β)

The slope α is what matters. A steeper slope means the model extracts more performance from each additional unit of compute. The paper compares the slopes of EBTs vs. Transformer++ across six independent axes.
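
The log-log relationship above can be turned into a small fitting routine. The sketch below is illustrative only: the data points are synthetic (generated from made-up values of α, β and E), the function name `fit_scaling_exponent` is hypothetical, and E is assumed to be known in advance, whereas in practice it would have to be estimated alongside the other parameters.

```python
import math


def fit_scaling_exponent(compute, loss, irreducible):
    """Least-squares fit of log(L - E) = -alpha * log(C) + log(beta)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l - irreducible) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)  # (alpha, beta)


# Synthetic points drawn exactly from L(C) = beta * C^(-alpha) + E
true_alpha, true_beta, true_E = 0.05, 20.0, 1.7
compute = [10.0 ** k for k in range(15, 21)]
loss = [true_beta * c ** (-true_alpha) + true_E for c in compute]

alpha_hat, beta_hat = fit_scaling_exponent(compute, loss, true_E)
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating α and β. Comparing two architectures then amounts to comparing their fitted slopes.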

Learning scalability (6 axes)

The paper conducts scaling experiments on RedPajama V2 text (66M training, 33K validation samples) with the GPT-NeoX tokenizer (50,277 vocab). Results across all six axes:

Scaling Axis | EBT Faster By | How Measured
Data (# tokens) | 35.98% | Validation perplexity vs. training tokens (1B-30B)
Batch size | 28.66% | Val PPL vs. batch size (4K-48K tokens)
Depth (# layers) | 3.29% | Val PPL vs. transformer depth (2-14 blocks)
Parameters (non-embed) | 8.97% | Val PPL vs. total non-embedding parameters (6M-396M)
FLOPs | 8.97% | Val PPL vs. training FLOPs
Width (embed dim) | 0.62% | Val PPL vs. embedding dimension (384-2048)

The standout: 35.98% faster scaling on DATA is enormous. It means EBTs extract more learning from each additional token of training data. This matters because high-quality training data is becoming scarce. If the trend holds at the 15T-token scale of modern foundation models (well beyond the 1B-30B tokens measured here), EBTs would have a significant advantage.

Note that the width scaling advantage is only 0.62%, so the two architectures are nearly identical on that axis. This makes sense: width primarily affects the model's representational capacity, which is orthogonal to the energy-based training paradigm. By the reported numbers, the big wins are in data and batch-size scaling (35.98% and 28.66%), while depth shows a smaller 3.29% advantage.

Understanding the scaling curves

The interactive chart above shows the scaling laws for EBTs (blue) vs. Transformer++ (orange) across the six axes. Select different axes using the buttons. Both curves follow power laws in log-log space, but the EBT line has a consistently steeper slope. The gap between the lines grows with scale — larger models show bigger advantages for EBTs.

Thinking scalability

Beyond learning scalability, the paper also measures thinking scalability — how much improvement thinking gives as a function of model scale.

The key metric is System Two Thinking (STT):

STT(x, θ, F) = E_x[ P(x, θ, F) / P(x, θ, F_0) − 1 ]

where P(x, θ, F) is the performance with F function evaluations (forward passes), and F_0 is the minimum number of evaluations. STT measures the percentage improvement from thinking.
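
A minimal sketch of how this metric could be computed from per-example scores follows. The function name `stt` and the example numbers are hypothetical, and the sketch assumes P is a score where higher is better; a lower-is-better measure such as perplexity would need to be converted first.

```python
def stt(perf_with_thinking, perf_baseline):
    """Mean relative improvement of P(x, theta, F) over P(x, theta, F_0)."""
    ratios = [p / p0 - 1.0
              for p, p0 in zip(perf_with_thinking, perf_baseline)]
    return sum(ratios) / len(ratios)


# Made-up per-example scores at F forward passes and at the minimum F_0
perf_F = [0.60, 0.45, 0.90]
perf_F0 = [0.50, 0.45, 0.75]

improvement = stt(perf_F, perf_F0)  # fraction; multiply by 100 for percent
```

Averaging the per-example ratios (rather than taking the ratio of averages) follows the placement of the expectation in the formula above.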

Results show:

Video scaling

The paper also tests on continuous modalities using Something-Something V2 (video prediction). Scaling results:

These large advantages in continuous modalities may be because EBTs naturally model continuous distributions through their energy landscape, while standard transformers must discretize continuous data (e.g., via Vector Quantization) or use proxy objectives (e.g., MSE loss), losing information in the process.

Data efficiency + generalization: EBTs' better data scaling is linked to their better generalization. The energy landscape paradigm provides a more efficient inductive bias: instead of memorizing input-output mappings, the model learns a COMPATIBILITY function that generalizes to new input-output pairs. This is analogous to how a spell-checker (verifier) works on any text, while a text-generator needs exposure to each specific writing style.
Which scaling axis shows the largest advantage for EBTs over Transformer++?

Chapter 8: Results

The paper presents results across three domains: autoregressive language modeling, bidirectional image denoising, and data-constrained Sudoku reasoning. Let's examine each with the critical eye of a researcher.

Language modeling results

All language models are pretrained on RedPajama V2 using GPT-NeoX tokenizer. Downstream evaluation uses four benchmarks of increasing difficulty:

Benchmark | Task | Transformer++ (PPL↓) | EBT (PPL↓) | Winner
Pretraining perplexity | n/a | 31.36 | 33.43 | Transformer++
GSM8K | Math word problems | 49.6 | 43.3 | EBT
SQuAD | Reading comprehension | 52.3 | 53.1 | Transformer++
BB Math QA | Math reasoning | 79.8 | 72.6 | EBT
BB Dyck | Bracket matching (OOD) | 131.5 | 125.3 | EBT

The generalization paradox: EBTs have WORSE pretraining perplexity (33.43 vs 31.36) but BETTER downstream performance on most tasks. This means EBTs are not simply better at memorizing the training distribution; they learn more generalizable representations. The paper attributes this to the verification paradigm: a verifier trained on in-distribution data can assess OOD predictions, while a generator trained on in-distribution data can only produce in-distribution outputs.

Image denoising results

For continuous modalities, the paper trains bidirectional EBTs on COCO 2014 (128×128 images) using the SD-XL VAE for latent encoding. The comparison is against Diffusion Transformers (DiTs):

Metric | DiT | EBT | EBT Advantage
In-Dist PSNR ↑ | 26.58 | 27.25 | +0.67 dB
In-Dist MSE Pixel ↓ | 142.98 | 122.55 | −14.3%
OOD PSNR ↑ | 19.56 | 23.29 | +3.73 dB
OOD MSE Pixel ↓ | 718.7 | 305.2 | −57.5%
ImageNet Top-1 Acc ↑ | 0.31% | 5.32% | ~17×
ImageNet Top-5 Acc ↑ | 1.36% | 13.2% | ~10×
Forward passes needed | 100-300 | 1-3 | 99% fewer
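
The PSNR and MSE rows are two views of the same error, and the "EBT Advantage" column is simple arithmetic on the other two. The sketch below shows both conversions. It assumes "MSE Pixel" is measured on a 0-255 pixel scale (an assumption, not something the table states), and the helper names are hypothetical.

```python
import math


def psnr_from_mse(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def percent_change(new, old):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# In-distribution MSE values for DiT and EBT
dit_psnr = psnr_from_mse(142.98)
ebt_psnr = psnr_from_mse(122.55)

# Relative MSE reduction, in-distribution and OOD
in_dist_change = percent_change(122.55, 142.98)
ood_change = percent_change(305.2, 718.7)
```

Under that 0-255 assumption, the in-distribution MSE values land within about 0.01 dB of the listed PSNR figures. Small differences are expected anyway if PSNR was averaged per image rather than computed from the mean MSE.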

The image results are striking in three ways:

1. Better denoising with 99% fewer passes. EBTs need 1-3 forward passes for comparable-or-better results vs. DiTs' 100-300 denoising steps. This is because EBTs directly minimize energy (a scalar), while DiTs predict noise at each step (a high-dimensional output).

2. Massively better OOD denoising. The OOD PSNR gap (+3.73 dB) is huge in image quality terms. This confirms the verification generalization hypothesis from the language experiments.

3. 10-17× better image classification. Linear probe classification accuracy on ImageNet-1K shows EBTs learn dramatically more useful representations. DiTs learn to predict noise; EBTs learn to understand images.

Why such better representations? DiTs are trained to predict noise — a randomly sampled Gaussian — added to the image. Their representations are optimized for noise estimation, not image understanding. EBTs are trained to verify compatibility between noisy and clean images in the INPUT SPACE. Their representations must capture what makes an image "correct" — which requires deeper understanding of image structure.

Sudoku: data-constrained reasoning

To test generalization in data-limited settings, the paper follows Du et al. (2024) in training models on Sudoku from the SAT-Net / RRN datasets. Given a partially filled board (some cells already contain digits 1-9), the task is to predict the complete solution.

Architecture | Test Accuracy
Feed-Forward Transformer | 0.03%
RNN | 17.7%
EBT | 29.7%

The feed-forward transformer essentially fails (0.03%). It can't do multi-step constraint reasoning in a single pass. The RNN does better by iterating its hidden state, but still struggles. The EBT, by optimizing its prediction to satisfy all Sudoku constraints simultaneously (each constraint contributing to the energy), achieves 29.7%. That is a 67% relative improvement over the RNN (29.7 / 17.7 ≈ 1.68) and roughly 990× the transformer's accuracy (29.7 / 0.03).
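
To give a feel for "each constraint contributing to the energy", the sketch below scores a board with a hand-written energy that counts duplicate digits in every row, column, and 3×3 box. This is only a stand-in chosen for illustration: an EBT learns its energy from data rather than using explicit rules, and the function names here are hypothetical.

```python
def make_valid_board():
    """Build one valid solved Sudoku grid with a standard shifting pattern."""
    return [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
            for r in range(9)]


def units(board):
    """Return all 27 constraint groups: 9 rows, 9 columns, 9 boxes."""
    rows = [list(row) for row in board]
    cols = [[board[r][c] for r in range(9)] for c in range(9)]
    boxes = [[board[r][c]
              for r in range(br, br + 3)
              for c in range(bc, bc + 3)]
             for br in (0, 3, 6) for bc in (0, 3, 6)]
    return rows + cols + boxes


def constraint_energy(board):
    """Toy energy: number of duplicated digits summed over all units."""
    return sum(9 - len(set(unit)) for unit in units(board))


solved = make_valid_board()
corrupted = [row[:] for row in solved]
corrupted[0][0] = corrupted[0][1]  # introduce one wrong digit

solved_energy = constraint_energy(solved)
corrupted_energy = constraint_energy(corrupted)
```

A valid solution has zero energy, while a single wrong digit raises it through the row, column, and box that contain the cell. That is the sense in which lower energy means "more constraints satisfied".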

What EBTs CANNOT do (limitations)

The paper is commendably honest about limitations:

The simulation above compares EBT vs. DiT on image denoising. On the left, a DiT progressively removes noise over many steps. On the right, an EBT minimizes energy in just a few steps, converging to a cleaner result. Adjust the "noise level" slider to see how both methods handle different corruption levels. Notice how the gap widens for higher noise (more OOD) — EBTs degrade more gracefully.

Why do EBTs learn better image representations than Diffusion Transformers despite both operating in the same latent space?

Chapter 9: Connections

Cheat sheet: every key equation

Equation | What It Does | Symbols
E_θ(x, y) → ℝ | Energy function: assigns compatibility score | x = context, y = candidate prediction, θ = model params
ŷ_{i+1} = ŷ_i − α ∇_y E_θ(x, ŷ_i) | Gradient descent on predictions (thinking step) | α = step size, ∇_y = gradient w.r.t. prediction
ŷ_{i+1} = ŷ_i − α ∇_y E_θ(x, ŷ_i) + η, η ~ N(0, σ) | Langevin dynamics (thinking with exploration) | σ = noise magnitude, η = exploration noise
p_θ(y ∣ x) = e^(−E_θ(x, y)) / Z(θ) | Boltzmann distribution (theoretical, not computed) | Z(θ) = partition function (intractable)
L(C) = β · C^(−α) + E | Scaling law: loss vs. compute | α = scaling exponent, C = FLOPs, E = irreducible entropy
STT(x, θ, F) = E_x[ P(x, θ, F) / P(x, θ, F_0) − 1 ] | System 2 Thinking metric | F = forward passes, F_0 = minimum forward passes
FLOPs_EBT ≈ M × 10N × 2 | Per-step compute cost (M optimization steps) | N = non-embedding params, M = optimization steps

EBTs vs. Diffusion Models: a detailed comparison

EBTs and diffusion models are closely related — the paper argues that diffusion models are a special case of (implicit) EBMs. The key differences:

Property | Diffusion Models | EBTs
Supervision | At every timestep (noise prediction) | Only at the end (final prediction vs. target)
Update rule | Predict noise, follow denoising schedule | Gradient descent on energy (free-form)
Verification | Implicit (no energy scalar) | Explicit (energy scalar)
# Steps | Fixed schedule (100-1000) | Dynamic (1-N, any number)
Self-verification | Requires external model | Built-in (compare energies)
Discrete data | Requires discretization tricks | Works natively
Uncertainty | Not directly modeled | Energy = uncertainty

The connection is deepened by the paper's observation that both diffusion models and EBMs predict the gradient of the data density. Diffusion models learn ∇_x log p(x_t | x_0) (score function). EBMs learn ∇_y E_θ(x, y) (energy gradient). Both use these gradients to iteratively refine predictions.
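
The link between the two gradients can be made explicit with a short derivation from the Boltzmann form in the cheat sheet. It relies only on the fact that the partition function does not depend on y; the notation is reconstructed from this article rather than copied from the paper.

```latex
\begin{align*}
p_\theta(y \mid x) &= \frac{e^{-E_\theta(x, y)}}{Z(\theta)} \\
\log p_\theta(y \mid x) &= -E_\theta(x, y) - \log Z(\theta) \\
\nabla_y \log p_\theta(y \mid x) &= -\nabla_y E_\theta(x, y)
\end{align*}
```

So a gradient descent step on the energy is a gradient ascent step on the model's log-density, and the intractable Z(θ) never has to be computed to take that step.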

The Reversal Curse connection

The paper makes a fascinating connection to the "Reversal Curse" in LLMs — the phenomenon where models trained on "A is B" fail to learn "B is A." In standard transformers, only A's tokens receive gradient updates during the prediction of B (because B is in the output space). In EBTs, BOTH A and B are in the input space, so both receive gradient updates. This could fundamentally resolve the asymmetric learning problem.

Related work in context

Work | Relation to EBTs
Hopfield Networks (1982) | Energy-based model for associative memory; EBTs scale this concept with transformers
GANs (2014) | Separate generator+discriminator; EBTs unify them in one model
Du & Mordatch (2019) | Pioneered optimization-based EBM training; EBTs scale it to transformers
DiT (2023) | Transformer backbone for diffusion; EBT bidirectional variant builds on DiT
o1 / DeepSeek-R1 | System 2 via RL + verifiers; EBTs achieve it via energy minimization alone
LeCun's JEPA (2022) | Energy-based architecture for autonomous intelligence; EBTs are a concrete realization
Scaling Laws (Kaplan 2020, Chinchilla 2022) | EBTs follow and improve upon established scaling law frameworks

Open questions and future directions

The big picture

EBTs represent a fundamental rethinking of how neural networks should work. Instead of the feedforward paradigm (input → one pass → output), EBTs propose an optimization paradigm (input + guess → iterate until convergence → verified output). This aligns with how humans actually think: we don't arrive at answers in one step. We hypothesize, check, revise, and iterate.

The paper demonstrates that this paradigm is not just theoretically appealing but practically competitive — scaling faster than the dominant Transformer++ approach while adding capabilities (verification, dynamic computation, uncertainty estimation) that current models fundamentally lack.

The deepest implication: If verification truly is easier than generation, then the entire field may be doing things the hard way. Training models to generate directly (softmax over vocabulary) is asking them to solve a harder problem than necessary. Training them to verify (scalar energy score) and generate by optimization may be the more natural — and more scalable — approach. EBTs are the first concrete evidence that this is true at scale.
What is the fundamental difference between how EBTs and diffusion models receive supervision during training?