What if your model didn't just predict — but verified its predictions by minimizing energy? EBTs assign a scalar energy to every input-prediction pair and think by gradient descent on that landscape, enabling System 2 reasoning from unsupervised learning alone.
You're a student taking a math exam. You write down an answer to a problem — but before moving on, you pause. You check your work. You substitute your answer back into the original equation. Does it satisfy the constraints? If not, you revise. You keep checking until you're confident.
This is System 2 Thinking: the slow, deliberate, effortful reasoning that psychologist Daniel Kahneman, drawing on his long collaboration with Amos Tversky, popularized as the counterpart to System 1 (fast, automatic, intuitive). System 1 is catching a ball. System 2 is computing 17 × 24 in your head.
Modern AI has gotten remarkably good at System 1. A GPT-class model sees "The dog caught the ___" and instantly outputs "ball" in a single feed-forward pass. No checking. No revision. Just pattern matching at light speed.
But what happens when the problem is hard? When the model encounters something it's never seen during training? When the correct answer requires multi-step reasoning?
Recent attempts at System 2 Thinking in AI (o1, DeepSeek-R1, Chain of Thought) have shown exciting results, but they share critical limitations:
| Approach | Dynamic Compute? | Self-Verification? | Modality | Requires Supervision? |
|---|---|---|---|---|
| Chain of Thought | Partial (more tokens) | No | Text only | No, but unreliable |
| o1 / DeepSeek-R1 | Yes (more tokens) | Implicit | Text only | Yes (RL + verifiers) |
| Best-of-N Sampling | Yes | Requires external verifier | Text only | Yes (trained verifier) |
| Diffusion Models | Yes (more steps) | No explicit verification | Continuous only | No |
| EBTs (this paper) | Yes (per prediction) | Yes (energy scalar) | Both discrete & continuous | No — unsupervised |
Notice the bottom row. EBTs are the only approach that has all four properties: dynamic computation per prediction, built-in self-verification, works across modalities, and requires no additional supervision beyond standard pretraining.
That's an extraordinary claim. To understand why it works, we need to step back and ask: what does it mean for a model to "verify" its own predictions?
The paper identifies two cognitive facets that current models mostly lack:
Facet 1: Dynamic Allocation of Computation. Deciding whether to change careers takes more thought than deciding what to eat for lunch. Humans naturally allocate varying amounts of effort depending on difficulty. Feed-forward transformers use the same depth (same FLOPs) for every single prediction. RNNs and diffusion models can vary computation, but lack explicit verification.
Facet 2: Verification of Predictions. Humans don't just generate answers — they check them. A student doesn't just write "x = 5" and move on; they substitute back into the equation to verify. Standard transformers have no mechanism to assess the quality of their own outputs. They predict in the output space, not in a space where the prediction can be compared against the input.
The canvas above shows the key architectural difference. A standard autoregressive transformer maps context to a prediction in one pass. An EBT takes a candidate prediction, feeds it alongside the context, and outputs an energy scalar — a single number indicating how compatible this prediction is with the context. Lower energy = better match.
Now prediction becomes optimization: start with a random guess, compute the energy, take the gradient of energy with respect to the prediction, update the prediction to lower the energy, and repeat. Each iteration is a "thinking step." More steps = more thinking = better prediction.
Here's a fundamental asymmetry that has been hiding in plain sight for decades: verification is easier than generation.
Think about it. Given a completed Sudoku puzzle, you can verify it in seconds: check each row, column, and 3×3 block. But solving a Sudoku from scratch? That takes minutes of careful reasoning. Given a proof, a mathematician can verify each step far more easily than they could have discovered the proof in the first place. Given a candidate solution to a problem in NP, you can check it in polynomial time, but for the hardest such problems (the NP-complete ones) no polynomial-time algorithm for finding a solution is known.
This asymmetry is well-known in complexity theory (P vs NP), cryptography (verifying a signature vs. forging one), and everyday life (proofreading vs. writing). Yet current AI models do the HARD thing — they generate directly in one shot — and never attempt the EASY thing: verification.
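To make the "checking is cheap" half of that asymmetry concrete, here is a minimal sketch of a Sudoku verifier. The function name and the pattern used to build a valid grid are illustrative choices, not something taken from the paper.

```python
def is_valid_sudoku(grid):
    """Return True if a completed 9x9 grid satisfies every Sudoku constraint."""
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(grid[r][c] for r in range(9)) for c in range(9)]
    blocks = [
        set(grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3))
        for br in range(0, 9, 3)
        for bc in range(0, 9, 3)
    ]
    # 27 set comparisons in total: 9 rows, 9 columns, 9 blocks.
    return all(group == digits for group in rows + cols + blocks)


# A valid grid built from a simple shifting pattern (illustrative only).
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

# Copy the grid and duplicate a digit within the first row to break it.
broken = [row[:] for row in solved]
broken[0][0] = broken[0][1]

print(is_valid_sudoku(solved), is_valid_sudoku(broken))
```

The verifier does a fixed, small amount of work regardless of how hard the puzzle was to solve, which is the property the energy function is meant to exploit.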
This is the core idea of Energy-Based Models (EBMs). An EBM is a function Eθ(x, y) that takes an input x (context) and a candidate prediction y, and outputs a single scalar: the energy. Low energy means "y is a good prediction given x." High energy means "y is incompatible with x."
Let's make this concrete with next-token prediction. In a standard transformer:
In an EBT:
The EBT did something remarkable: it ran gradient descent on its own predictions. Not on the model weights — those are frozen at inference time. On the predictions themselves. Each iteration is a "thinking step" where the model refines its guess to better match the context.
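A toy version of this loop fits in a few lines. The sketch below assumes a hand-written one-dimensional energy, `(y - 2x)^2`, with an analytic gradient. In a real EBT the energy is a transformer and the gradient comes from autograd, so treat every name here as hypothetical.

```python
def energy(x, y):
    """Toy energy: low when y is close to 2 * x (a stand-in for 'compatible')."""
    return (y - 2.0 * x) ** 2


def energy_grad_y(x, y):
    """Analytic gradient of the toy energy with respect to the prediction y."""
    return 2.0 * (y - 2.0 * x)


def think(x, y0, alpha=0.1, steps=20):
    """Refine a prediction by gradient descent on the energy; no weights change."""
    y = y0
    trace = [energy(x, y)]
    for _ in range(steps):
        y = y - alpha * energy_grad_y(x, y)
        trace.append(energy(x, y))
    return y, trace


# Start from a poor guess and let the loop pull it toward the low-energy value.
y_final, energies = think(x=3.0, y0=-5.0)
print(round(y_final, 3), round(energies[0], 2), round(energies[-1], 4))
```

Only `y` is updated inside the loop. The function that defines the energy stays fixed, which mirrors the frozen weights at inference time.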
This formulation naturally gives us both cognitive facets:
Dynamic computation (Facet 1): Easy predictions converge in 2-3 steps. Hard predictions need 10+ steps. The model automatically allocates more computation to harder problems, because harder problems have more complex energy landscapes that take longer to optimize.
Self-verification (Facet 2): The energy value itself IS the verification. After thinking, the model can check: "Is this energy low enough? Should I keep thinking?" It can also generate multiple candidates and pick the one with the lowest energy — a form of best-of-N sampling with a built-in verifier, requiring no external model.
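Both facets can be seen in one small sketch: keep taking gradient steps until the energy falls below a threshold, and report how many steps that took. The quadratic energy, the threshold `tol`, and the notion of "easy" versus "hard" as distance from the minimum are all assumptions made for illustration.

```python
def energy(y, target):
    """Toy energy with its minimum at `target` (a stand-in for the context's optimum)."""
    return (y - target) ** 2


def energy_grad(y, target):
    return 2.0 * (y - target)


def think_until_confident(y0, target, alpha=0.1, tol=1e-3, max_steps=100):
    """Take gradient steps until the energy drops below tol; return (y, steps_used)."""
    y = y0
    for step in range(max_steps):
        if energy(y, target) < tol:  # the energy doubles as the stopping check
            return y, step
        y = y - alpha * energy_grad(y, target)
    return y, max_steps


# "Easy": the initial guess is already near the minimum. "Hard": it is far away.
easy_y, easy_steps = think_until_confident(y0=1.2, target=1.0)
hard_y, hard_steps = think_until_confident(y0=10.0, target=1.0)
print(easy_steps, hard_steps)
```

The same scalar drives the update and decides when to stop, so the harder case automatically receives more steps.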
The simulation above shows the verification-generation asymmetry in action. On the left, a standard transformer generates in one pass — fast but can't improve. On the right, an EBT starts from a random prediction and iteratively refines it by minimizing energy. Watch how the EBT's prediction progressively shifts toward the correct distribution over multiple steps.
Let's trace what happens when an EBT predicts the next token for the context "The dog caught the ___". The vocabulary has 50,277 tokens. The prediction y is a distribution over these tokens (a vector of 50,277 logits).
Step 0 (initialization): y0 ~ N(0, I). All 50,277 logits drawn from a standard Gaussian. This is pure noise — essentially uniform over the vocabulary.
Step 0 energy: Eθ(x, y0) = 14.2. Very high. The random prediction is incompatible with the context.
Step 1: Compute ∇y Eθ(x, y0). This 50,277-dimensional gradient tells us how to adjust each logit to reduce energy. Update: y1 = y0 − α ∇y E. Energy drops to 8.7.
Step 2: Repeat. y2 = y1 − α ∇y Eθ(x, y1). Energy drops to 3.1. The distribution is starting to concentrate on plausible tokens: "ball", "frisbee", "stick".
Step 3: y3 = y2 − α ∇y Eθ(x, y2). Energy drops to 0.9. Strong peak at "ball".
Each step required a full forward AND backward pass through the transformer to compute E and ∇y E. Three thinking steps = 6 passes (3 forward + 3 backward). This is 6× more compute than a standard transformer's single forward pass. But the model gets to THINK — and the result is better.
Before we dive into the architecture, we need solid footing on Energy-Based Models. The concept has been around for decades — Hopfield networks (1982), Boltzmann machines (1985), contrastive learning — but making them work at scale has been an open problem until now.
An energy function Eθ(x, y) is just a neural network that takes two inputs — some context x and a candidate prediction y — and outputs a single scalar. That's it. No softmax. No probability distribution. Just a number.
The interpretation is physical: think of a ball on a hilly landscape. The ball naturally rolls to the lowest point — the minimum energy state. High hills = high energy = unlikely configurations. Low valleys = low energy = likely configurations.
For a given context x, the energy function defines a landscape over all possible predictions y. The ground truth prediction y* should sit at the bottom of a valley — a local (ideally global) minimum of Eθ(x, ·).
You CAN convert an energy function into a probability distribution using the Boltzmann distribution:
$$p_\theta(y \mid x) = \frac{e^{-E_\theta(x, y)}}{Z(\theta)}$$
where $Z(\theta) = \int e^{-E_\theta(x, y)}\, dy$ is the partition function, an integral over ALL possible predictions y.
Here's the problem: computing Z(θ) requires integrating over every possible y. For a vocabulary of 50,277 tokens, that's a sum over 50,277 terms (doable). But for continuous predictions like images (256 × 256 × 3 real-valued pixels), that integral is over a space of dimension 196,608. Completely intractable.
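For the discrete case the normalization really is just a sum. The sketch below converts a handful of made-up energies into Boltzmann probabilities. The four-word vocabulary and the energy values are invented for illustration.

```python
import math


def boltzmann(energies):
    """Convert energies into probabilities p_i = exp(-E_i) / Z."""
    # Subtracting the minimum energy avoids overflow and cancels in the ratio.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min)) for e in energies]
    z = sum(weights)  # partition function, up to the factor exp(-e_min)
    return [w / z for w in weights]


vocab = ["ball", "frisbee", "stick", "piano"]
energies = [0.9, 2.0, 2.5, 9.0]  # made-up values: lower means more compatible
probs = boltzmann(energies)

for token, p in zip(vocab, probs):
    print(f"{token:8s} {p:.4f}")
```

With a continuous prediction there is no finite list to sum over, which is why EBTs work with unnormalized energies instead of computing Z.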
Historically, there have been two main approaches, and understanding why one fails at scale is crucial to understanding the EBT paper's contribution:
Approach 1: Contrastive methods. Push down the energy of "positive" (real) samples while pushing up the energy of "negative" (fake) samples. The catch: to push up negative energy everywhere in a high-dimensional space, you need an exponentially growing number of negatives. This is the curse of dimensionality — contrastive methods don't scale.
Approach 2: Optimization-based training. Train the EBM so that gradient descent FROM A RANDOM STARTING POINT converges to the ground truth y*. This implicitly shapes the landscape to have a valley at y* without needing negative samples. This is what EBTs use.
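Let's derive the training procedure step by step. We have a training example (x, y*), the energy function Eθ, a step size α, a number of optimization steps M, and a loss function J.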
Forward pass (energy minimization on the prediction):
Start with a random prediction: y0 ~ N(0, I)
For i = 0, 1, ..., M − 1:
$$\hat{y}_{i+1} = \hat{y}_i - \alpha \nabla_{\hat{y}} E_\theta(x, \hat{y}_i)$$
This is gradient descent on the prediction. At each step, we compute the gradient of the energy WITH RESPECT TO y (not θ), and update y to reduce the energy. After M steps, we have a refined prediction ŷM.
Backward pass (update model parameters):
Compute a loss comparing ŷM to the ground truth y*:
$$\mathcal{L}(\theta) = J(\hat{y}_M, y^*)$$
where J can be any standard loss (cross-entropy for discrete, MSE for continuous). Now backpropagate THROUGH THE ENTIRE OPTIMIZATION PROCESS to update θ.
Contrastive methods must explicitly push up energy in an exponentially large space. The optimization approach does something more subtle: by training the model so that gradient descent converges to y*, it implicitly regularizes the landscape to have a single local minimum near y*. The landscape is shaped by the optimization dynamics, not by explicit negative sampling.
Think of it like sculpting. Contrastive methods try to chip away at every piece of marble that isn't the statue (exponential work in high dimensions). The optimization approach shapes the landscape so that water (gradient descent) naturally flows to the statue (the answer). You only need to shape the channels, not remove all the marble.
Here's a subtlety that's easy to miss. During the forward pass, we compute ∇y Eθ — a gradient of E with respect to y. During the backward pass, we differentiate the LOSS (which depends on these gradients) with respect to θ. This is a gradient of a gradient — a second-order derivative.
Computing full Hessians is O(n²) in the number of parameters. But EBTs use Hessian-vector products (HVPs), which can be computed in O(n) time — the same cost as a standard backward pass. In PyTorch, this is implemented via torch.autograd.grad with create_graph=True, which keeps the computation graph alive for a second differentiation.
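To see where the second-order term comes from, it may help to write out the single-step case (M = 1). This is a sketch under the assumption that the initial guess ŷ0 does not depend on θ. The symbol v is introduced here only for the derivation.

```latex
\begin{aligned}
\hat{y}_1 &= \hat{y}_0 - \alpha\, \nabla_{y} E_\theta(x, \hat{y}_0),
  \qquad \mathcal{L}(\theta) = J(\hat{y}_1, y^*) \\[4pt]
\frac{\partial \mathcal{L}}{\partial \theta}
  &= \left(\frac{\partial \hat{y}_1}{\partial \theta}\right)^{\!\top} v,
  \qquad v = \frac{\partial J}{\partial \hat{y}_1} \\[4pt]
  &= -\alpha\, \frac{\partial}{\partial \theta}
     \Big[ \nabla_{y} E_\theta(x, \hat{y}_0)^{\top} v \Big]
  \quad (v \text{ treated as a constant})
\end{aligned}
```

The bracketed quantity is a scalar, so differentiating it with respect to θ is one extra backward pass. The mixed second derivative of E only ever appears multiplied by the vector v, never as a full matrix.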
The total cost per training step grows with the number of optimization steps M. With M = 2 (the paper's default), it is roughly 3.33× more expensive than a standard Transformer++ step, which costs about 6N FLOPs per token, where N is the number of non-embedding parameters. That works out to about 20N FLOPs per token.
The simulation above lets you explore an energy landscape. The blue dot is the current prediction, and the red star is the ground truth. Click "Step" to perform one gradient descent step on the prediction. Watch how the prediction moves downhill toward the energy minimum. The landscape is shaped so that the minimum coincides with the ground truth — this is what training achieves.
Now that we understand the energy-based paradigm, let's see how to actually build a Transformer that functions as an energy model. This is where EBTs differ from standard transformers in subtle but critical ways.
The paper introduces two EBT architectures:
Autoregressive EBT (AR-EBT): A GPT-style decoder-only transformer for next-token prediction. Uses causal attention. This is the primary variant for language modeling experiments.
Bidirectional EBT (Bi-EBT): A BERT-style transformer with full bidirectional attention. Used for image denoising and classification experiments. Built on the DiT (Diffusion Transformer) architecture.
In a standard transformer, the model takes context x and produces a prediction in the OUTPUT space — it generates y directly. In an EBT, the model takes BOTH x and a candidate y and produces an energy scalar. This means predictions must live in the INPUT space, not the output space.
This difference has profound consequences for the attention mechanism.
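In a standard causal transformer with n tokens, the attention mask is lower-triangular: token i can attend to tokens 1 through i. After the causal mask, the n × n attention scores matrix looks like:
$$\begin{bmatrix} a_{11} & -\infty & \cdots & -\infty \\ a_{21} & a_{22} & \cdots & -\infty \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}$$
where a_ij is the score for token i attending to token j, and the −∞ entries become zero after the softmax.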
For an EBT, we have n past (context) tokens z1, ..., zn and we're predicting future tokens ŷn+1, ..., ŷn+k. The attention matrix must be (n+k) × (n+1) and satisfy special constraints:
Rule 1: Each past token zi attends to all previous past tokens z1, ..., zi (standard causal attention).
Rule 2: Each predicted future token ŷj attends to ALL past tokens z1, ..., zn (it can see the full context).
Rule 3: Each predicted future token ŷj attends to ITSELF (to incorporate its own representation), but NOT to other predicted tokens.
Rule 3 is the tricky part. In standard attention, the diagonal comes for free from the QKᵀ computation. But here, each predicted token ŷj is a DIFFERENT prediction (different gradient descent trajectory), so the "self-attention" on the diagonal can't be computed with a single matrix multiply.
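The paper splits the sequence into two groups: the context (past) tokens and the predicted (future) tokens.
For the context tokens, attention is computed normally, as standard causal scaled dot-product attention:
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M_{\text{causal}}\right) V$$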
For the predicted tokens, the paper computes separate Qp, Kp, Vp matrices and constructs the attention in four steps:
This scheme effectively doubles the sequence length (from n to 2n) but thanks to the efficient masking, the total FLOPs are approximately 2× a standard transformer, not 4×.
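The three rules above can be written down as a boolean mask. The sketch below is only an illustration of the rules as stated here, laid out as a square (n+k) × (n+k) matrix for readability. It is not the paper's compact implementation, and the function name is hypothetical.

```python
def ebt_attention_mask(n, k):
    """Boolean mask for n context tokens followed by k predicted tokens.

    mask[i][j] is True when position i may attend to position j.
    """
    size = n + k
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        if i < n:
            # Rule 1: a context token attends causally to context tokens 0..i.
            for j in range(i + 1):
                mask[i][j] = True
        else:
            # Rule 2: a predicted token sees every context token.
            for j in range(n):
                mask[i][j] = True
            # Rule 3: it also sees itself, but no other predicted token.
            mask[i][i] = True
    return mask


# Print the mask for 3 context tokens and 2 predicted tokens.
for row in ebt_attention_mask(n=3, k=2):
    print("".join("x" if allowed else "." for allowed in row))
```

In the printout, the last two rows differ only in which diagonal entry is set, which is exactly the part that a single shared matrix multiply cannot provide.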
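After the transformer blocks, the EBT needs to produce a single energy scalar. For autoregressive EBTs:
$$E_\theta(x, \hat{y}) = \sum_{j} e_j$$
where e_j is the scalar that the energy head outputs for predicted token ŷj.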
The energy is the sum of per-token energy scalars. This means each token contributes independently to the total energy, allowing the model to identify WHICH tokens are problematic (high individual energy) — this is what enables the uncertainty visualizations in the paper's results.
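As a small illustration of why a per-token decomposition is useful, the sketch below sums made-up per-token energies and ranks tokens by how much they contribute. The tokens, values, and helper names are invented for the example.

```python
def total_energy(per_token_energies):
    """Total energy is the sum of the per-token scalars."""
    return sum(per_token_energies)


def most_uncertain(tokens, per_token_energies, top=2):
    """Return the tokens with the highest individual energy, highest first."""
    ranked = sorted(zip(per_token_energies, tokens), reverse=True)
    return [token for _, token in ranked[:top]]


tokens = ["the", "quick", "brown", "fox"]
energies = [0.2, 1.1, 3.4, 0.7]  # made-up values

print(total_energy(energies))
print(most_uncertain(tokens, energies))
```

Because the total is a plain sum, a high total can always be traced back to the specific tokens responsible for it.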
The diagram above shows the complete data flow of an autoregressive EBT. Context tokens (green) flow through normal causal attention. Predicted tokens (orange) are initialized randomly and attend to all context tokens plus themselves. Each transformer block updates both representations. The final energy head produces a scalar per predicted token, which are summed into a total energy.
| Size | Non-Embed Params | # Layers | Embed Dim | # Heads |
|---|---|---|---|---|
| XXS | 6.18M | 6 | 384 | 6 |
| XS | 12.4M | 12 | 384 | 6 |
| Small | 48.8M | 12 | 768 | 12 |
| Medium | 176M | 24 | 1024 | 16 |
| Large | 396M | 24 | 1536 | 16 |
| XL | 708M | 24 | 2048 | 32 |
The architecture follows the Llama 2 / Transformer++ recipe: RMSNorm, RoPE position embeddings, SwiGLU activation, no bias terms. The only additions are the energy head and the modified attention scheme for predicted tokens.
We've seen the architecture. Now the hard part: actually training these models to produce well-shaped energy landscapes. This chapter covers the training algorithms in detail, including the GAN-like duality that makes EBTs work.
Let's walk through a single training step, line by line. We have a training example (x, y*), the EBM Eθ, step size α, number of optimization steps M, and a loss function J.
```python
import torch

# Algorithm 1: EBT Training
def train_step(x, y_star, E_theta, alpha, M, loss_fn, optimizer):
    """One training step; y_star is assumed to be a float tensor."""
    optimizer.zero_grad()

    # Step 1: Initialize prediction from random noise
    y_hat = torch.randn_like(y_star)  # y_0 ~ N(0, I)
    y_hat.requires_grad_(True)

    # Step 2: Run M gradient descent steps on the prediction
    for i in range(M):
        # Reduce to a scalar so autograd.grad needs no grad_outputs
        energy = E_theta(x, y_hat).sum()
        # Gradient of energy w.r.t. prediction (NOT params)
        grad_y = torch.autograd.grad(
            energy, y_hat, create_graph=True
        )[0]
        y_hat = y_hat - alpha * grad_y

    # Step 3: Compute loss against ground truth
    loss = loss_fn(y_hat, y_star)

    # Step 4: Backpropagate through EVERYTHING
    # (including all M gradient steps) to update theta
    loss.backward()
    optimizer.step()
    return loss.item()
```
The crucial detail is create_graph=True in the torch.autograd.grad call. Without this flag, PyTorch does not record the gradient computation itself, making it impossible to backpropagate through the optimization steps. With it, the entire M-step optimization trajectory is part of the computation graph.
Let's trace through with concrete numbers. Suppose we're training on a tiny vocabulary of 5 tokens: {cat, dog, the, runs, fast}. The context is "the dog ___" and the ground truth next token is "runs" (one-hot: [0, 0, 0, 1, 0]).
Step 0: ŷ0 = [0.3, −0.7, 1.2, −0.4, 0.8] (random Gaussian). Energy: E = 12.5.
Gradient: ∇y E = [0.1, −0.3, 0.6, −0.9, 0.5]. The model "knows" it should push probability toward index 3 (runs) — that dimension has the most negative gradient.
Update (alpha=0.5): ŷ1 = [0.3 − 0.05, −0.7 + 0.15, 1.2 − 0.3, −0.4 + 0.45, 0.8 − 0.25] = [0.25, −0.55, 0.9, 0.05, 0.55]
Step 1 energy: E = 6.8. Better, but still high.
Step 2: Another gradient step. ŷ2 = [0.1, −0.3, 0.4, 0.8, 0.2]. Energy: 2.1. Now "runs" has the highest logit.
Loss: Cross-entropy between softmax(ŷ2) ≈ [0.16, 0.11, 0.22, 0.33, 0.18] and one-hot [0, 0, 0, 1, 0]. Loss = −log(0.33) ≈ 1.11.
Backprop: This loss gradient flows backward THROUGH both gradient descent steps AND the energy function to update θ. The model learns to shape Eθ so that future gradient descent starting from random noise will converge more quickly to "runs".
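The first update in the trace above can be reproduced in a few lines of plain Python. The vectors are the ones quoted in Step 0, and the helper name is just for illustration.

```python
def gradient_step(y, grad, alpha):
    """One gradient descent step on the prediction: y - alpha * grad."""
    return [yi - alpha * gi for yi, gi in zip(y, grad)]


y0 = [0.3, -0.7, 1.2, -0.4, 0.8]      # random initial logits from Step 0
grad = [0.1, -0.3, 0.6, -0.9, 0.5]    # gradient of the energy at y0
y1 = gradient_step(y0, grad, alpha=0.5)

# Round for display; the raw floats carry tiny binary rounding errors.
rounded = [round(v, 2) for v in y1]
print(rounded)

# The most negative gradient entry marks the logit that rises the most.
print(grad.index(min(grad)))
```

The index with the most negative gradient is 3, the position of "runs", which is why that logit moves up while the others move down or barely change.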
The paper discovers an important distinction between two training configurations:
S1 (System 1) models: Optimized for stability and learning convergence. The gradients of predictions are DETACHED between optimization steps — backpropagation doesn't flow through the entire chain. Simpler, more stable, but weaker thinking capabilities.
S2 (System 2) models: Full backpropagation through all optimization steps (no detaching). Include all energy landscape regularization techniques (next chapter). More expensive, but enable genuine System 2 Thinking.
| Property | S1 Models | S2 Models |
|---|---|---|
| Gradient flow | Detached between steps | Full backprop through all steps |
| Replay buffer | No | Yes |
| Langevin dynamics | No | Yes |
| Random step size | No | Yes |
| Training stability | Higher | Lower (more careful tuning) |
| Learning scaling rate | Baseline | ~3.3% faster |
| Thinking ability | Limited | Strong (up to 29% improvement) |
The paper finds that S1 and S2 models have similar scaling rates — the S2 scaling curve is just shifted up (higher initial loss but same slope). This means you can choose S1 for pure pretraining and switch to S2 when you need thinking capabilities, without losing scaling behavior.
The simulation above lets you watch EBT training in action. The energy landscape (blue curve) starts flat and gradually develops a minimum at the ground truth location (red star). Each training iteration updates θ so that the landscape's gradient flow points more strongly toward the target. Toggle between S1 and S2 modes to see the difference in landscape regularization.
The optimization step size α has a major impact on training. Too large: predictions overshoot and oscillate. Too small: convergence is slow and the model wastes optimization steps. The paper makes α learnable — it's a trainable parameter that the model optimizes alongside its weights. For S2 models, the step size is multiplied by 1500 relative to S1, which the paper finds is necessary for the full backpropagation to produce useful gradients.
Training an EBM to have well-shaped energy landscapes in high-dimensional space is like trying to sculpt a mountain range where every valley leads to the correct answer. It's hard. The paper identifies three regularization techniques that are critical for enabling System 2 Thinking. Each one addresses a specific failure mode.
Without regularization, the energy landscape tends to be well-shaped only in a narrow region around the ground truth. Start gradient descent from far away, and you hit flat regions where gradients vanish — the optimization gets stuck. This means thinking for more steps doesn't help.
Solution: Replay Buffer. Instead of always initializing predictions from random noise N(0, I), maintain a buffer of partially-optimized predictions from previous training steps. Occasionally initialize FROM THESE rather than from scratch. This forces the model to learn landscapes that are well-shaped even far from the optimum, because the training loss depends on successfully navigating from these diverse starting points.
```python
import random
import torch

# Replay buffer implementation
replay_buffer = []

def get_initial_prediction(y_star, buffer, p_replay=0.5):
    if buffer and random.random() < p_replay:
        # Start from a previous partial optimization
        y_0 = buffer.pop(random.randrange(len(buffer)))
    else:
        # Start from random noise
        y_0 = torch.randn_like(y_star)
    return y_0

def store_prediction(buffer, y_hat_M):
    # After optimization, store the result (detached from the graph)
    buffer.append(y_hat_M.detach())
```
Even with a replay buffer, gradient descent is deterministic — from a given starting point, it always follows the same path. If there's a local minimum between the starting point and the global minimum, the optimization gets trapped.
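Solution: Langevin Dynamics. Add noise to each gradient step:
$$\hat{y}_{i+1} = \hat{y}_i - \alpha \nabla_{\hat{y}} E_\theta(x, \hat{y}_i) + \eta_i, \qquad \eta_i \sim \mathcal{N}(0, \sigma^2 I)$$
where σ sets the noise magnitude.
This is the same idea as simulated annealing or stochastic gradient descent: the noise lets the optimization escape local minima by occasionally going "uphill." There is no finite-step guarantee of reaching the global minimum, but with a suitably chosen (and typically decaying) noise level the iterates spend most of their time in low-energy regions.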
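The effect of the noise term is easiest to see on a one-dimensional double-well energy. The sketch below is a toy under stated assumptions: the energy function, step size, and noise scale `sigma` are invented, and the noise is added as `sigma * N(0, 1)` per step rather than with any schedule from the paper.

```python
import random


def energy(y):
    """Double well: a shallow minimum near y = +0.96, a deeper one near y = -1.04."""
    return (y * y - 1.0) ** 2 + 0.3 * y


def energy_grad(y):
    return 4.0 * y * (y * y - 1.0) + 0.3


def langevin_descent(y0, alpha=0.05, sigma=0.0, steps=200, seed=0):
    """Gradient descent on y with Gaussian noise of scale sigma added each step."""
    rng = random.Random(seed)
    y = y0
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        y = y - alpha * energy_grad(y) + sigma * noise
    return y


# With sigma = 0 this is plain gradient descent and settles in the nearest well.
y_plain = langevin_descent(1.5, sigma=0.0)
# With sigma > 0 the trajectory is random; where it ends depends on the seed.
y_noisy = langevin_descent(1.5, sigma=0.3)
print(round(y_plain, 3), round(energy(y_plain), 3), round(y_noisy, 3))
```

Starting at 1.5, the noise-free run stops in the shallow right-hand well even though the left-hand well has lower energy. Trying several seeds with `sigma > 0` shows how often the noisy version crosses over.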
With a fixed step size α, gradient descent always takes the same-size steps. This means different training examples always explore the landscape at the same resolution, potentially missing important features.
Solution: Random Step Size and Random Number of Steps. Instead of using a fixed α and M, sample both at random during training.
This creates diversity in the optimization trajectories during training. Some trajectories take many small steps (fine exploration). Others take few large steps (coarse exploration). The model must learn a landscape that works well for ALL these trajectories, making it more robust at inference time.
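One way to picture this randomization is as a small sampler that draws a fresh step size and step count for every training example. The ranges below (`alpha_jitter`, `min_steps`, `max_steps`) are hypothetical placeholders, not the paper's settings.

```python
import random


def sample_trajectory_config(rng, base_alpha=0.1, alpha_jitter=0.5,
                             min_steps=1, max_steps=4):
    """Draw (alpha, num_steps) for one example; all ranges are illustrative."""
    # Scale the base step size by a random factor in [1 - jitter, 1 + jitter].
    alpha = base_alpha * rng.uniform(1.0 - alpha_jitter, 1.0 + alpha_jitter)
    # Pick the number of optimization steps uniformly, endpoints included.
    num_steps = rng.randint(min_steps, max_steps)
    return alpha, num_steps


rng = random.Random(0)
configs = [sample_trajectory_config(rng) for _ in range(1000)]
print(configs[:3])
```

Each training example then runs its inner loop with its own `(alpha, num_steps)` pair, so no single step size or trajectory length can be overfit.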
The paper provides an ablation study on the BigBench Dyck Languages benchmark (an out-of-distribution reasoning task). Results with "Thinking Longer" (more optimization steps) and "Self-Verification" (best-of-N with energy):
| Configuration | Thinking Longer (%↑) | Thinking + Self-Verification (%↑) |
|---|---|---|
| No Random Step Size | −1.47 | 0.19 |
| No Random Num Steps | 0.00 | 9.65 |
| No Langevin Dynamics | 17.2 | 17.0 |
| No Replay Buffer | 14.8 | 17.8 |
| Full S2 Config | 7.19 | 18.7 |
Note the first row: without random step size, thinking longer actually HURTS performance (−1.47%). This means the model learned a landscape that only works for a specific step size. Randomization during training is critical for robust thinking at inference time.
This is the SHOWCASE simulation. The energy landscape is shown with three regularization controls: Replay Buffer (toggles starting points), Langevin Dynamics (adds noise to steps), and Random Step Size (varies step magnitude). Start by running gradient descent with no regularization — watch how the prediction gets stuck in a local minimum. Then enable each technique and see how the optimization behavior changes. With all three enabled, the prediction reliably reaches the global minimum.
We've built the architecture. We've trained it. Now let's see how EBTs actually "think" at inference time. This is where the payoff lives.
The paper explores two complementary approaches to inference-time thinking:
Strategy 1: Thinking Longer. Run more optimization steps on a single prediction. Each step refines the prediction by following the energy gradient. More steps = more refinement = better prediction. This is like a student spending more time checking and revising a single answer.
Strategy 2: Self-Verification (Best-of-N). Generate N independent predictions (each starting from different random initializations), optimize each for M steps, then select the prediction with the LOWEST energy. This is like a student writing N different answers and submitting the one they're most confident about.
```python
import torch

# Algorithm 2: Inference with Verification
def infer(x, E_theta, alpha, M, N, prediction_shape):
    best_y, best_energy = None, float("inf")
    for j in range(N):  # N independent samples
        y_hat = torch.randn(prediction_shape)  # fresh random start
        for i in range(M):  # M optimization steps
            # Detach so each step differentiates only the current energy
            y_hat = y_hat.detach().requires_grad_(True)
            energy = E_theta(x, y_hat).sum()
            grad_y = torch.autograd.grad(energy, y_hat)[0]
            y_hat = y_hat - alpha * grad_y
        # Self-verification: keep the lowest energy prediction
        with torch.no_grad():
            final_energy = E_theta(x, y_hat).sum().item()
        if final_energy < best_energy:
            best_energy = final_energy
            best_y = y_hat.detach()
    return best_y
```
The key insight: the energy scalar serves DOUBLE DUTY. During optimization, it provides the gradient signal. After optimization, it provides the verification signal. No external verifier needed. No reward model. No fine-tuning. The model is its own critic.
The paper tests thinking longer on four Out-of-Distribution (OOD) benchmarks. Standard Transformer++ models cannot benefit from more forward passes — they make the same prediction every time (deterministic feed-forward). EBTs show up to 29% improvement with more thinking steps:
The visualization above shows the "Thinking Longer" experiment. The orange line (Transformer++) is flat — no improvement from additional forward passes, because the model can't revise its predictions. The blue curve (EBT) improves steadily as the number of forward passes increases. The X-axis is the number of forward passes (each requiring a full forward + backward through the transformer). The Y-axis is perplexity decrease on OOD tasks (lower is better).
Traditional Best-of-N (BoN) sampling with language models requires a SEPARATE verifier or reward model. EBTs have one built in: the energy the model assigns to each candidate.
This scales with training data: at small scale (5B tokens trained), BoN-10 barely improves over BoN-2 (and sometimes hurts due to adversarial low-energy samples). At larger scale (30B tokens), BoN-10 gives significant gains, because the energy landscape becomes smoother and more reliable.
An unexpected benefit: per-token energy values reveal which tokens the model is uncertain about. Easy tokens (e.g., "the", "is", "a") converge to low energy quickly. Hard tokens (e.g., "brown", "research", "problem") maintain higher energy across steps.
The paper visualizes this with energy heatmaps across tokens and thinking steps. Common, predictable tokens show green (low energy) after 1-2 steps. Rare, context-dependent tokens remain yellow-red (high energy) even after 10 steps. This happens without any explicit uncertainty training — it emerges naturally from the energy formulation.
Perhaps the paper's most profound finding: EBTs generalize better than Transformer++ even WITHOUT thinking at inference time. Despite having slightly worse pretraining perplexity (33.43 vs 31.36), EBTs achieve lower downstream perplexity on 3 out of 4 benchmarks (EBT vs Transformer++, lower is better: GSM8K 43.3 vs 49.6, SQuAD 53.1 vs 52.3, BB Math QA 72.6 vs 79.8, BB Dyck 125.3 vs 131.5). SQuAD is the one exception. With thinking, the gap widens further.
Why? The paper hypothesizes it's because verification generalizes better than generation. A verifier trained on in-distribution data can still correctly assess predictions on OOD data, because checking correctness is often independent of how the data was generated.
Scaling laws are among the most useful predictors of an architecture's future potential. If a model scales well at small sizes, that is evidence (though not a guarantee) that it will stay competitive at larger ones. The EBT paper's most compelling quantitative contribution is demonstrating that EBTs scale FASTER than Transformer++ on every axis it measures.
Following the Chinchilla framework, the paper models loss as a power law:
$$L(C) = \beta\, C^{-\alpha} + E$$
where L is the loss, C is the compute budget (FLOPs), β is a constant, α is the scaling exponent (higher = faster improvement), and E is the irreducible entropy of the data.
In log-log space, subtracting E, this becomes a line:
$$\log(L - E) = \log \beta - \alpha \log C$$
The slope α is what matters. A steeper slope means the model extracts more performance from each additional unit of compute. The paper compares the slopes of EBTs vs. Transformer++ across six independent axes.
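Comparing slopes comes down to a straight-line fit in log-log space. The sketch below generates synthetic losses from a power law of the form loss = β · C^(−α) + E and recovers α with ordinary least squares. The constants are made up, and the irreducible term E is assumed known, which is a simplification.

```python
import math


def fit_scaling_exponent(compute, loss, irreducible):
    """Least-squares line through (log C, log(loss - E)); returns (alpha, beta)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(value - irreducible) for value in loss]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    # The fitted line is log(beta) - alpha * log(C), so alpha is minus the slope.
    return -slope, math.exp(intercept)


# Synthetic data from a known power law (all constants are illustrative).
true_alpha, true_beta, irreducible = 0.05, 20.0, 1.7
compute = [1e15, 1e16, 1e17, 1e18, 1e19]
loss = [true_beta * c ** (-true_alpha) + irreducible for c in compute]

alpha_hat, beta_hat = fit_scaling_exponent(compute, loss, irreducible)
print(round(alpha_hat, 4), round(beta_hat, 2))
```

Running the same fit on two architectures and comparing the recovered exponents is what a "scales X% faster" statement summarizes.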
The paper conducts scaling experiments on RedPajama V2 text (66M training, 33K validation samples) with the GPT-NeoX tokenizer (50,277 vocab). Results across all six axes:
| Scaling Axis | EBT Faster By | How Measured |
|---|---|---|
| Data (# tokens) | 35.98% | Validation perplexity vs. training tokens (1B-30B) |
| Batch size | 28.66% | Val PPL vs. batch size (4K-48K tokens) |
| Depth (# layers) | 3.29% | Val PPL vs. transformer depth (2-14 blocks) |
| Parameters (non-embed) | 8.97% | Val PPL vs. total non-embedding parameters (6M-396M) |
| FLOPs | 8.97% | Val PPL vs. training FLOPs |
| Width (embed dim) | 0.62% | Val PPL vs. embedding dimension (384-2048) |
Note that the width scaling advantage is only 0.62%, nearly identical. This makes sense: width primarily affects the model's representational capacity, which is largely orthogonal to the energy-based training paradigm. In the table above, the big wins are in data (35.98%) and batch size (28.66%), with smaller gains for parameters, FLOPs, and depth. One plausible reading is that more data and larger batches give the model more signal for shaping its energy landscape.
The interactive chart above shows the scaling laws for EBTs (blue) vs. Transformer++ (orange) across the six axes. Select different axes using the buttons. Both curves follow power laws in log-log space, but the EBT line has a consistently steeper slope. The gap between the lines grows with scale — larger models show bigger advantages for EBTs.
Beyond learning scalability, the paper also measures thinking scalability — how much improvement thinking gives as a function of model scale.
The key metric is System Two Thinking (STT):
where P(x, θ, F) is the performance with F function evaluations (forward passes), and F0 is the minimum number of evaluations. STT measures the percentage improvement from thinking.
Results show:
The paper also tests on continuous modalities using Something-Something V2 (video prediction). Scaling results:
These large advantages in continuous modalities may be because EBTs naturally model continuous distributions through their energy landscape, while standard transformers must discretize continuous data (e.g., via Vector Quantization) or use proxy objectives (e.g., MSE loss), losing information in the process.
The paper presents results across three domains: autoregressive language modeling, bidirectional image denoising, and data-constrained Sudoku reasoning. Let's examine each with the critical eye of a researcher.
All language models are pretrained on RedPajama V2 using GPT-NeoX tokenizer. Downstream evaluation uses four benchmarks of increasing difficulty:
| Benchmark | Task | Transformer++ (PPL↓) | EBT (PPL↓) | Winner |
|---|---|---|---|---|
| GSM8K | Math word problems | 31.36 | 33.43 | Transformer++ (pretraining) |
| GSM8K | Math word problems | 49.6 | 43.3 | EBT (downstream) |
| SQuAD | Reading comprehension | 52.3 | 53.1 | Transformer++ (marginal) |
| BB Math QA | Math reasoning | 79.8 | 72.6 | EBT |
| BB Dyck | Bracket matching (OOD) | 131.5 | 125.3 | EBT |
For continuous modalities, the paper trains bidirectional EBTs on COCO 2014 (128×128 images) using the SD-XL VAE for latent encoding. The comparison is against Diffusion Transformers (DiTs):
| Metric | DiT | EBT | EBT Advantage |
|---|---|---|---|
| In-Dist PSNR ↑ | 26.58 | 27.25 | +0.67 dB |
| In-Dist MSE Pixel ↓ | 142.98 | 122.55 | −14.3% |
| OOD PSNR ↑ | 19.56 | 23.29 | +3.73 dB |
| OOD MSE Pixel ↓ | 718.7 | 305.2 | −57.5% |
| ImageNet Top-1 Acc ↑ | 0.31% | 5.32% | ~17× |
| ImageNet Top-5 Acc ↑ | 1.36% | 13.2% | ~10× |
| Forward passes needed | 100-300 | 1-3 | 99% fewer |
The image results are striking in three ways:
1. Better denoising with 99% fewer passes. EBTs need 1-3 forward passes for comparable-or-better results vs. DiTs' 100-300 denoising steps. This is because EBTs directly minimize energy (a scalar), while DiTs predict noise at each step (a high-dimensional output).
2. Much better OOD denoising. The OOD PSNR gap (+3.73 dB) corresponds to less than half the pixel MSE (305.2 vs. 718.7). This is consistent with the verification generalization hypothesis from the language experiments.
3. 10-17× better image classification. Linear probe classification accuracy on ImageNet-1K shows EBTs learn dramatically more useful representations. DiTs learn to predict noise; EBTs learn to understand images.
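The PSNR and pixel-MSE rows in the table above are two views of the same quantity. Under the standard definition PSNR = 10 · log10(MAX² / MSE), and assuming an 8-bit pixel range (MAX = 255, which is my assumption rather than something stated in the table), the MSE values reproduce the PSNR values to within rounding:

```python
import math

def psnr_from_mse(mse, max_value=255.0):
    """Standard PSNR in dB for a given mean squared error and peak pixel value."""
    return 10.0 * math.log10(max_value ** 2 / mse)

# Pixel MSE values copied from the table above.
table_mse = {
    "DiT in-dist": 142.98,
    "EBT in-dist": 122.55,
    "DiT OOD": 718.7,
    "EBT OOD": 305.2,
}

psnr = {name: psnr_from_mse(mse) for name, mse in table_mse.items()}
for name, value in psnr.items():
    print(f"{name}: {value:.2f} dB")
```

Because PSNR is logarithmic, the modest-looking 3.73 dB OOD gap and the 57.5% MSE reduction describe the same improvement.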
To test generalization in data-limited settings, the paper follows Du et al. (2024) in training models on Sudoku from the SAT-Net / RRN datasets. Given a partially filled board, where the given cells hold digits 1-9, the model must predict the complete solution.
| Architecture | Test Accuracy |
|---|---|
| Feed-Forward Transformer | 0.03% |
| RNN | 17.7% |
| EBT | 29.7% |
The feed-forward transformer essentially fails (0.03%): it can't do multi-step constraint reasoning in a single pass. The RNN does better by iterating its hidden state, but still struggles. The EBT, by optimizing its prediction to satisfy all Sudoku constraints simultaneously (each constraint contributing to the energy), achieves 29.7%. In relative terms that is about 68% better than the RNN (29.7 / 17.7 ≈ 1.68) and 990× better than the transformer.
The paper is commendably honest about limitations:
The simulation above compares EBT vs. DiT on image denoising. On the left, a DiT progressively removes noise over many steps. On the right, an EBT minimizes energy in just a few steps, converging to a cleaner result. Adjust the "noise level" slider to see how both methods handle different corruption levels. Notice how the gap widens for higher noise (more OOD) — EBTs degrade more gracefully.
| Equation | What It Does | Symbols |
|---|---|---|
| $E_\theta(x, y) \to \mathbb{R}$ | Energy function: assigns compatibility score | $x$ = context, $y$ = candidate prediction, $\theta$ = model params |
| $\hat{y}_{i+1} = \hat{y}_i - \alpha \nabla_y E_\theta(x, \hat{y}_i)$ | Gradient descent on predictions (thinking step) | $\alpha$ = step size, $\nabla_y$ = gradient w.r.t. prediction |
| $\hat{y}_{i+1} = \hat{y}_i - \alpha \nabla_y E_\theta(x, \hat{y}_i) + \eta,\ \eta \sim \mathcal{N}(0, \sigma)$ | Langevin dynamics (thinking with exploration) | $\sigma$ = noise magnitude, $\eta$ = exploration noise |
| $p_\theta(y \mid x) = e^{-E_\theta(x, y)} / Z(\theta)$ | Boltzmann distribution (theoretical, not computed) | $Z(\theta)$ = partition function (intractable) |
| $L(C) = \beta C^{-\alpha} + E$ | Scaling law: loss vs. compute | $\alpha$ = scaling exponent, $C$ = FLOPs, $E$ = irreducible entropy |
| $\mathrm{STT}(x, \theta, F) = \mathbb{E}_x[P(x, \theta, F) / P(x, \theta, F_0) - 1]$ | System 2 Thinking metric | $F$ = forward passes, $F_0$ = minimum forward passes |
| $\mathrm{FLOPs}_{\mathrm{EBT}} \approx M \times 10N \times 2$ | Per-step compute cost (M optimization steps) | $N$ = non-embedding params, $M$ = optimization steps |
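The two update rules in the table can be sketched on a toy problem. The snippet below uses a hand-written quadratic energy, E(x, y) = ½‖y − Wx‖², purely as a stand-in: in an actual EBT the energy is a learned transformer and the gradient comes from automatic differentiation. The names `energy`, `energy_grad_y`, `think`, and the matrix `W` are all hypothetical.

```python
import numpy as np

def energy(x, y, W):
    """Toy energy: low when the candidate y is close to W @ x."""
    residual = y - W @ x
    return 0.5 * float(residual @ residual)

def energy_grad_y(x, y, W):
    """Analytic gradient of the toy energy with respect to the prediction y."""
    return y - W @ x

def think(x, y0, W, step_size=0.5, num_steps=5, noise_std=0.0, rng=None):
    """Refine y by gradient descent on the energy; noise_std > 0 gives the Langevin variant."""
    if rng is None:
        rng = np.random.default_rng(0)
    y = y0.copy()
    energies = [energy(x, y, W)]
    for _ in range(num_steps):
        noise = noise_std * rng.standard_normal(y.shape)       # eta ~ N(0, sigma)
        y = y - step_size * energy_grad_y(x, y, W) + noise     # one thinking step
        energies.append(energy(x, y, W))
    return y, energies

W = np.array([[1.0, 2.0], [0.0, 1.0]])
x = np.array([1.0, -1.0])          # context
y0 = np.zeros(2)                   # initial guess for the prediction

y_final, energies = think(x, y0, W)
print("energy per step:", [round(e, 4) for e in energies])
```

The recorded energies double as a progress signal: each thinking step lowers the scalar, and the caller can decide how many steps a given prediction deserves.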
EBTs and diffusion models are closely related — the paper argues that diffusion models are a special case of (implicit) EBMs. The key differences:
| Property | Diffusion Models | EBTs |
|---|---|---|
| Supervision | At every timestep (noise prediction) | Only at the end (final prediction vs. target) |
| Update rule | Predict noise, follow denoising schedule | Gradient descent on energy (free-form) |
| Verification | Implicit (no energy scalar) | Explicit (energy scalar) |
| # Steps | Fixed schedule (100-1000) | Dynamic (1-N, any number) |
| Self-verification | Requires external model | Built-in (compare energies) |
| Discrete data | Requires discretization tricks | Works natively |
| Uncertainty | Not directly modeled | Energy = uncertainty |
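The "dynamic steps" and "built-in self-verification" rows can be illustrated with a small sketch. It refines several candidate predictions on a toy double-well energy, stops each one when the energy stops improving, and keeps the candidate with the lowest final energy. The energy function here is invented for illustration and has nothing to do with a trained EBT; `refine` and its parameters are hypothetical.

```python
import numpy as np

def energy(x, y):
    """Toy non-convex energy with two local minima; the context x tilts the landscape."""
    return (y ** 2 - 1.0) ** 2 + x * y

def energy_grad_y(x, y):
    """Analytic gradient of the toy energy with respect to y."""
    return 4.0 * y * (y ** 2 - 1.0) + x

def refine(x, y, step_size=0.05, max_steps=200, tol=1e-8):
    """Descend until the energy stops improving; return (y, energy, steps used)."""
    current = energy(x, y)
    for step in range(1, max_steps + 1):
        y = y - step_size * energy_grad_y(x, y)
        new = energy(x, y)
        if current - new < tol:      # converged: stop spending compute
            return y, new, step
        current = new
    return y, current, max_steps

x = 0.3
starts = np.linspace(-2.0, 2.0, 8)               # eight initial guesses
candidates = [refine(x, y0) for y0 in starts]

# Self-verification: the model's own energy ranks the candidates.
best_y, best_energy, best_steps = min(candidates, key=lambda c: c[1])
print(f"best y = {best_y:.3f}, energy = {best_energy:.3f}, steps = {best_steps}")
```

Candidates that start in the wrong basin converge to the shallower minimum, and comparing final energies is enough to discard them without any external verifier.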
The connection is deepened by the paper's observation that both diffusion models and EBMs work with the gradient of the log data density. Diffusion models learn $\nabla_{x_t} \log p(x_t \mid x_0)$ (score function). EBMs learn an energy whose gradient $\nabla_y E_\theta(x, y)$ plays the same role. Both use these gradients to iteratively refine predictions.
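The link between the two gradients follows in one line from the Boltzmann form in the equation table. Keeping the article's notation, and noting only that the partition function does not depend on y:

```latex
\begin{aligned}
\log p_\theta(y \mid x) &= -E_\theta(x, y) - \log Z(\theta) \\
\nabla_y \log p_\theta(y \mid x) &= -\nabla_y E_\theta(x, y)
\end{aligned}
```

So stepping along the negative energy gradient is the same as stepping along the score of the model's distribution over predictions, and the intractable Z(θ) never has to be computed.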
The paper makes a fascinating connection to the "Reversal Curse" in LLMs — the phenomenon where models trained on "A is B" fail to learn "B is A." In standard transformers, only A's tokens receive gradient updates during the prediction of B (because B is in the output space). In EBTs, BOTH A and B are in the input space, so both receive gradient updates. This could fundamentally resolve the asymmetric learning problem.
| Work | Relation to EBTs |
|---|---|
| Hopfield Networks (1982) | Energy-based model for associative memory; EBTs scale this concept with transformers |
| GANs (2014) | Separate generator+discriminator; EBTs unify them in one model |
| Du & Mordatch (2019) | Pioneered optimization-based EBM training; EBTs scale it to transformers |
| DiT (2023) | Transformer backbone for diffusion; EBT bidirectional variant builds on DiT |
| o1 / DeepSeek-R1 | System 2 via RL + verifiers; EBTs achieve it via energy minimization alone |
| LeCun's JEPA (2022) | Energy-based architecture for autonomous intelligence; EBTs are a concrete realization |
| Scaling Laws (Kaplan 2020, Chinchilla 2022) | EBTs follow and improve upon established scaling law frameworks |
EBTs represent a fundamental rethinking of how neural networks should work. Instead of the feedforward paradigm (input → one pass → output), EBTs propose an optimization paradigm (input + guess → iterate until convergence → verified output). This aligns with how humans actually think: we don't arrive at answers in one step. We hypothesize, check, revise, and iterate.
The paper demonstrates that this paradigm is not just theoretically appealing but practically competitive — scaling faster than the dominant Transformer++ approach while adding capabilities (verification, dynamic computation, uncertainty estimation) that current models fundamentally lack.