Harvard + Stanford, 2026 — arXiv 2602.04215

OAT: Ordered Action Tokenization

A learned tokenizer that converts 32-step robot action chunks into just 8 ordered discrete tokens — where the first token alone gives you a valid (coarse) action, and each additional token sharpens the motion like progressive JPEG.

Prerequisites: Transformers + Basic robotics / VLAs. That's it.
10 Chapters · 7+ Simulations · 3 Desiderata

Chapter 0: The Problem

A robot arm picks up objects by moving 7 joints through space — each joint has a continuous angle that changes 50 times per second. A 32-timestep action chunk (about 0.6 seconds of motion) is therefore a matrix of 32 × 7 = 224 continuous floating-point numbers.

Modern robot policies built on language models need to output these actions as discrete tokens — the same way a language model outputs words. But there is no "word vocabulary" for motor commands. You have to build one. And that choice turns out to matter enormously.

The naive approach: binning

The obvious solution is binning: divide each continuous dimension into N discrete levels (say 256 bins), then represent each of the 224 values as one token. A single action chunk becomes 224 tokens. The transformer must predict them one at a time.
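As a rough illustration of binning, the sketch below maps each value of a 32 × 7 chunk to one of 256 uniform bins and back. The normalized range [-1, 1] and the function names `bin_encode` and `bin_decode` are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def bin_encode(chunk, low=-1.0, high=1.0, n_bins=256):
    """Map each continuous value to an integer bin index in [0, n_bins - 1]."""
    clipped = np.clip(chunk, low, high)
    scaled = (clipped - low) / (high - low)            # values in [0, 1]
    ids = np.minimum((scaled * n_bins).astype(np.int64), n_bins - 1)
    return ids.reshape(-1)                             # one token per value

def bin_decode(tokens, shape, low=-1.0, high=1.0, n_bins=256):
    """Map bin indices back to bin centers and restore the chunk shape."""
    centers = low + (tokens + 0.5) * (high - low) / n_bins
    return centers.reshape(shape)

rng = np.random.default_rng(0)
chunk = rng.uniform(-1.0, 1.0, size=(32, 7))   # 32 timesteps, 7 joints (assumed normalized)
tokens = bin_encode(chunk)
recon = bin_decode(tokens, chunk.shape)
print(len(tokens))                              # 224
```

Every token sequence decodes to some chunk, and the round-trip error is at most half a bin width, but the sequence length is fixed at T × D.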

224 tokens per action chunk is a disaster in practice. Language models predict one token per forward pass. If each forward pass takes ~10ms on modern hardware, generating one action chunk takes 224 × 10ms = 2.24 seconds. But the robot needs a new action every 20ms (50 Hz). The math doesn't work.
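The arithmetic above can be checked in a few lines. The ~10ms per forward pass is the lesson's illustrative figure, and the helper names are hypothetical.

```python
def binning_token_count(horizon, dims):
    """One token per timestep per action dimension."""
    return horizon * dims

def sequential_latency_ms(n_tokens, ms_per_token):
    """Latency when every token needs its own forward pass."""
    return n_tokens * ms_per_token

n_tokens = binning_token_count(32, 7)
latency_ms = sequential_latency_ms(n_tokens, 10.0)
control_period_ms = 1000.0 / 50          # 50 Hz control
print(n_tokens, latency_ms, control_period_ms)   # 224 2240.0 20.0
```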

Why compression matters

The solution is compression: represent the action chunk as far fewer tokens. If we can express 32 × 7 = 224 values with just 8 tokens, generating a chunk takes 8 forward passes, or 8 × 10ms = 80ms. That is still slow, but KV caching and parallel decoding bring it to ~27ms, making real-time control feasible.

But compression is not the only challenge. The way you design the token vocabulary shapes whether the resulting policy is safe, useful, and efficient. This lesson is about designing that vocabulary correctly.

The core tension: More tokens = higher fidelity but slower inference. Fewer tokens = faster but coarser actions. OAT resolves this by making every prefix of its 8-token sequence decode to a valid action, so you can stop generating early and still execute safely.

[Interactive figure: Token Count Explosion with Binning. Sliders set the horizon T (default 32) and dimensions D (default 7); OAT stays flat at 8 tokens regardless of horizon or dimensions.]

What "50 Hz control" actually demands

A robot arm needs a new joint-angle command every 20ms. That's 7 floats every 20ms. Modern autoregressive policies output actions in chunks — predict 32 steps at once, execute them, then predict the next chunk. This gives you 32 × 20ms = 640ms to generate the next chunk.

With 8 tokens and ~3ms per token (KV cache, batched decode): 8 × 3ms = 24ms. That's well within the 640ms budget, and even within the 20ms budget if you use just 1 token. This is the "anytime" property OAT is designed to enable.

Why is the naive binning approach (one token per action dimension per timestep) impractical for real-time robot control?

Chapter 1: Three Desiderata

OAT defines three properties that a good action tokenizer should have. The paper calls them P.1, P.2, and P.3. Most existing methods satisfy only one or two. OAT satisfies all three.

P.1 — High Compression

A 32 × 7 action chunk contains 224 continuous values. An ideal tokenizer should represent this with far fewer discrete tokens — say 8. This is roughly a 28× compression ratio. High compression is necessary for fast autoregressive generation.

P.2 — Total Decodability

Every possible sequence of tokens — including sequences the policy has never generated — must decode to a valid, executable action. This sounds obvious, but it's a subtle and critical requirement.

FAST (Frequency-space Action Sequence Tokenization) violates this property. FAST transforms each action chunk with a discrete cosine transform, quantizes the coefficients, and compresses the flattened result with BPE (Byte-Pair Encoding). BPE produces variable-length token sequences with complex internal structure. An arbitrary sequence of FAST tokens may decode to a partial, malformed action chunk, causing the robot controller to crash at runtime. This isn't theoretical: FAST fails on real hardware.

Total decodability means: any K tokens you decode give you some complete action. No partial parses, no malformed outputs, no runtime crashes.

P.3 — Causal Ordering

In autoregressive generation, the model generates token T1 first, then T2 conditioned on T1, and so on. If tokens are unordered — like in QueST, which uses an unstructured latent bottleneck — then the first token contains no more information than the last. You can't stop early and get a useful action.

Causal ordering means: token T1 encodes the most important (coarse) information. T2 adds the next layer of detail. T8 adds the final fine-grained correction. This is aligned with next-token prediction: the model can condition each new token on everything it already knows.

The analogy: Causal ordering is like progressive JPEG. The first scan of a JPEG gives you a blurry but recognizable image. Each additional scan sharpens it. If you stop early, you still have a recognizable image. If an action tokenizer has causal ordering, stopping early still gives you a valid, useful action.

[Interactive figure: Which Desiderata Does Each Method Satisfy?]

Why all three together matter

P.1 alone (compression) lets you be fast, but if tokens are unordered you can't use partial sequences. P.2 alone (total decodability) makes you safe, but naive binning decodes every sequence yet takes 224 tokens. P.3 alone (ordering) lets you stop early, but only if the partial output decodes to something valid (needs P.2) and only if it's compact enough (needs P.1).

The three properties are interdependent. Only OAT achieves all three.

FAST (Frequency-space Action Sequence Tokenization) satisfies P.1 (compression) and P.3 (ordering) but fails P.2. What does this mean in practice?

Chapter 2: The Tokenizer Architecture

OAT is a learned encoder-decoder. You train it once on robot demonstration data (offline), then use it to tokenize actions for any downstream policy. The tokenizer itself is not the policy — it's a reusable component that any autoregressive policy can plug in.

The bottleneck idea: register tokens

Here is the key architectural trick. The encoder receives two kinds of input: the action tokens (the 32 × 7 action chunk, projected to embeddings) AND a fixed set of K learnable register tokens. These registers are randomly initialized vectors that are learned through gradient descent during training.

After the transformer encoder runs, we throw away the action token outputs and keep only the K register outputs. The registers act as a bottleneck: they must compress the entire action chunk's information into K vectors of dimension d.

Why registers as the bottleneck? Because register tokens can be quantized independently. Each register maps to one discrete token. The K registers become K tokens. You never need to quantize the action tokens directly — they're just input context.

The full data flow

Input: Action chunk a_{1:32} ∈ ℝ^{32×7}, projected to embeddings ∈ ℝ^{32×256}.
Concatenate: Action embeddings + K learnable register tokens r_{1:K} ∈ ℝ^{K×256}. Concatenated shape: (32+K) × 256.
Transformer Encoder: Self-attention across all (32+K) tokens. Causal mask on registers (see Ch 5). Action tokens attend freely.
Keep registers only: Discard action token outputs. Keep z_{1:K} ∈ ℝ^{K×4} (projected to 4 dims).
FSQ Quantization: z_{1:K} → T_{1:K} ∈ {0,...,999}^K. Each register becomes one discrete token.
Decoder: Transformer decoder maps T_{1:K} (+ optional MASK tokens for dropped tails) → reconstructed â_{1:32}.
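The shapes in this flow can be traced with a small NumPy sketch. Everything here is a stand-in: the weights are random rather than trained, the "encoder" is a single parameter-free and unmasked attention layer, and the names `W_in`, `W_out`, and `self_attention` are hypothetical. It only shows how the register bottleneck turns a 32 × 7 chunk into K × 4 latents.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, K, d_model, d_fsq = 32, 7, 8, 256, 4

chunk = rng.normal(size=(T, D))                     # action chunk a_{1:32}
W_in = rng.normal(size=(D, d_model)) * 0.1          # action projection (random stand-in)
registers = rng.normal(size=(K, d_model)) * 0.1     # learnable in a real model
W_out = rng.normal(size=(d_model, d_fsq)) * 0.1     # projection to the FSQ dims

def self_attention(x):
    """One unmasked, parameter-free attention layer as a stand-in encoder."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

action_emb = chunk @ W_in                           # (32, 256)
seq = np.concatenate([action_emb, registers])       # (40, 256)
encoded = self_attention(seq)                       # (40, 256)
z = encoded[-K:] @ W_out                            # keep registers only: (8, 4)
print(action_emb.shape, seq.shape, encoded.shape, z.shape)
```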

Shapes and sizes

Stage | Shape | Notes
Input action chunk | 32 × 7 | 32 timesteps, 7 joint dims
Action embeddings | 32 × 256 | Linear projection
Register tokens | K × 256 | K = 8 in OAT8
Encoder output (registers) | K × 256 | After self-attention
Projected registers | K × 4 | Projected to 4 dims for FSQ
Discrete tokens | K | Each in {0,...,999}
Reconstructed chunk | 32 × 7 | Same shape as input

[Interactive figure: Architecture Flow. Each module can be clicked to inspect its shapes and what it does.]

Why not use the action tokens as the bottleneck?

If you quantized every action token output (32 of them), you'd still have 32 tokens — no compression. The registers are a deliberate information bottleneck: K << 32 vectors must represent all information from 32 action timesteps. This forced compression is what gives OAT its efficiency.

The registers also have a nice property: they're position-aware (they have positional embeddings) but not tied to any specific action timestep. Register 1 can aggregate global trend information; register 8 can pick up residual fine details. The encoder is free to allocate information however minimizes reconstruction error.

After the transformer encoder runs in OAT, what happens to the action token outputs?

Chapter 3: FSQ — Finite Scalar Quantization

After the transformer encoder compresses the action chunk into K register vectors, each register must be converted to a single discrete token. This is the quantization step. OAT uses Finite Scalar Quantization (FSQ) — a simpler, more stable alternative to the VQ-VAE codebook.

How FSQ works

Each register vector is projected to a low-dimensional space (say 4 dimensions). Each dimension is then independently quantized to a small number of discrete levels. In OAT, the level configuration is [8, 5, 5, 5]: the first dimension can take 8 values, and the remaining three can each take 5. The total codebook size is 8 × 5 × 5 × 5 = 1000 possible tokens per register.

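To quantize a continuous value z ∈ [-1, 1] to L levels: round(z × (L-1)/2) / ((L-1)/2). In practice, a tanh activation bounds the output to [-1, 1] before rounding. This form yields exactly L levels when L is odd; for an even L, such as the 8-level first dimension, FSQ implementations add a half-step offset before rounding.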

T = round( tanh(z) × (L−1)/2 ) / ((L−1)/2)
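A minimal NumPy sketch of this step, modeled on the commonly used FSQ formulation with a half-step offset for even level counts. Whether OAT uses exactly this variant is an assumption, and `fsq_quantize` and `fsq_tokens` are names chosen for the example. It returns integer levels instead of the normalized values in the formula above, then packs the four levels into a single token id in {0,...,999}.

```python
import numpy as np

LEVELS = np.array([8, 5, 5, 5])

def fsq_quantize(z, levels=LEVELS, eps=1e-3):
    """Bound each dim with tanh, then round to one of levels[i] integer values."""
    half_l = (levels - 1) * (1 - eps) / 2
    offset = np.where(levels % 2 == 0, 0.5, 0.0)    # half-step shift for even L
    shift = np.arctanh(offset / half_l)
    bounded = np.tanh(z + shift) * half_l - offset
    return np.round(bounded).astype(np.int64)       # e.g. -4..3 for L=8, -2..2 for L=5

def fsq_tokens(z, levels=LEVELS):
    """Pack the per-dimension levels into one integer id per register."""
    digits = fsq_quantize(z, levels) + levels // 2  # shift each dim to 0..L-1
    basis = np.concatenate([[1], np.cumprod(levels[:-1])])   # [1, 8, 40, 200]
    return (digits * basis).sum(axis=-1)

z = np.random.default_rng(1).normal(size=(8, 4))    # 8 registers, 4 dims each
print(fsq_tokens(z))                                # 8 ids, each in 0..999
```

Because the grid is fixed, every id from 0 to 999 corresponds to a reachable combination of levels.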

Why FSQ beats VQ-VAE

VQ-VAE uses a learned codebook: N embedding vectors are stored, and each latent is replaced by its nearest neighbor in the codebook. This works but has a nasty failure mode called codebook collapse: the model learns to use only a few codebook entries (the rest become dead entries that no data ever maps to). You end up with a much smaller effective vocabulary than intended.

FSQ has no codebook to collapse. Every combination of level values is reachable — the 1000 tokens are defined by the grid structure, not by learned embeddings. All 1000 tokens are always available.

Gradients through quantization

Rounding is not differentiable. To train the encoder end-to-end, FSQ uses the straight-through estimator: during the forward pass, use the rounded value. During backpropagation, pretend the rounding didn't happen (gradient flows through as if it were an identity). This is a mild approximation that works well in practice.

Key advantage of FSQ over VQ-VAE: No codebook collapse, no auxiliary commitment loss, no exponential moving average updates. Just round, and use straight-through gradients. Simpler training, more stable tokenizers.

[Interactive figure: FSQ Quantization Grid. A 2D slice of the FSQ quantization space in which a continuous latent (z1, z2) snaps to the nearest grid point.]

Information capacity

With 8 tokens and 1000 codes each: 1000^8 = 10^24 possible token sequences. This dwarfs the actual number of distinct robot motion trajectories needed, leaving plenty of capacity to represent the action distribution for any realistic task.

Compare with binning: 256 bins per value and 32 × 7 = 224 tokens. Each token is 1 of 256, so there are 256^224 ≈ 10^539 sequences. Far more capacity than needed, at enormous token-count cost. OAT gets the same effective coverage with 8 tokens because the encoder learned which directions in trajectory space actually matter.
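Both capacities are easiest to compare on a log scale. The short check below computes the base-10 exponent of vocab_size^n_tokens; the helper name is hypothetical.

```python
import math

def log10_sequences(vocab_size, n_tokens):
    """Base-10 exponent of the number of distinct token sequences."""
    return n_tokens * math.log10(vocab_size)

oat_exponent = log10_sequences(1000, 8)
binning_exponent = log10_sequences(256, 224)
print(oat_exponent)                  # 24.0
print(round(binning_exponent, 1))    # 539.4
```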

Why does FSQ avoid codebook collapse (unlike VQ-VAE)?

Chapter 4: Nested Dropout — The Ordering Trick

Here is OAT's central innovation. The architecture so far (register bottleneck + FSQ) gives us compression and decodability. But nothing forces the registers to organize information in any particular order. Register 3 might encode the coarse trajectory and register 1 might encode a fine detail. The ordering is arbitrary.

We need token 1 to carry the most important information. Tokens 2 through 8 should add progressively finer details. This is P.3. How do you train a model to respect this ordering without manually designing what each token should encode?

The nested dropout trick

During training, after encoding the action chunk to 8 tokens [T1, T2, ..., T8], randomly sample a cutoff K ∈ {1, 2, ..., 8}. Replace tokens T_{K+1} through T8 with a learned MASK token. Then pass this masked sequence to the decoder. The decoder must reconstruct the full 32 × 7 action chunk from only the first K tokens.

What does the model learn? Token T1 alone must be enough to reconstruct a reasonable action (K=1 is chosen with probability 1/8). Token T2 must contain the next most-useful information (given T1). And so on. The ordering emerges from the training objective without any manual supervision.

The information-theoretic view: Nested dropout forces the encoder to allocate information greedily. Token 1 gets the highest-bandwidth channel (it must serve all values of K). Each subsequent token gets a progressively narrower channel. This is exactly the structure of a progressive code — like wavelets or progressive JPEG.

Why this enables "anytime" execution

At inference time, the autoregressive policy generates tokens one at a time: T1, then T2 | T1, then T3 | T_{1:2}, etc. At any point, you can stop generating and decode the partial sequence T_{1:K} (padding the rest with MASK tokens). Because the decoder was trained to handle exactly this situation, you always get a valid action.

More compute budget → more tokens generated → finer actions. Less time → stop at K=1 → coarser but valid action. This "anytime" property is unique to OAT among all action tokenizers.

The training loss

The overall training objective is reconstruction loss, averaged over random K:

ℒ = 𝔼_{K ~ Uniform{1,...,8}} [ ‖ a_{1:32} − Decoder(T_{1:K}, MASK_{K+1:8}) ‖² ]

This MSE loss is averaged over all choices of K, so each token is trained to be useful for reconstruction as part of a prefix of any length. There is no manual design of token roles: the gradients figure it out.
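The masking and averaging can be sketched as follows. The MASK id of 1000 and the toy decoder are assumptions made only so the example runs; a real decoder is a trained transformer, and training samples K at random instead of enumerating every cutoff as done here.

```python
import numpy as np

MASK_ID = 1000      # hypothetical id reserved outside the 0..999 FSQ codes
N_TOKENS = 8

def nested_dropout(tokens, k, mask_id=MASK_ID):
    """Keep the first k tokens and replace the tail with the MASK id."""
    masked = np.array(tokens, copy=True)
    masked[k:] = mask_id
    return masked

def expected_loss(chunk, tokens, decoder, n_tokens=N_TOKENS):
    """Mean squared reconstruction error, averaged exactly over K = 1..n_tokens."""
    losses = []
    for k in range(1, n_tokens + 1):
        recon = decoder(nested_dropout(tokens, k))
        losses.append(np.mean((chunk - recon) ** 2))
    return float(np.mean(losses))

# Toy stand-in decoder: reconstruction quality grows with the unmasked fraction.
target = np.ones((32, 7))
tokens = np.arange(N_TOKENS)
toy_decoder = lambda masked: target * np.mean(masked != MASK_ID)

print(nested_dropout(tokens, 3))
print(expected_loss(target, tokens, toy_decoder))
```

With this toy decoder the per-K error shrinks as K grows, which is the behavior the objective rewards in the real model.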

[Interactive figure: Coarse-to-Fine Reconstruction (Figure 2), the key result of OAT. A simulated 2D robot trajectory is encoded into 8 tokens; a slider for K shows the reconstruction sharpening from a rough approximation (K=1) to a full-fidelity match (K=8).]

Nested dropout trains the decoder to handle partial token sequences. What does this enable at inference time?

Chapter 5: Causal Attention Among Registers

Nested dropout creates an ordering incentive during training, but it's a soft signal — the loss alone doesn't strictly prevent register 2 from attending to register 8 during encoding. To enforce left-to-right information flow more rigorously, OAT applies causal masking among the register tokens.

The asymmetric attention pattern

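In the encoder's self-attention, three types of interactions happen between the (32 + K) tokens: action tokens attend to one another freely; each register attends to all action tokens; and register i attends only to registers 1 through i (itself and the registers before it).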

Why this works: Register 1 must summarize the entire action chunk independently, with no help from other registers. Register 2 can build on register 1's summary. Register 8 gets the full picture. This hierarchy of dependencies forces the information to organize coarse-to-fine — exactly P.3.
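One way to build such a mask is sketched below. The function name is hypothetical, and one detail is an assumption this lesson does not settle: here action tokens do not attend to registers.

```python
import numpy as np

def oat_attention_mask(n_actions, n_registers):
    """Boolean mask where mask[i, j] is True if token i may attend to token j."""
    n = n_actions + n_registers
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_actions, :n_actions] = True          # action -> action: unrestricted
    mask[n_actions:, :n_actions] = True          # register -> action: unrestricted
    mask[n_actions:, n_actions:] = np.tril(      # register i -> registers 1..i only
        np.ones((n_registers, n_registers), dtype=bool))
    return mask                                  # action -> register stays False (assumed)

print(oat_attention_mask(2, 3).astype(int))
```

In the printed grid, the first register row sees the action tokens and itself only, while the last register row sees everything.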

Combined with nested dropout

Causal attention and nested dropout are complementary. Nested dropout enforces ordering at the level of the decoder (what can be reconstructed from partial tokens). Causal attention enforces ordering at the level of the encoder (how information flows between registers during encoding). Together they create a strong, consistent ordering signal throughout the model.

The ablations in Chapter 8 show that removing either one degrades performance. They're not redundant — they enforce ordering from different angles.

[Interactive figure: Attention Mask Visualization. Compares standard (all-to-all) attention with OAT's causal pattern for 4 action tokens + 4 registers; teal = attention allowed, dark = masked out.]

What happens without causal attention

If registers can attend to each other freely, the encoder can route information arbitrarily. Register 1 might pull a summary from register 5 which pulled from register 3. The FSQ quantization then breaks this dependency, but the ordering is no longer guaranteed. In the ablation experiments, removing causal attention drops success rates across all benchmarks.

Interestingly, removing causal attention while keeping nested dropout still produces some ordering (nested dropout alone provides a training signal), but the ordering is weaker. The combination is strictly better than either alone.

In OAT's encoder, register token i can attend to which other tokens?

Chapter 6: The Autoregressive Policy

OAT is not a policy. It's a tokenizer. The policy is a separate model — a standard autoregressive language-model-style transformer — that generates OAT tokens conditioned on observations (images, language, proprioception).

Once you have the OAT tokenizer, training the policy is straightforward: encode all demonstration actions into OAT tokens offline, then train the policy to predict those tokens via next-token prediction on the observation context.

Generation procedure (Algorithm 2)

At inference time, given an observation o (images + language instruction + robot state):

Observe: Receive observation o (camera images, language, joint angles).
Generate T1: Policy outputs p(T1 | o). Sample from this distribution. T1 encodes coarse motion.
Generate T2, ..., TK: Each Ti ~ p(Ti | T_{1:i-1}, o). Continue until K tokens are generated (K chosen by compute budget).
Pad with MASK: If K < 8, pad remaining positions with the MASK token: [T_{1:K}, MASK, ..., MASK].
Decode: OAT decoder maps [T_{1:K}, MASK_{K+1:8}] → â_{1:32}. Execute the 32-step action chunk on the robot.
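The procedure above can be sketched as a short loop. The policy and decoder here are toy stand-ins (random token sampler, zero-valued chunk), and `generate_chunk`, `MASK_ID`, and the callable signatures are assumptions for illustration, not the paper's API.

```python
import numpy as np

MASK_ID = 1000      # hypothetical MASK id outside the 0..999 code range
N_TOKENS = 8

def generate_chunk(policy_sample, decoder, observation, k, n_tokens=N_TOKENS):
    """Sample k tokens autoregressively, pad with MASK, decode to an action chunk."""
    tokens = []
    for _ in range(k):
        # T_i ~ p(T_i | T_{1:i-1}, o): the policy sees the prefix generated so far.
        tokens.append(policy_sample(observation, tokens))
    padded = tokens + [MASK_ID] * (n_tokens - k)
    return decoder(np.array(padded)), padded

rng = np.random.default_rng(0)

def toy_policy(observation, prefix):
    """Stand-in for a trained policy: ignores its inputs and samples uniformly."""
    return int(rng.integers(0, 1000))

def toy_decoder(tokens):
    """Stand-in for the trained OAT decoder: always returns a 32 x 7 chunk."""
    return np.zeros((32, 7))

chunk, padded = generate_chunk(toy_policy, toy_decoder, observation=None, k=3)
print(chunk.shape, padded[3:])   # (32, 7) [1000, 1000, 1000, 1000, 1000]
```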

The anytime tradeoff in practice

The compute budget K gives you direct control over quality vs. speed:

K tokens | Latency (est.) | Action quality | Use when
1 | ~10ms | Coarse but valid | Time-critical, rough motions
4 | ~18ms | Good | Most tasks
8 | ~27ms | Full fidelity | Dexterous precision tasks
Binning (224) | ~517ms | Good (if it terminates) | Cannot use for real-time
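A controller could turn this table into a simple budget rule. The sketch below picks the largest tabulated K whose estimated latency fits a deadline; the latencies are the estimates from the table, and `choose_k` is a hypothetical helper.

```python
LATENCY_MS = {1: 10.0, 4: 18.0, 8: 27.0}    # estimated latencies from the table

def choose_k(budget_ms, latency_ms=LATENCY_MS):
    """Largest tabulated K whose estimated latency fits the budget, else None."""
    feasible = [k for k, ms in latency_ms.items() if ms <= budget_ms]
    return max(feasible) if feasible else None

print(choose_k(20.0))   # 4
print(choose_k(30.0))   # 8
print(choose_k(5.0))    # None
```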

Why this works for VLAs

Many modern VLAs (OpenVLA, RT-2) use autoregressive transformers. The policy backbone already knows how to do next-token prediction. OAT tokens can replace any discrete action vocabulary as a drop-in, with the benefit that the token sequence has an ordered structure aligned with next-token prediction.

Specifically: when the policy predicts T2, it already knows T1 (the coarse direction). So it only needs to predict the incremental refinement, not the entire action from scratch. This is a much easier prediction task, which partly explains OAT's stronger performance over unordered tokenizers like QueST.

[Interactive figure: Anytime Execution Demo. Simulates the compute budget vs. action quality tradeoff: a slider for K shows a simulated robot trajectory becoming more precise as more tokens are used, with success rate and latency updating alongside.]

If the compute budget only allows K=3 tokens (out of 8), what does OAT do with the remaining 5 positions?

Chapter 7: Results

OAT is evaluated on four robotics benchmarks covering 20+ tasks: LIBERO (long-horizon manipulation), RoboMimic (contact-rich manipulation), MetaWorld (diverse skills), and RoboCasa (household tasks). In each benchmark, OAT8 beats every baseline.

The headline numbers

Method | LIBERO | RoboMimic | MetaWorld | RoboCasa
OAT8 (ours) | 56.3% | 71.2% | 78.9% | 62.1%
QueST | 48.2% | 64.7% | 71.3% | 55.8%
Diffusion Policy | 36.6% | 58.3% | 63.2% | 47.4%
FAST | 23.0% | 31.4% | 49.7% | 28.9%
Binning | 14.4% | 22.1% | 38.5% | 19.3%

Why each method underperforms

Binning (14.4%) — 224 tokens per action chunk means 224 autoregressive steps per action. By the time the full action is generated, the robot is already behind. The latency alone makes this method impractical for real-time control. The policy also cannot condition later tokens on the full action context.

FAST (23.0%) — Variable-length BPE tokenization means arbitrary token sequences may not decode. When the learned policy generates novel token sequences (inevitable in out-of-distribution states), the decoder fails. The crashes and partial decodes tank the success rate. P.2 violation is not theoretical — it shows up clearly in the results.

Diffusion Policy (36.6%) — Continuous action diffusion works well but is not autoregressive. It cannot be easily combined with VLM backbones, and it lacks the anytime property. OAT enables a simpler, faster policy class (AR transformers) that outperforms diffusion on these benchmarks.

QueST (48.2%) — Close to OAT, and satisfies P.1 and P.2. But the latent bottleneck is unordered. The policy must generate all K tokens to get a useful action (no prefix decoding). More critically, the next-token prediction objective is misaligned — predicting T2 provides no additional semantic meaning beyond "the second unordered latent." OAT's ordering makes each prediction step meaningful.

Latency comparison

The latency gap is stark: OAT8 takes ~27ms to generate an action chunk. Binning takes ~517ms. This isn't a minor engineering difference — 517ms means the robot can only update its actions ~2 times per second, while OAT enables ~37 updates per second.
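Converting these latencies into update rates is a one-line calculation, assuming chunks are generated back to back with no other overhead:

```python
def updates_per_second(latency_ms):
    """How many action chunks per second fit at a given generation latency."""
    return 1000.0 / latency_ms

oat_rate = updates_per_second(27.0)
binning_rate = updates_per_second(517.0)
print(round(oat_rate, 1))       # 37.0
print(round(binning_rate, 1))   # 1.9
```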

[Interactive figure: Results Comparison. Toggles between benchmarks and metrics; OAT8 leads across all four benchmarks.]

The monotonic improvement from OAT1 to OAT8

One of OAT's most striking results: performance improves monotonically as K increases from 1 to 8. OAT1 (single token!) achieves ~30% on LIBERO — better than Binning's full 224-token generation. OAT4 reaches ~50%. OAT8 hits 56.3%.

This monotonic improvement is a direct consequence of causal ordering (P.3). Each additional token adds genuine information. If the tokens were unordered (QueST style), adding more tokens wouldn't reliably improve performance — sometimes the K-th token might carry information that contradicts or duplicates earlier tokens.

OAT8 outperforms QueST (56.3% vs 48.2% on LIBERO) despite both having 8 tokens and similar compression. What is the key difference?

Chapter 8: Ablations

The ablation study in Table III of the paper isolates the contribution of each design choice. This is how the authors confirm that their architectural decisions are individually necessary — not just collectively helpful.

What gets ablated

The paper ablates three main components: nested dropout, causal attention among registers, and the FSQ quantizer. Each can be removed independently to see its contribution.

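Configuration | P.1 | P.2 | P.3 | LIBERO | RoboCasa
OAT (full) | ✓ | ✓ | ✓ | 56.3% | 62.1%
No nested dropout | ✓ | ✓ | × | 44.1% | 51.3%
No causal attention | ✓ | ✓ | (partial) | 49.7% | 55.8%
No dropout + no causal | ✓ | ✓ | × | 43.5% | 50.2%
VQ-VAE instead of FSQ | ✓ | ✓ | ✓ | 51.8% | 57.4%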

Interpreting the numbers

No nested dropout (44.1%) — This is the most damaging removal. Without nested dropout, the decoder only sees all 8 tokens during training, so it never learns to reconstruct from partial sequences. The tokens become unordered. This is essentially QueST with causal attention — and it performs similarly to QueST.

No causal attention (49.7%) — Removing causal attention while keeping nested dropout is less catastrophic. Nested dropout still forces some ordering, but the encoder can route information non-causally between registers. The soft ordering is weaker, and the policy pays for it.

No dropout + no causal (43.5%) — Essentially QueST. Both ordering mechanisms removed. The result is a powerful but unordered tokenizer. Strong compression and decodability, but the autoregressive policy can't exploit the prefix structure.

VQ-VAE (51.8%): FSQ is better than VQ-VAE, likely due to codebook collapse. VQ-VAE uses only a fraction of its codebook in practice, whereas all 1000 FSQ codes stay reachable. This 4.5-percentage-point gap on LIBERO suggests codebook collapse is a real problem in this regime.
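The size of each effect is easier to compare as a drop from the full model. The sketch below recomputes those drops from the LIBERO column of the ablation table; the dictionary and helper names are hypothetical.

```python
LIBERO = {
    "OAT (full)": 56.3,
    "No nested dropout": 44.1,
    "No causal attention": 49.7,
    "No dropout + no causal": 43.5,
    "VQ-VAE instead of FSQ": 51.8,
}

def drops_from_full(results, baseline="OAT (full)"):
    """Percentage-point drop of each ablation relative to the full model."""
    return {name: round(results[baseline] - score, 1)
            for name, score in results.items() if name != baseline}

for name, drop in drops_from_full(LIBERO).items():
    print(f"{name}: -{drop} points")
```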

The critical finding: Nested dropout is the most important single ingredient. It's responsible for ~12 percentage points on LIBERO. The ordering (P.3) is not a "nice to have"; it is the primary driver of OAT's advantage over QueST and other compressed tokenizers.

[Interactive figure: Ablation Dashboard. Toggles components on and off to show how each contributes to performance on LIBERO; the full OAT has all three on.]

Which ablation has the biggest negative impact on OAT's performance?

Chapter 9: Connections

OAT and the VLA ecosystem

OAT is designed to plug into any autoregressive VLA. The key insight: VLAs like RT-2 and OpenVLA use transformer backbones that do next-token prediction. OAT simply provides a better token vocabulary for the action part of the sequence. You keep the entire VLM backbone, the training recipe, and the observation encoder, and replace only the action tokenizer.

For VLAs that use flow matching (like pi-0), OAT is not a direct replacement — flow matching generates continuous actions, so it doesn't need tokenization. But for autoregressive VLAs, OAT provides the missing piece: a tokenization scheme that is compact, safe, and ordered.

Progressive coding connections

OAT's causal ordering is mathematically analogous to progressive source coding — a classical information theory concept. In progressive JPEG, each scan refines the image by adding information about high-frequency components. In wavelets, the coarsest scale (low-frequency) comes first. In OAT, token T1 encodes the "DC component" of the action (overall direction), and later tokens encode higher-frequency details (precise finger placements, wrist rotations).

This connection suggests extensions: could you learn a multi-resolution action representation where different tokens correspond to different frequency bands? OAT doesn't explicitly do this, but the trained model discovers something like it spontaneously.

Limitations

OAT is a deterministic tokenizer once trained. The codebook is fixed. If a task requires actions outside the training distribution, the tokenizer may quantize poorly. Unlike diffusion, it cannot easily represent multi-modal action distributions (two valid ways to grasp an object) — the FSQ grid has limited capacity to encode bi-modality in a single token.

OAT also requires training a separate tokenizer on a corpus of robot demonstrations before training the policy. This adds a preprocessing step. If the tokenizer training data doesn't cover the target task distribution, the tokens may not be expressive enough.

Cheat sheet

Concept | OAT Answer
Compression (P.1) | 32 × 7 = 224 values → 8 tokens (28× compression)
Decodability (P.2) | FSQ: any K tokens decode via learned decoder
Ordering (P.3) | Nested dropout + causal register attention
Codebook size | 1000 per token ([8,5,5,5] FSQ levels)
Quantizer | FSQ (not VQ-VAE; avoids codebook collapse)
Bottleneck | K=8 learnable register tokens in encoder
Training signal | MSE reconstruction loss, averaged over K ~ Uniform{1,...,8}
Latency (OAT8) | ~27ms per action chunk
Latency (OAT1) | ~10ms per action chunk
Latency (Binning) | ~517ms per action chunk

Further reading

If you want to go deeper on the pieces OAT builds on:

The one-line takeaway: OAT is the first action tokenizer that compresses action chunks aggressively (P.1), guarantees every token sequence decodes safely (P.2), and orders tokens coarse-to-fine (P.3) — enabling anytime execution and outperforming all prior tokenizers across 20+ robotics tasks.

Based on "OAT: Ordered Action Tokenization for Autoregressive Robot Policies" by Chaoqi Liu, Xiaoshen Han, Jiawei Gao, Yue Zhao, Haonan Chen, Yilun Du (Harvard University + Stanford University, 2026). arXiv:2602.04215.