A learned tokenizer that converts 32-step robot action chunks into just 8 ordered discrete tokens — where the first token alone gives you a valid (coarse) action, and each additional token sharpens the motion like progressive JPEG.
A robot arm picks up objects by moving 7 joints through space — each joint has a continuous angle that changes 50 times per second. A 32-timestep action chunk (about 0.6 seconds of motion) is therefore a matrix of 32 × 7 = 224 continuous floating-point numbers.
Modern robot policies built on language models need to output these actions as discrete tokens — the same way a language model outputs words. But there is no "word vocabulary" for motor commands. You have to build one. And that choice turns out to matter enormously.
The obvious solution is binning: divide each continuous dimension into N discrete levels (say 256 bins), then represent each of the 224 values as one token. A single action chunk becomes 224 tokens. The transformer must predict them one at a time.
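As a rough illustration of the binning baseline described above, here is a minimal sketch. It assumes actions are already normalized to [-1, 1]; the function names `bin_encode` and `bin_decode` and the toy chunk are hypothetical and not taken from the paper.

```python
def bin_encode(value, low=-1.0, high=1.0, n_bins=256):
    """Map one continuous value to an integer bin index in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)
    scaled = (clipped - low) / (high - low)        # now in [0, 1]
    return min(int(scaled * n_bins), n_bins - 1)   # the top edge falls in the last bin


def bin_decode(index, low=-1.0, high=1.0, n_bins=256):
    """Return the center of the bin as the reconstructed value."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


# A toy 32 x 7 chunk (assumed normalized): every value becomes its own token.
chunk = [[0.01 * (t - 16) * (j + 1) / 7 for j in range(7)] for t in range(32)]
tokens = [bin_encode(v) for row in chunk for v in row]
print(len(tokens))  # one token per value in the chunk
```

Every index in [0, 255] decodes to a bin center, so binning is always decodable; its cost is purely the token count.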
224 tokens per action chunk is a disaster in practice. Language models predict one token per forward pass. If each forward pass takes ~10ms on modern hardware, generating one action chunk takes 224 × 10ms = 2.24 seconds. But the robot needs a new action every 20ms (50 Hz). The math doesn't work.
The solution is compression: represent the action chunk as far fewer tokens. If we can express 32 × 7 = 224 values with just 8 tokens, sequential generation takes 8 × 10ms = 80ms per chunk. That is still slow, but KV caching and parallel decoding bring it down to ~27ms, making real-time control feasible.
But compression is not the only challenge. The way you design the token vocabulary shapes whether the resulting policy is safe, useful, and efficient. This lesson is about designing that vocabulary correctly.
Interactive figure: token count as a function of horizon and action dimensions. OAT stays flat at 8 tokens regardless of horizon or dimensions.
A robot arm needs a new joint-angle command every 20ms. That's 7 floats every 20ms. Modern autoregressive policies output actions in chunks — predict 32 steps at once, execute them, then predict the next chunk. This gives you 32 × 20ms = 640ms to generate the next chunk.
With 8 tokens and ~3ms per token (KV cache, batched decode): 8 × 3ms = 24ms. That's well within the 640ms budget, and even within the 20ms budget if you use just 1 token. This is the "anytime" property OAT is designed to enable.
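The budget arithmetic above can be checked with a few lines. This is only a sketch of the numbers quoted in the text (50 Hz control, 32-step chunks, ~10ms or ~3ms per token); the helper name `generation_ms` is made up for illustration.

```python
def generation_ms(n_tokens, ms_per_token):
    """Sequential decoding cost: one forward pass per token."""
    return n_tokens * ms_per_token


CONTROL_HZ = 50
STEP_MS = 1000 / CONTROL_HZ               # 20 ms between commands
CHUNK_STEPS = 32
CHUNK_BUDGET_MS = CHUNK_STEPS * STEP_MS   # 640 ms to produce the next chunk

binning_ms = generation_ms(32 * 7, 10)    # 224 tokens at ~10 ms each
oat8_ms = generation_ms(8, 3)             # 8 tokens at ~3 ms each
oat1_ms = generation_ms(1, 3)             # a single coarse token

for name, ms in [("binning", binning_ms), ("OAT8", oat8_ms), ("OAT1", oat1_ms)]:
    print(f"{name:8s} {ms:7.1f} ms  "
          f"fits chunk budget: {ms <= CHUNK_BUDGET_MS}  "
          f"fits one step: {ms <= STEP_MS}")
```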
OAT defines three properties that a good action tokenizer should have. The paper calls them P.1, P.2, and P.3. Most existing methods satisfy only one or two. OAT satisfies all three.
A 32 × 7 action chunk contains 224 continuous values. An ideal tokenizer should represent this with far fewer discrete tokens — say 8. This is roughly a 28× compression ratio. High compression is necessary for fast autoregressive generation.
Every possible sequence of tokens — including sequences the policy has never generated — must decode to a valid, executable action. This sounds obvious, but it's a subtle and critical requirement.
FAST (Frequency-space Action Sequence Tokenization) violates this property. FAST applies BPE (Byte-Pair Encoding) to the flattened, quantized DCT coefficients of an action chunk. BPE produces variable-length token sequences with complex internal structure. An arbitrary sequence of FAST tokens may decode to a partial, malformed action chunk, causing the robot controller to crash at runtime. This isn't theoretical: FAST fails on real hardware.
Total decodability means: any K tokens you decode give you some complete action. No partial parses, no malformed outputs, no runtime crashes.
In autoregressive generation, the model generates token T1 first, then T2 conditioned on T1, and so on. If tokens are unordered — like in QueST, which uses an unstructured latent bottleneck — then the first token contains no more information than the last. You can't stop early and get a useful action.
Causal ordering means: token T1 encodes the most important (coarse) information. T2 adds the next layer of detail. T8 adds the final fine-grained correction. This is aligned with next-token prediction: the model can condition each new token on everything it already knows.
P.1 alone (compression) lets you be fast, but if tokens are unordered you can't use partial sequences. P.2 alone (total decodability) makes you safe but not fast: naive binning decodes every sequence, yet it takes 224 tokens. P.3 alone (ordering) lets you stop early, but only if the partial output decodes to something valid (needs P.2) and only if it's compact enough (needs P.1).
The three properties are interdependent. Only OAT achieves all three.
OAT is a learned encoder-decoder. You train it once on robot demonstration data (offline), then use it to tokenize actions for any downstream policy. The tokenizer itself is not the policy — it's a reusable component that any autoregressive policy can plug in.
Here is the key architectural trick. The encoder receives two kinds of input: the action tokens (the 32 × 7 action chunk, projected to embeddings) and a fixed set of K learnable register tokens. These registers are randomly initialized vectors that are learned through gradient descent during tokenizer training.
After the transformer encoder runs, we throw away the action token outputs and keep only the K register outputs. The registers act as a bottleneck: they must compress the entire action chunk's information into K vectors of dimension d.
Why registers as the bottleneck? Because register tokens can be quantized independently. Each register maps to one discrete token. The K registers become K tokens. You never need to quantize the action tokens directly — they're just input context.
| Stage | Shape | Notes |
|---|---|---|
| Input action chunk | 32 × 7 | 32 timesteps, 7 joint dims |
| Action embeddings | 32 × 256 | Linear projection |
| Register tokens | K × 256 | K = 8 in OAT8 |
| Encoder output (registers) | K × 256 | After self-attention |
| Projected registers | K × 4 | Projected to 4 dims for FSQ |
| Discrete tokens | K | Each in {0,...,999} |
| Reconstructed chunk | 32 × 7 | Same shape as input |
If you quantized every action token output (32 of them), you'd still have 32 tokens — no compression. The registers are a deliberate information bottleneck: K << 32 vectors must represent all information from 32 action timesteps. This forced compression is what gives OAT its efficiency.
The registers also have a nice property: they're position-aware (they have positional embeddings) but not tied to any specific action timestep. Register 1 can aggregate global trend information; register 8 can pick up residual fine details. The encoder is free to allocate information however minimizes reconstruction error.
After the transformer encoder compresses the action chunk into K register vectors, each register must be converted to a single discrete token. This is the quantization step. OAT uses Finite Scalar Quantization (FSQ) — a simpler, more stable alternative to the VQ-VAE codebook.
Each register vector is projected to a low-dimensional space (say 4 dimensions). Each dimension is then independently quantized to a small number of discrete levels. In OAT, the level configuration is [8, 5, 5, 5]: the first dimension can take 8 values, and the remaining three can each take 5. The total codebook size is 8 × 5 × 5 × 5 = 1000 possible tokens per register.
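To quantize a continuous value z ∈ [-1, 1] to L levels: round(z × (L-1)/2) / ((L-1)/2). This form yields exactly L levels when L is odd; an even L (such as the 8 in [8, 5, 5, 5]) needs a half-step offset before rounding. In practice, a tanh activation bounds the output to [-1, 1] before rounding.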
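A minimal sketch of this quantization step, using the [8, 5, 5, 5] level configuration from the text. Two things are assumptions rather than the paper's code: the sketch rescales each bounded value to [0, L - 1] before rounding (so even level counts also produce exactly L levels), and it packs the four per-dimension codes into one token id with a mixed-radix index. All function names are hypothetical.

```python
import math

LEVELS = [8, 5, 5, 5]   # level configuration quoted in the text


def fsq_codes(latent, levels=LEVELS):
    """Quantize each latent dimension to an integer code in [0, L - 1]."""
    codes = []
    for x, L in zip(latent, levels):
        z = math.tanh(x)                   # bound to (-1, 1)
        pos = (z + 1.0) / 2.0 * (L - 1)    # rescale to [0, L - 1] (assumed variant)
        codes.append(int(round(pos)))
    return codes


def codes_to_token(codes, levels=LEVELS):
    """Pack per-dimension codes into a single token id (mixed radix)."""
    token = 0
    for c, L in zip(codes, levels):
        token = token * L + c
    return token


def token_to_codes(token, levels=LEVELS):
    """Inverse of codes_to_token: recover the per-dimension codes."""
    codes = []
    for L in reversed(levels):
        codes.append(token % L)
        token //= L
    return codes[::-1]


def codes_to_values(codes, levels=LEVELS):
    """Map integer codes back to grid points in [-1, 1]."""
    return [2.0 * c / (L - 1) - 1.0 for c, L in zip(codes, levels)]


codes = fsq_codes([10.0, -10.0, 0.0, 0.3])
token = codes_to_token(codes)
print(codes, token, codes_to_values(token_to_codes(token)))
```

In this layout every integer from 0 to 999 maps back to a grid point, which is the quantizer-level reason no token id can fail to decode.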
VQ-VAE uses a learned codebook: a fixed set of embedding vectors is stored, and each latent is replaced by its nearest neighbor in the codebook. This works but has a nasty failure mode called codebook collapse: the model learns to use only a few codebook entries (the rest become dead entries that no data ever maps to). You end up with a much smaller effective vocabulary than intended.
FSQ has no codebook to collapse. Every combination of level values is reachable — the 1000 tokens are defined by the grid structure, not by learned embeddings. All 1000 tokens are always available.
Rounding is not differentiable. To train the encoder end-to-end, FSQ uses the straight-through estimator: during the forward pass, use the rounded value. During backpropagation, pretend the rounding didn't happen (gradient flows through as if it were an identity). This is a mild approximation that works well in practice.
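One common way to write the straight-through estimator is shown below. The stop-gradient operator sg(·) is notation assumed here, not taken from the source: it returns its argument in the forward pass and is treated as a constant during backpropagation.

```latex
\hat{z} = z + \operatorname{sg}\big(\operatorname{round}(z) - z\big),
\qquad
\frac{\partial \hat{z}}{\partial z} := 1
```

In the forward pass the z terms cancel and the output equals round(z); in the backward pass only the leading z contributes, so the gradient passes through as if rounding were the identity.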
Interactive figure: a 2D slice of the FSQ quantization space. The orange dot is a continuous latent; it snaps to the nearest grid point (teal).
With 8 tokens and 1000 codes each: 1000^8 = 10^24 possible action sequences. This dwarfs the actual number of distinct robot motion trajectories needed, so there is plenty of capacity to represent the action distribution for any realistic task.
Compare with binning: 256 bins per value and 32 × 7 = 224 values give 224 tokens. Each token is 1 of 256, so there are 256^224 ≈ 10^539 possible sequences. That is far more capacity than needed, at enormous token-count cost. OAT gets the same effective coverage with 8 tokens because the encoder learned which directions in trajectory space actually matter.
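The two capacity figures are easy to sanity-check by working in log base 10. This is only arithmetic on the numbers quoted above.

```python
import math

# OAT: 8 tokens, 1000 codes each.
oat_log10 = 8 * math.log10(1000)

# Binning: 224 tokens, 256 bins each.
binning_log10 = 224 * math.log10(256)

print(f"OAT sequences     ~ 10^{oat_log10:.1f}")
print(f"Binning sequences ~ 10^{binning_log10:.1f}")
```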
Here is OAT's central innovation. The architecture so far (register bottleneck + FSQ) gives us compression and decodability. But nothing forces the registers to organize information in any particular order. Register 3 might encode the coarse trajectory and register 1 might encode a fine detail. The ordering is arbitrary.
We need token 1 to carry the most important information. Tokens 2 through 8 should add progressively finer details. This is P.3. How do you train a model to respect this ordering without manually designing what each token should encode?
During training, after encoding the action chunk to 8 tokens [T1, T2, ..., T8], randomly sample a cutoff K ∈ {1, 2, ..., 8}. Replace tokens T_{K+1} through T8 with a learned MASK token (for K = 8, nothing is masked). Then pass this masked sequence to the decoder. The decoder must reconstruct the full 32 × 7 action chunk from only the first K tokens.
What does the model learn? Token T1 alone must be enough to reconstruct a reasonable action (K=1 is chosen with probability 1/8). Token T2 must contain the next most-useful information (given T1). And so on. The ordering emerges from the training objective without any manual supervision.
At inference time, the autoregressive policy generates tokens one at a time: T1, then T2 | T1, then T3 | T1:2, etc. At any point, you can stop generating and decode the partial sequence T1:K (padding the rest with MASK tokens). Because the training objective trained the decoder to handle exactly this situation, you always get a valid action.
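The masking step is the same at training time (random cutoff) and at inference time (cutoff chosen by the caller). A minimal sketch, operating on token ids for simplicity; the `MASK` id and the function names are hypothetical, and in the real model the mask is a learned embedding rather than an integer.

```python
import random

MASK = -1   # hypothetical id standing in for the learned MASK token


def mask_tail(tokens, keep, mask_id=MASK):
    """Keep the first `keep` tokens and replace the rest with MASK."""
    if not 1 <= keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")
    return tokens[:keep] + [mask_id] * (len(tokens) - keep)


def nested_dropout(tokens, rng=random):
    """Training-time step: sample a cutoff uniformly, then mask the tail."""
    keep = rng.randint(1, len(tokens))   # inclusive on both ends
    return mask_tail(tokens, keep), keep


full = [412, 87, 903, 15, 660, 2, 771, 349]   # made-up token ids
print(mask_tail(full, 3))                      # inference: stop after 3 tokens
print(nested_dropout(full, random.Random(0)))  # training: random cutoff
```

The decoder always receives a length-8 sequence, so its input shape never changes with the cutoff.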
More compute budget → more tokens generated → finer actions. Less time → stop at K=1 → coarser but valid action. Among the tokenizers compared in this lesson, only OAT has this "anytime" property.
The overall training objective is reconstruction loss, averaged over random K:
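One way to write this objective, reconstructed from the description in this lesson rather than copied from the paper. The symbols are assumptions: E is the encoder, Q the FSQ quantizer, D the decoder, a an action chunk drawn from the demonstration dataset, and the tokens after position K are replaced by MASK.

```latex
\mathcal{L}
= \mathbb{E}_{a \sim \mathcal{D}} \;
  \mathbb{E}_{K \sim \mathrm{Uniform}\{1,\dots,8\}}
  \Big[ \big\| a - D\big(T_{1:K},\, \mathrm{MASK}_{K+1:8}\big) \big\|_2^2 \Big],
\qquad
T_{1:8} = Q\big(E(a)\big)
```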
This MSE loss is averaged over all choices of K, so each token is trained to be useful for reconstruction as part of every prefix that contains it. No manual design of token roles is needed: the gradients figure it out.
This is the key result of OAT. Interactive figure: a simulated 2D robot trajectory is encoded into 8 tokens and decoded from the first K. The reconstruction sharpens from a rough approximation at K=1 to the closest match at K=8.
Nested dropout creates an ordering incentive during training, but it's a soft signal — the loss alone doesn't strictly prevent register 2 from attending to register 8 during encoding. To enforce left-to-right information flow more rigorously, OAT applies causal masking among the register tokens.
In the encoder's self-attention, three types of interactions happen between the (32 + K) tokens:
Causal attention and nested dropout are complementary. Nested dropout enforces ordering at the level of the decoder (what can be reconstructed from partial tokens). Causal attention enforces ordering at the level of the encoder (how information flows between registers during encoding). Together they create a strong, consistent ordering signal throughout the model.
The ablations in Chapter 8 show that removing either one degrades performance. They're not redundant — they enforce ordering from different angles.
Interactive figure: encoder attention masks, standard (all-to-all) versus OAT causal, shown for 4 action tokens + 4 registers. Teal = attention allowed, dark = masked out.
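A sketch of what such a mask could look like for 4 action tokens + 4 registers. The text only states that registers attend causally to one another; the rest of the layout here is an assumption made for illustration (action tokens attend only to action tokens, and every register reads all action tokens). The function names are hypothetical.

```python
def register_causal_mask(n_action, n_register, causal=True):
    """mask[q][k] is True when query token q may attend to key token k.

    Token order: action tokens first, then registers.
    """
    n = n_action + n_register
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if not causal:
                mask[q][k] = True        # standard all-to-all attention
            elif k < n_action:
                mask[q][k] = True        # every token may read action tokens
            elif q >= n_action:
                mask[q][k] = k <= q      # register sees itself and earlier registers
            # else: action token querying a register stays masked (assumption)
    return mask


def render(mask):
    """Text rendering: '#' = attention allowed, '.' = masked out."""
    return "\n".join("".join("#" if ok else "." for ok in row) for row in mask)


print(render(register_causal_mask(4, 4)))
```

The lower-right block of the printed grid is lower-triangular: register i never reads from register j > i.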
If registers can attend to each other freely, the encoder can route information arbitrarily. Register 1 might pull a summary from register 5 which pulled from register 3. The FSQ quantization then breaks this dependency, but the ordering is no longer guaranteed. In the ablation experiments, removing causal attention drops success rates across all benchmarks.
Interestingly, removing causal attention while keeping nested dropout still produces some ordering (nested dropout alone provides a training signal), but the ordering is weaker. The combination is strictly better than either alone.
OAT is not a policy. It's a tokenizer. The policy is a separate model — a standard autoregressive language-model-style transformer — that generates OAT tokens conditioned on observations (images, language, proprioception).
Once you have the OAT tokenizer, training the policy is straightforward: encode all demonstration actions into OAT tokens offline, then train the policy to predict those tokens via next-token prediction on the observation context.
At inference time, given an observation o (images + language instruction + robot state):
The compute budget K gives you direct control over quality vs. speed:
| K tokens | Latency (est.) | Action quality | Use when |
|---|---|---|---|
| 1 | ~10ms | Coarse but valid | Time-critical, rough motions |
| 4 | ~18ms | Good | Most tasks |
| 8 | ~27ms | Full fidelity | Dexterous precision tasks |
| Binning (224) | ~517ms | Good (if it terminates) | Cannot use for real-time |
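The anytime trade-off in the table can be sketched as a generation loop that stops when the next token would exceed a time budget, then pads with MASK. Everything here is illustrative: `next_token` is a hypothetical stand-in for one policy forward pass, the per-token cost is passed in rather than measured, and `MASK` is a placeholder id.

```python
MASK = -1   # hypothetical id standing in for the MASK token


def anytime_tokens(next_token, budget_ms, ms_per_token, max_tokens=8, mask_id=MASK):
    """Generate tokens until the next one would exceed the budget, then pad.

    At least one token is always generated, so the decoder always has
    a coarse action to work with.
    """
    prefix, spent = [], 0.0
    while len(prefix) < max_tokens:
        if prefix and spent + ms_per_token > budget_ms:
            break
        prefix.append(next_token(prefix))
        spent += ms_per_token
    padded = prefix + [mask_id] * (max_tokens - len(prefix))
    return padded, len(prefix)


def dummy_policy(prefix):
    """Placeholder for the real policy: returns a made-up token id."""
    return (len(prefix) * 111) % 1000


for budget in (1, 10, 100):
    print(budget, anytime_tokens(dummy_policy, budget_ms=budget, ms_per_token=3))
```

The padded sequence is what the tokenizer's decoder would receive, matching the prefix-plus-MASK inputs it saw during nested-dropout training.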
Many modern VLAs (such as OpenVLA and RT-2) use autoregressive transformers, so the policy backbone already knows how to do next-token prediction. OAT tokens can replace any discrete action vocabulary as a drop-in, with the added benefit that the token sequence has an ordered structure aligned with next-token prediction.
Specifically: when the policy predicts T2, it already knows T1 (the coarse direction). So it only needs to predict the incremental refinement, not the entire action from scratch. This is a much easier prediction task, which partly explains OAT's stronger performance over unordered tokenizers like QueST.
OAT is evaluated on four robotics benchmarks covering 20+ tasks: LIBERO (long-horizon manipulation), RoboMimic (contact-rich manipulation), MetaWorld (diverse skills), and RoboCasa (household tasks). In each benchmark, OAT8 beats every baseline.
| Method | LIBERO | RoboMimic | MetaWorld | RoboCasa |
|---|---|---|---|---|
| OAT8 (ours) | 56.3% | 71.2% | 78.9% | 62.1% |
| QueST | 48.2% | 64.7% | 71.3% | 55.8% |
| Diffusion Policy | 36.6% | 58.3% | 63.2% | 47.4% |
| FAST | 23.0% | 31.4% | 49.7% | 28.9% |
| Binning | 14.4% | 22.1% | 38.5% | 19.3% |
Binning (14.4%) — 224 tokens per action chunk means 224 autoregressive steps per action. By the time the full action is generated, the robot is already behind. The latency alone makes this method impractical for real-time control. The policy also cannot condition later tokens on the full action context.
FAST (23.0%) — Variable-length BPE tokenization means arbitrary token sequences may not decode. When the learned policy generates novel token sequences (inevitable in out-of-distribution states), the decoder fails. The crashes and partial decodes tank the success rate. P.2 violation is not theoretical — it shows up clearly in the results.
Diffusion Policy (36.6%) — Continuous action diffusion works well but is not autoregressive. It cannot be easily combined with VLM backbones, and it lacks the anytime property. OAT enables a simpler, faster policy class (AR transformers) that outperforms diffusion on these benchmarks.
QueST (48.2%) — Close to OAT, and satisfies P.1 and P.2. But the latent bottleneck is unordered. The policy must generate all K tokens to get a useful action (no prefix decoding). More critically, the next-token prediction objective is misaligned — predicting T2 provides no additional semantic meaning beyond "the second unordered latent." OAT's ordering makes each prediction step meaningful.
The latency gap is stark: OAT8 takes ~27ms to generate an action chunk. Binning takes ~517ms. This isn't a minor engineering difference — 517ms means the robot can only update its actions ~2 times per second, while OAT enables ~37 updates per second.
One of OAT's most striking results: performance improves monotonically as K increases from 1 to 8. OAT1 (single token!) achieves ~30% on LIBERO — better than Binning's full 224-token generation. OAT4 reaches ~50%. OAT8 hits 56.3%.
This monotonic improvement is a direct consequence of causal ordering (P.3). Each additional token adds genuine information. If the tokens were unordered (QueST style), adding more tokens wouldn't reliably improve performance — sometimes the K-th token might carry information that contradicts or duplicates earlier tokens.
The ablation study in Table III of the paper isolates the contribution of each design choice. This is how the authors confirm that their architectural decisions are individually necessary — not just collectively helpful.
The paper ablates three main components: nested dropout, causal attention among registers, and the FSQ quantizer. Each can be removed independently to see its contribution.
| Configuration | P.1 | P.2 | P.3 | LIBERO | RoboCasa |
|---|---|---|---|---|---|
| OAT (full) | ✓ | ✓ | ✓ | 56.3% | 62.1% |
| No nested dropout | ✓ | ✓ | × | 44.1% | 51.3% |
| No causal attention | ✓ | ✓ | (partial) | 49.7% | 55.8% |
| No dropout + no causal | ✓ | ✓ | × | 43.5% | 50.2% |
| VQ-VAE instead of FSQ | ✓ | ✓ | ✓ | 51.8% | 57.4% |
No nested dropout (44.1%): This is the most damaging single removal. Without nested dropout, the decoder only sees all 8 tokens during training, so it never learns to reconstruct from partial sequences. The tokens become unordered. This is essentially QueST with causal attention, and it lands a few points below QueST's 48.2% on LIBERO.
No causal attention (49.7%): Removing causal attention while keeping nested dropout is less catastrophic. Nested dropout still forces some ordering, but the encoder can route information non-causally between registers. The soft ordering is weaker, and the policy pays for it.
No dropout + no causal (43.5%): Essentially QueST. Both ordering mechanisms are removed. The result is a powerful but unordered tokenizer, with strong compression and decodability, but the autoregressive policy can't exploit the prefix structure.
VQ-VAE (51.8%): FSQ outperforms VQ-VAE, likely because of codebook collapse: VQ-VAE tends to use only a fraction of its codebook in practice, whereas all 1000 FSQ codes stay reachable. The 4.5-point gap on LIBERO suggests codebook collapse is a real problem in this regime.
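To make the comparison concrete, the LIBERO column of the ablation table can be turned into percentage-point drops relative to the full model. The numbers are copied from the table; nothing else is assumed.

```python
libero = {
    "OAT (full)": 56.3,
    "No nested dropout": 44.1,
    "No causal attention": 49.7,
    "No dropout + no causal": 43.5,
    "VQ-VAE instead of FSQ": 51.8,
}

full = libero["OAT (full)"]
drops = {
    name: round(full - score, 1)
    for name, score in libero.items()
    if name != "OAT (full)"
}

# Largest drop first.
for name, drop in sorted(drops.items(), key=lambda item: -item[1]):
    print(f"{name:24s} -{drop:.1f} points")
```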
OAT is designed to plug into any autoregressive VLA. The key insight: VLAs like RT-2 and OpenVLA use transformer backbones that do next-token prediction. OAT simply provides a better token vocabulary for the action part of the sequence. You keep the entire VLM backbone, the training recipe, and the observation encoder, and replace only the action tokenizer.
For VLAs that use flow matching (like pi-0), OAT is not a direct replacement — flow matching generates continuous actions, so it doesn't need tokenization. But for autoregressive VLAs, OAT provides the missing piece: a tokenization scheme that is compact, safe, and ordered.
OAT's causal ordering is mathematically analogous to progressive source coding — a classical information theory concept. In progressive JPEG, each scan refines the image by adding information about high-frequency components. In wavelets, the coarsest scale (low-frequency) comes first. In OAT, token T1 encodes the "DC component" of the action (overall direction), and later tokens encode higher-frequency details (precise finger placements, wrist rotations).
This connection suggests extensions: could you learn a multi-resolution action representation where different tokens correspond to different frequency bands? OAT doesn't explicitly do this, but the trained model discovers something like it spontaneously.
OAT is a deterministic tokenizer once trained. The codebook is fixed. If a task requires actions outside the training distribution, the tokenizer may quantize poorly. Unlike diffusion, it cannot easily represent multi-modal action distributions (two valid ways to grasp an object) — the FSQ grid has limited capacity to encode bi-modality in a single token.
OAT also requires training a separate tokenizer on a corpus of robot demonstrations before training the policy. This adds a preprocessing step. If the tokenizer training data doesn't cover the target task distribution, the tokens may not be expressive enough.
| Concept | OAT Answer |
|---|---|
| Compression (P.1) | 32×7=224 values → 8 tokens (28× compression) |
| Decodability (P.2) | FSQ: any K tokens decode via learned decoder |
| Ordering (P.3) | Nested dropout + causal register attention |
| Codebook size | 1000 per token ([8,5,5,5] FSQ levels) |
| Quantizer | FSQ (not VQ-VAE — avoids codebook collapse) |
| Bottleneck | K=8 learnable register tokens in encoder |
| Training signal | MSE reconstruction loss, averaged over K~Uniform{1,...,8} |
| Latency (OAT8) | ~27ms per action chunk |
| Latency (OAT1) | ~10ms per action chunk |
| Latency (Binning) | ~517ms per action chunk |
If you want to go deeper on the pieces OAT builds on:
Based on "OAT: Ordered Action Tokenization for Autoregressive Robot Policies" by Chaoqi Liu, Xiaoshen Han, Jiawei Gao, Yue Zhao, Haonan Chen, Yilun Du (Harvard University + Stanford University, 2026). arXiv:2602.04215