Why do vision-language-action models (VLAs) fail at dexterous tasks? Because naive action tokenization destroys high-frequency information. FAST fixes this with compression-based tokenization built on the discrete cosine transform, producing 5-10x fewer tokens and enabling up to 5x faster training.
Autoregressive VLAs like RT-2 and OpenVLA work by predicting robot actions as discrete tokens — just like language models predict word tokens. But how do you convert a continuous action (like "move joint 3 by 0.0342 radians") into a discrete token?
The standard approach is naive binning: divide each action dimension into 256 equally-spaced bins. A joint angle of 0.0342 then maps to a single bin, say bin 137 (the exact index depends on the range assigned to that dimension), and the model predicts "137" as a token. Simple.
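As a minimal sketch of naive binning, the snippet below maps one continuous value to one of 256 equal-width bins and back. The function names `bin_action` and `unbin_action` and the assumed range of [-1, 1] are illustrative choices, not taken from any particular VLA codebase. Because the bin index depends on that range, the index this sketch produces need not match the illustrative 137 above.

```python
def bin_action(value, low=-1.0, high=1.0, n_bins=256):
    """Map a continuous action value to an integer bin index in [0, n_bins - 1]."""
    # Clip so out-of-range values land in the first or last bin.
    clipped = min(max(value, low), high)
    # Scale to [0, 1], then to a bin index; the top edge maps to the last bin.
    fraction = (clipped - low) / (high - low)
    return min(int(fraction * n_bins), n_bins - 1)


def unbin_action(index, low=-1.0, high=1.0, n_bins=256):
    """Return the centre of a bin, i.e. the value a decoder would recover."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


token = bin_action(0.0342)      # assumed range [-1, 1]
recovered = unbin_action(token)
print(token, round(recovered, 4))
```

The round trip is lossy by at most half a bin width, which is the price of discretization with this scheme.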
This works fine for low-frequency control (5-10 Hz, like RT-2's pick-and-place). But it completely breaks down for high-frequency dexterous tasks (50 Hz, like folding laundry). The paper demonstrates this with a striking experiment: as control frequency increases, naive-tokenized VLAs stop learning entirely — they just copy the first action token over and over.
At low control frequencies, naive binning works. At high frequencies, consecutive tokens become identical and the model collapses to copying.
To understand the failure deeply, consider the autoregressive training objective. The model is trained to predict the next token $T_i$ given all previous tokens $T_{1:i-1}$. The learning signal is proportional to the marginal information content of $T_i$ given $T_{1:i-1}$.
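Written out as a sketch, the objective and its information-theoretic floor look like this. The symbols $p_\theta$ (the model's predictive distribution), $n$ (the number of action tokens), and $H$ (Shannon entropy) are notation assumed here, not taken from the source:

```latex
\mathcal{L}(\theta) = -\sum_{i=1}^{n} \log p_\theta\left(T_i \mid T_{1:i-1}\right),
\qquad
\mathbb{E}\left[\mathcal{L}(\theta)\right] \;\ge\; \sum_{i=1}^{n} H\left(T_i \mid T_{1:i-1}\right)
```

When each token is almost fully determined by its predecessors, every conditional entropy term on the right is close to zero, so there is very little left for the model to learn from.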
With naive per-timestep binning at high frequencies, consecutive actions barely differ, so they fall into the same bin and the model sees the same token repeated. Predicting "same as before" achieves near-zero loss. The model has no incentive to learn the actual underlying motion: the tokenization has destroyed the signal.
The paper demonstrates this with a controlled experiment: train the same model on the same data at different sampling rates (25 Hz to 800 Hz). At low rates, it works. At high rates, it collapses to copying. The underlying signal hasn't changed; only how densely it is sampled before tokenization has.
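The repetition effect is easy to reproduce without training anything. The sketch below is not the paper's experiment: it samples a hypothetical slow sine-wave motion at several rates, bins it into 256 bins over an assumed range of [-1, 1], and measures how often a token equals its predecessor. The names `bin_action` and `repeat_fraction` are made up for this illustration.

```python
import math


def bin_action(value, low=-1.0, high=1.0, n_bins=256):
    """Naive equal-width binning (assumed range [-1, 1])."""
    clipped = min(max(value, low), high)
    return min(int((clipped - low) / (high - low) * n_bins), n_bins - 1)


def repeat_fraction(sample_rate_hz, duration_s=1.0, motion_hz=0.5):
    """Fraction of tokens identical to the previous token for a smooth signal."""
    n = int(sample_rate_hz * duration_s)
    # Hypothetical smooth motion: a slow sine wave well inside the bin range.
    signal = [
        0.5 * math.sin(2 * math.pi * motion_hz * i / sample_rate_hz)
        for i in range(n)
    ]
    tokens = [bin_action(v) for v in signal]
    repeats = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return repeats / (len(tokens) - 1)


for rate in (10, 50, 200, 800):
    print(rate, round(repeat_fraction(rate), 3))
```

The same one-second motion yields few or no repeated tokens at low rates and mostly repeated tokens at high rates, which is the regime where "copy the previous token" becomes a near-optimal prediction.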
The solution comes from signal processing. The Discrete Cosine Transform (DCT) decomposes a signal into a sum of cosine waves at different frequencies. Low-frequency components capture the overall shape; high-frequency components capture sharp details.
Robot actions are smooth signals — the robot's joints don't teleport. This means most of the information is in the low-frequency DCT coefficients. The high-frequency coefficients are near zero and can be dropped.
This is the exact same principle behind JPEG image compression: pixels vary smoothly, so most DCT coefficients are near zero, so images compress well. Robot actions compress well for the same reason: a joint trajectory is a smooth 1D time series.
A smooth action trajectory has most of its energy in low frequencies, so only a small number of low-frequency coefficients is needed to reconstruct it.
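To make the energy-compaction claim concrete, here is a small sketch using a plain-Python orthonormal DCT-II and its inverse. The trajectory, the 50-sample length, and the cutoff of 8 coefficients are hypothetical values chosen for illustration. A real implementation would more likely call a library routine than this O(N²) loop.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1D sequence (O(N^2), fine for short chunks)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        total = coeffs[0] * math.sqrt(1.0 / n)
        total += sum(
            c * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs)
            if k > 0
        )
        signal.append(total)
    return signal


# Hypothetical smooth one-second joint trajectory sampled at 50 Hz.
trajectory = [0.5 * math.sin(math.pi * i / 50) + 0.1 * i / 50 for i in range(50)]
coeffs = dct(trajectory)

# Keep only the first `cutoff` low-frequency coefficients and zero the rest.
cutoff = 8
truncated = coeffs[:cutoff] + [0.0] * (len(coeffs) - cutoff)
reconstructed = idct(truncated)

energy_kept = sum(c * c for c in truncated) / sum(c * c for c in coeffs)
max_error = max(abs(a - b) for a, b in zip(trajectory, reconstructed))
print(f"energy kept: {energy_kept:.6f}, max error: {max_error:.4f}")
```

Changing `cutoff` shows the trade-off directly: fewer coefficients mean fewer numbers to tokenize, at the cost of a larger reconstruction error.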
FAST (Frequency-space Action Sequence Tokenization) turns DCT compression into a tokenization pipeline.
The key insight is in Step 3, the scale-and-round step: after the DCT coefficients are scaled and rounded to integers, the coefficient matrix is sparse, with most entries equal to zero. The non-zero entries capture nearly all the important information. Step 4 then applies BPE (byte-pair encoding, the same technique used in language model tokenizers such as tiktoken) to compress common patterns in the coefficient sequence further.
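The stages before BPE can be sketched in a few lines. This is a sketch under stated assumptions, not the authors' implementation: it assumes the chunk is already normalized to roughly [-1, 1], uses a hypothetical `scale` of 10, and flattens the coefficients lowest-frequency-first across dimensions. The function name `fast_style_encode` is invented for this example, and the BPE stage is omitted.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal)
        )
        out.append(total * (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)))
    return out


def fast_style_encode(chunk, scale=10.0):
    """Encode a chunk given as a list of timesteps, each a list of action dims.

    Returns a flat list of integer DCT coefficients, lowest frequency first.
    """
    n_dims = len(chunk[0])
    # Per-dimension DCT over the time axis.
    per_dim = [dct([step[d] for step in chunk]) for d in range(n_dims)]
    # Scale and round: small (mostly high-frequency) coefficients become 0.
    quantized = [[int(round(scale * c)) for c in coeffs] for coeffs in per_dim]
    # Flatten frequency-major: every dimension's coefficient 0, then coefficient 1, ...
    return [quantized[d][k] for k in range(len(chunk)) for d in range(n_dims)]


# Hypothetical 50-step, 2-dimensional chunk of smooth motion, already normalized.
chunk = [
    [0.5 * math.sin(math.pi * t / 50), 0.3 * math.cos(math.pi * t / 50)]
    for t in range(50)
]
flat = fast_style_encode(chunk)
nonzero = sum(1 for v in flat if v != 0)
print(f"{nonzero} non-zero of {len(flat)} coefficients")
```

The `scale` value controls the compression trade-off in this sketch: a smaller scale rounds more coefficients to zero, while a larger one preserves more detail and leaves more non-zero entries for BPE to handle.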
FAST with manually-chosen parameters works well, but the authors go further: they train a universal tokenizer called FAST+ on 1 million real robot action trajectories spanning diverse robot types, action spaces, and control frequencies.
FAST+ learns a single BPE vocabulary shared across all these different robots. Once trained, it works as a black-box tokenizer: feed it any robot's action sequence, and it produces efficient tokens without any per-robot tuning.
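To show what "learning a BPE vocabulary" means for sequences of quantized coefficients, here is a toy merge learner. It is a generic byte-pair-encoding sketch, not the FAST+ tokenizer: the corpus is three made-up coefficient sequences, and the helper names `learn_bpe_merges`, `apply_merge`, and `encode` are hypothetical.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(pair)  # the tuple itself stands in for the new token
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe_merges(sequences, num_merges):
    """Greedily learn up to `num_merges` merges of the most frequent adjacent pair."""
    sequences = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would not compress
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def encode(seq, merges):
    """Apply learned merges in the order they were learned."""
    seq = list(seq)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Hypothetical quantized coefficient sequences with long runs of zeros.
corpus = [
    [23, -1, 11, 0, 2, 0, 0, 0, 0, 0, 0, 0],
    [21, 0, 9, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [23, -1, 10, 0, 2, 0, 0, 0, 0, 0, 0, 0],
]
merges = learn_bpe_merges(corpus, num_merges=5)
encoded = encode(corpus[0], merges)
print(len(corpus[0]), "->", len(encoded))
```

Runs of zeros are the first thing such a learner merges, which is why sparse coefficient sequences shrink so much under BPE.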
| Dimension | Coverage |
|---|---|
| Robot types | Single-arm, dual-arm, mobile manipulators, humanoids |
| Action spaces | Joint positions, joint velocities, Cartesian deltas, gripper commands |
| Control frequencies | 5 Hz to 200 Hz |
| Training data | 1M real trajectories from diverse sources |
FAST doesn't just enable new tasks — it makes training dramatically faster. The key: fewer tokens per action chunk means fewer autoregressive prediction steps per training example.
With naive binning on a 50-step action chunk across 7 dimensions, each example takes 50 × 7 = 350 tokens. With FAST, it takes roughly 50-70 tokens, which is 5-7x fewer. Since autoregressive training cost grows at least linearly with sequence length, FAST training is up to 5x faster.
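The arithmetic behind these numbers is short enough to check directly. In the sketch below, the FAST token counts are simply the approximate 50-70 range quoted above, not something computed by a tokenizer, and `naive_token_count` is a hypothetical helper name.

```python
def naive_token_count(chunk_steps, action_dims):
    """Naive binning emits one token per timestep per action dimension."""
    return chunk_steps * action_dims


naive = naive_token_count(chunk_steps=50, action_dims=7)
# Approximate FAST token counts for the same chunk, as quoted in the text.
fast_low, fast_high = 50, 70

print(f"naive: {naive} tokens")
print(f"reduction: {naive / fast_high:.0f}x to {naive / fast_low:.0f}x")
```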
But the speedup isn't just from fewer tokens. Each FAST token carries more information than a naive bin token. A single FAST token might encode the low-frequency shape of an entire action dimension, while a naive token encodes just one value at one timestep. The autoregressive objective becomes more meaningful — each token prediction contributes real information.
The paper demonstrates FAST on several challenging benchmarks:
When FAST tokenization is integrated with the pi-0 VLA architecture, the resulting pi-0-FAST model matches the performance of the original pi-0 (which uses flow matching / diffusion) across dexterous manipulation tasks — while training 5x faster. This is remarkable: FAST makes autoregressive VLAs competitive with diffusion VLAs for the first time on high-frequency tasks.
The DROID dataset is a large-scale "in-the-wild" manipulation dataset. Previous autoregressive VLAs (including OpenVLA) couldn't successfully train on DROID due to its high-frequency control (15 Hz with action chunks). FAST enables the first successful language-conditioned generalist policy trained on DROID that generalizes zero-shot to unseen environments.
FAST plays a critical role in the pi-0 model family's two-stage training recipe: autoregressive pre-training on FAST tokens, followed by post-training with flow matching.
FAST enables the pre-training stage to work. Without it, the autoregressive pre-training would fail on high-frequency data. With it, the model can learn from ALL data sources using standard next-token prediction — the same objective that works for language models.
This is the secret ingredient behind pi-0.5's success: FAST tokenization during pre-training lets the model absorb knowledge from diverse data sources, and flow matching during post-training gives it the precision for dexterous tasks.
FAST sits at the intersection of three big ideas:
| Domain | Parallel |
|---|---|
| Language | BPE (byte-pair encoding) compresses text into meaningful subword tokens. FAST compresses actions into meaningful frequency tokens. |
| Images | JPEG uses DCT to compress images by dropping high-frequency details. FAST does the same for robot actions. |
| Audio | MP3 uses frequency-domain compression. Robot actions at 50 Hz are essentially a low-bandwidth audio signal. |
The paper makes a broader point: tokenization matters as much for robots as it does for language. BPE was a turning point for language models. FAST may be the same for robot foundation models.