Physical Intelligence + UC Berkeley, 2025

FAST: Efficient Action Tokenization

Why do VLAs fail at dexterous tasks? Because naive action tokenization destroys high-frequency information. FAST fixes this with compression-based tokenization using the discrete cosine transform, producing 5-10x fewer tokens and enabling up to 5x faster training.

Prerequisites: VLA basics + Tokenization concepts
9 Chapters, 4+ Simulations

Chapter 0: The Problem

Autoregressive VLAs like RT-2 and OpenVLA work by predicting robot actions as discrete tokens — just like language models predict word tokens. But how do you convert a continuous action (like "move joint 3 by 0.0342 radians") into a discrete token?

The standard approach is naive binning: divide each action dimension into 256 equally-spaced bins. Joint angle 0.0342 maps to bin 137. The model predicts "137" as a token. Simple.
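Naive binning can be sketched in a few lines. This is a minimal illustration, not the exact scheme used by any particular VLA: the range [-1, 1], the function names, and decoding to the bin center are all assumptions made here, and the bin index for any given value depends on the range chosen.

```python
def bin_action(value, low=-1.0, high=1.0, n_bins=256):
    """Map a continuous value to one of n_bins equal-width bins over [low, high]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)  # value == high falls into the last bin


def unbin_action(index, low=-1.0, high=1.0, n_bins=256):
    """Decode a token back to a continuous value (the center of its bin)."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


# Assumed range [-1, 1]; a different range gives a different bin index.
token = bin_action(0.0342)
print(token, unbin_action(token))
```

The round trip loses at most half a bin width, which is why 256 bins is usually precise enough for a single value. The problem described next is not precision but redundancy across timesteps.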

This works fine for low-frequency control (5-10 Hz, like RT-2's pick-and-place). But it completely breaks down for high-frequency dexterous tasks (50 Hz, like folding laundry). The paper demonstrates this with a striking experiment: as control frequency increases, naive-tokenized VLAs stop learning entirely — they just copy the first action token over and over.

The failure is fundamental, not a bug: At high frequencies, consecutive actions are nearly identical (the robot barely moves in 20 ms). So consecutive tokens are nearly identical too. The autoregressive model learns that "the next token is probably the same as the last one" and gets stuck in this trivial solution. The marginal information per token approaches zero.
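The effect is easy to reproduce on a toy signal. The sketch below is an illustration under assumed settings (a slow sine-wave motion, a [-1, 1] range, 256 bins), not the paper's experiment: it samples the same motion at different control frequencies and measures how often a token simply repeats the previous one.

```python
import math


def tokenize(values, low=-1.0, high=1.0, n_bins=256):
    """Per-timestep uniform binning of a sequence of continuous values."""
    tokens = []
    for v in values:
        clipped = min(max(v, low), high)
        tokens.append(min(int((clipped - low) / (high - low) * n_bins), n_bins - 1))
    return tokens


def repeat_fraction(control_hz, duration_s=2.0, motion_hz=0.25, amplitude=0.1):
    """Fraction of tokens equal to the previous token for a slow sine motion."""
    n = int(control_hz * duration_s)
    values = [amplitude * math.sin(2 * math.pi * motion_hz * i / control_hz)
              for i in range(n)]
    tokens = tokenize(values)
    repeats = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return repeats / (len(tokens) - 1)


# Same underlying motion, sampled at three hypothetical control rates.
for hz in (5, 50, 500):
    print(hz, round(repeat_fraction(hz), 2))
```

The repeat fraction is also the accuracy of a model that only ever copies the previous token, so the higher it climbs, the less there is to gain from learning anything else.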
[Interactive simulation: High-Frequency Tokenization Collapse. A control-frequency slider (default 10 Hz) shows that naive binning works at low Hz, while at high Hz consecutive tokens become identical and the model collapses to copying.]
Why does naive per-timestep binning fail at high control frequencies?

Chapter 1: Why Binning Fails

To understand the failure deeply, consider the autoregressive training objective. The model is trained to predict the next token $T_i$ given all previous tokens $T_{1:i-1}$. The learning signal is proportional to the marginal information content of $T_i$ given $T_{1:i-1}$.

With naive per-timestep binning at high frequencies, the model sees the same token repeated over and over. Predicting "same as before" achieves near-zero loss, so the model has no incentive to learn the actual underlying motion: the tokenization has destroyed the signal.
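One way to make the "near-zero loss" claim concrete is the following sketch. The symbols $\epsilon$ (the probability that a token differs from its predecessor) and $K$ (the number of bins) are introduced here for illustration and are not taken from the paper.

```latex
\mathcal{L}(\theta) = -\sum_{i=1}^{n} \log p_\theta\left(T_i \mid T_{1:i-1}\right)

% Assume P(T_i = T_{i-1}) = 1 - \epsilon. A "copy" predictor q that puts mass
% 1 - \epsilon on T_{i-1} and spreads \epsilon uniformly over the other K - 1 bins has
\mathbb{E}\left[-\log q\left(T_i \mid T_{1:i-1}\right)\right]
  = -(1-\epsilon)\log(1-\epsilon) - \epsilon\log\epsilon + \epsilon\log(K-1)
  \;\longrightarrow\; 0 \quad \text{as } \epsilon \to 0
```

As the control frequency rises, $\epsilon$ shrinks, so a model that has learned nothing about the motion already sits close to the minimum of the objective.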

The analogy from language: Imagine tokenizing English text one character at a time, but the text is "aaaaaaaaabbbbbbbbb." The model learns "predict the same letter" and stops. This is exactly what happens with high-frequency action data — the tokenization creates long runs of identical tokens.

The paper demonstrates this with a controlled experiment: train the same model on the same underlying signal at different sampling rates (25 Hz to 800 Hz). At low rates, it works. At high rates, it collapses to copying. The underlying signal hasn't changed; only the sampling rate has, and with it the redundancy between consecutive tokens.

In the paper's controlled experiment, what changed as sampling rate increased from 25 Hz to 800 Hz?

Chapter 2: DCT Compression

The solution comes from signal processing. The Discrete Cosine Transform (DCT) decomposes a signal into a sum of cosine waves at different frequencies. Low-frequency components capture the overall shape; high-frequency components capture sharp details.

Robot actions are smooth signals — the robot's joints don't teleport. This means most of the information is in the low-frequency DCT coefficients. The high-frequency coefficients are near zero and can be dropped.

This is the same principle behind JPEG image compression: pixels vary smoothly, so most DCT coefficients are near zero, so images compress well. Robot action chunks are 1D time series that are often even smoother than natural images, which is why they compress so well.
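The energy-compaction claim can be checked numerically. The sketch below writes the orthonormal DCT-II out from its definition (so it needs no external library) and applies it to a hypothetical smooth joint trajectory; the trajectory and the function names are assumptions made for illustration.

```python
import math


def dct(x):
    """Orthonormal DCT-II, written out directly from its definition."""
    n = len(x)
    coeffs = []
    for k in range(n):
        total = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def low_frequency_energy(x, keep):
    """Share of total signal energy held by the first `keep` DCT coefficients."""
    coeffs = dct(x)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:keep]) / total


n = 50
# Hypothetical smooth joint trajectory: a slow arc plus a small drift.
trajectory = [0.5 * math.sin(math.pi * (i + 0.5) / n) + 0.2 * (i + 0.5) / n
              for i in range(n)]

for keep in (1, 3, 5, 10):
    print(keep, round(low_frequency_energy(trajectory, keep), 5))
```

For a signal like this, a handful of coefficients out of 50 holds almost all of the energy. A jerky or noisy trajectory would spread its energy further up the spectrum and compress less well.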

[Interactive simulation: DCT: From Time Domain to Frequency Domain. A smooth action trajectory has most of its energy in low frequencies; a "Keep top N coeffs" slider (default 5) shows how many coefficients are actually needed.]
Compression ratio: A 50-timestep action chunk across 7 dimensions = 350 values with naive tokenization. After DCT compression, nearly the same information is captured in ~30-50 non-zero coefficients. That's roughly 7-12x compression, and correspondingly fewer tokens for the autoregressive model to predict.
Why does DCT work well for compressing robot actions?

Chapter 3: The FAST Algorithm

FAST (Frequency-space Action Sequence Tokenization) turns DCT compression into a tokenization pipeline:

Step 1: Normalize. Scale each action dimension to [-1, 1] using the 1st and 99th percentiles (robust to outliers).
Step 2: DCT. Apply the discrete cosine transform to each action dimension separately.
Step 3: Scale & Round. Multiply by a scaling factor and round to integers. Most coefficients become 0 (sparse!).
Step 4: Flatten & BPE. Serialize the sparse matrix and apply byte-pair encoding to merge common patterns into tokens.
Result: ~30-70 discrete tokens (vs 350+ with naive binning), each with higher information content.
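Steps 1 to 3 and the flattening half of Step 4 can be sketched as follows. This is a simplified illustration, not the reference implementation. It makes several assumptions: the percentiles are computed from the chunk itself rather than from a dataset, the scale of 10 is an arbitrary choice, the flattening order (low frequencies first, interleaving dimensions) is one plausible serialization, and all function names are hypothetical. BPE is left out here.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(s * math.sqrt((1.0 if k == 0 else 2.0) / n))
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out


def quantile(values, q):
    """Nearest-rank quantile; a simple stand-in for a proper percentile routine."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


def encode(chunk, scale=10.0):
    """chunk: list of timesteps, each a list of action dimensions."""
    stats, rows = [], []
    for series in zip(*chunk):  # one sequence per action dimension
        lo, hi = quantile(series, 0.01), quantile(series, 0.99)
        span = (hi - lo) or 1.0
        normalized = [2.0 * (v - lo) / span - 1.0 for v in series]  # Step 1
        coeffs = dct(normalized)                                     # Step 2
        rows.append([round(scale * c) for c in coeffs])              # Step 3
        stats.append((lo, span))
    # Step 4 (flatten only): low frequencies first, interleaving dimensions.
    flat = [rows[d][k] for k in range(len(rows[0])) for d in range(len(rows))]
    return flat, stats


def decode(flat, stats, scale=10.0):
    """Invert encode(): un-flatten, rescale, inverse DCT, de-normalize."""
    n_dims = len(stats)
    dims = []
    for d, (lo, span) in enumerate(stats):
        normalized = idct([c / scale for c in flat[d::n_dims]])
        dims.append([(v + 1.0) / 2.0 * span + lo for v in normalized])
    return [list(step) for step in zip(*dims)]


n = 50
# Hypothetical 2-dimensional chunk: a slow arc and a linear ramp.
chunk = [[0.3 * math.sin(math.pi * (i + 0.5) / n), 0.1 * (i + 0.5) / n]
         for i in range(n)]
flat, stats = encode(chunk)
recon = decode(flat, stats)
max_err = max(abs(a - b) for row, rec in zip(chunk, recon) for a, b in zip(row, rec))
print(len(flat), sum(1 for c in flat if c != 0), max_err)
```

The flattened sequence is mostly zeros with a few non-zero integers near the front. That is the kind of input on which byte-pair encoding does well.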

The key insight in Step 3: after DCT and scaling, the coefficient matrix is sparse, with most entries equal to zero. The non-zero entries capture nearly all of the important information. Step 4 uses BPE (the same technique used in language model tokenizers like tiktoken) to further compress common patterns in the coefficient sequence.
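To show what BPE does to a sparse coefficient sequence, here is a toy version that operates directly on integers. It is a sketch of the general byte-pair-encoding idea, not the tokenizer used by FAST: merged symbols are represented as nested tuples instead of vocabulary IDs, and the two example sequences are made up.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with a single symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(pair)  # the merged symbol is the pair itself
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, n_merges):
    """Greedily merge the most frequent adjacent pair, up to n_merges times."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # nothing left that occurs more than once
            break
        merges.append(best)
        seqs = [apply_merge(s, best) for s in seqs]
    return merges


def encode(seq, merges):
    """Apply learned merges in the order they were learned."""
    seq = list(seq)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Made-up quantized coefficient sequences with long zero runs.
corpus = [
    [18, -41, -44, 0, 0, 0, 0, 0, 0, 0],
    [12, -3, 0, 0, 0, 0, 0, 0, 0, 0],
]
merges = learn_bpe(corpus, n_merges=3)
encoded = encode(corpus[0], merges)
print(merges)
print(len(corpus[0]), len(encoded))
```

The first merge learned is the zero pair, which is why long runs of zero coefficients cost very few tokens after BPE.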

The scaling factor is the key hyperparameter (the other is the BPE vocabulary size). It controls the tradeoff between lossiness and compression. Larger scaling = more non-zero coefficients = less lossy but more tokens. Smaller scaling = more zeros = more lossy but fewer tokens. The paper finds the sweet spot where reconstruction quality is excellent with 5-10x compression.
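The tradeoff can be seen on a single signal. The sketch below quantizes the DCT coefficients of a hypothetical, already-normalized trajectory at three scales and reports the number of non-zero coefficients alongside the RMS reconstruction error. The scale values and the signal are illustrative assumptions, not the paper's settings.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(s * math.sqrt((1.0 if k == 0 else 2.0) / n))
    return out


def quantize_stats(x, scale):
    """Return (non-zero coefficient count, RMS reconstruction error) for one scale."""
    coeffs = dct(x)
    quantized = [round(scale * c) for c in coeffs]
    nonzero = sum(1 for q in quantized if q != 0)
    # An orthonormal DCT preserves Euclidean norm, so the time-domain RMS error
    # equals the RMS of the coefficient errors; no inverse transform is needed.
    rms = math.sqrt(sum((c - q / scale) ** 2 for c, q in zip(coeffs, quantized)) / len(x))
    return nonzero, rms


n = 50
# Hypothetical normalized trajectory (values already within [-1, 1]).
signal = [math.sin(math.pi * (i + 0.5) / n) for i in range(n)]

results = {scale: quantize_stats(signal, scale) for scale in (1, 10, 100)}
for scale, (nonzero, rms) in results.items():
    print(scale, nonzero, round(rms, 4))
```

Each tenfold increase in scale keeps more coefficients and lowers the error, matching the direction of the tradeoff described above.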
What makes the DCT coefficient matrix sparse (mostly zeros) after the scale-and-round step?

Chapter 4: FAST+ Universal Tokenizer

FAST with manually chosen parameters works well, but the authors go further: they train a universal tokenizer called FAST+ on 1 million real robot action trajectories spanning diverse robot types, action spaces, and control frequencies.

FAST+ learns a single BPE vocabulary shared across all these different robots. Once trained, it works as a black-box tokenizer: feed it any robot's action sequence, and it produces efficient tokens without any per-robot tuning.

What FAST+ covers

Dimension | Coverage
Robot types | Single-arm, dual-arm, mobile manipulators, humanoids
Action spaces | Joint positions, joint velocities, Cartesian deltas, gripper commands
Control frequencies | 5 Hz to 200 Hz
Training data | 1M real trajectories from diverse sources
Like tiktoken for robots: Just as tiktoken is a universal text tokenizer that works for any language or domain, FAST+ is a universal action tokenizer that works for any robot or task. You don't need to design a new tokenizer for each robot — FAST+ handles it.
What is FAST+ and why is it useful?

Chapter 5: Training Speedup

FAST doesn't just enable new tasks — it makes training dramatically faster. The key: fewer tokens per action chunk means fewer autoregressive prediction steps per training example.

With naive binning on a 50-step action chunk across 7 dimensions: 350 tokens per example. With FAST: ~50-70 tokens. That's 5-7x fewer tokens. Since autoregressive training cost grows at least linearly with sequence length, shorter sequences make every training example cheaper, and the paper reports up to 5x faster training.
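The token arithmetic above is simple enough to write down directly. The helper names below are hypothetical, and the FAST token counts are the approximate figures quoted in the text, not measurements.

```python
def naive_token_count(horizon, action_dims):
    """Naive binning emits one token per timestep per action dimension."""
    return horizon * action_dims


def compression_ratio(horizon, action_dims, fast_tokens):
    """How many times fewer tokens FAST needs than naive binning."""
    return naive_token_count(horizon, action_dims) / fast_tokens


# 50-step chunk, 7 action dimensions, and the quoted ~50-70 FAST tokens.
for fast_tokens in (50, 70):
    print(fast_tokens, compression_ratio(50, 7, fast_tokens))
```

Under the linear-cost approximation this ratio is also the reduction in per-example training cost. The quadratic part of self-attention would make the saving on action tokens larger, not smaller.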

But the speedup isn't just from fewer tokens. Each FAST token carries more information than a naive bin token. A single FAST token might encode the low-frequency shape of an entire action dimension, while a naive token encodes just one value at one timestep. The autoregressive objective becomes more meaningful — each token prediction contributes real information.

The double win: FAST produces (1) fewer tokens AND (2) more informative tokens. Fewer tokens = faster training. More informative tokens = better learning signal. This is why FAST doesn't just match naive tokenization quality while being faster — it actually achieves BETTER quality on high-frequency tasks where naive tokenization fails entirely.
Why does FAST enable up to 5x faster VLA training?

Chapter 6: Results

The paper demonstrates FAST on several challenging benchmarks:

pi-0-FAST matches pi-0 (diffusion)

When FAST tokenization is integrated with the pi-0 VLA architecture, the resulting pi-0-FAST model matches the performance of the original pi-0 (which uses flow matching / diffusion) across dexterous manipulation tasks while training up to 5x faster. This is remarkable: FAST makes autoregressive VLAs competitive with diffusion VLAs for the first time on high-frequency tasks.

DROID dataset — first successful training

The DROID dataset is a large-scale "in-the-wild" manipulation dataset. Previous autoregressive VLAs (including OpenVLA) couldn't successfully train on DROID due to its high-frequency control (15 Hz with action chunks). FAST enables the first successful language-conditioned generalist policy trained on DROID that generalizes zero-shot to unseen environments.

[Chart: FAST vs Naive Binning: Training Efficiency]
First of its kind: The FAST-trained policy on DROID is the first language-conditioned generalist manipulation policy that can be evaluated zero-shot in completely unseen environments by simply prompting it in natural language.
What did FAST enable for the first time on the DROID dataset?

Chapter 7: Integration with pi-0

FAST plays a critical role in the pi-0 model family's two-stage training recipe:

Pre-training: Use FAST tokens (discrete, autoregressive) on ALL data sources: robot data, web data, language, subtask prediction.
Post-training: Switch to flow matching (continuous) for precise, high-frequency control on task-specific data.

FAST enables the pre-training stage to work. Without it, the autoregressive pre-training would fail on high-frequency data. With it, the model can learn from ALL data sources using standard next-token prediction — the same objective that works for language models.

This is the secret ingredient behind pi-0.5's success: FAST tokenization during pre-training lets the model absorb knowledge from diverse data sources, and flow matching during post-training gives it the precision for dexterous tasks.

FAST is the bridge: It connects the world of autoregressive language models (which need discrete tokens) with the world of robot control (which needs continuous precision). Pre-train with FAST tokens for breadth, post-train with flow matching for depth.
Why is FAST used in pre-training but flow matching in post-training?

Chapter 8: Connections

FAST sits at the intersection of three big ideas:

Domain | Parallel
Language | BPE (byte-pair encoding) compresses text into meaningful subword tokens. FAST compresses actions into meaningful frequency tokens.
Images | JPEG uses DCT to compress images by dropping high-frequency details. FAST does the same for robot actions.
Audio | MP3 uses frequency-domain compression. Robot actions at 50 Hz are essentially a low-bandwidth audio signal.

The paper makes a broader point: tokenization matters as much for robots as it does for language. BPE was a turning point for language models. FAST may be the same for robot foundation models.

Related lessons: pi-0, pi-0.5, Gleams: Transformer, Gleams: VLA
"A good choice of tokenization can be critical to the performance of sequence models."
— FAST paper, echoing the wisdom of the NLP tokenization literature