Physical Intelligence + UC Berkeley, 2025

FAST: Efficient Action Tokenization

Why do VLAs fail at dexterous tasks? Because naive action tokenization breaks down on high-frequency control data. FAST fixes this with compression-based tokenization built on the discrete cosine transform, producing 5-10x fewer tokens and enabling up to 5x faster training.

Prerequisites: VLA basics + Tokenization concepts

Chapter 0: The Problem

Autoregressive VLAs like RT-2 and OpenVLA work by predicting robot actions as discrete tokens — just like language models predict word tokens. But how do you convert a continuous action (like "move joint 3 by 0.0342 radians") into a discrete token?

The standard approach is naive binning: divide each action dimension into 256 equally-spaced bins. With a range of [-1, 1], for example, joint angle 0.0342 falls in bin 132. The model predicts "132" as a token. Simple.
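Here is a minimal sketch of per-dimension binning. The [-1, 1] range and the function names are assumptions made for illustration; the bin index you get for any value depends entirely on the range chosen for that dimension.

```python
def bin_action(value, low=-1.0, high=1.0, n_bins=256):
    """Map a continuous value to one of n_bins equal-width bins over [low, high]."""
    clipped = min(max(value, low), high)
    frac = (clipped - low) / (high - low)
    # The top edge (frac == 1.0) is folded into the last bin.
    return min(int(frac * n_bins), n_bins - 1)


def unbin_action(token, low=-1.0, high=1.0, n_bins=256):
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / n_bins
    return low + (token + 0.5) * width


token = bin_action(0.0342)
print(token, unbin_action(token))
```

Decoding returns the bin centre, so the round trip is off by at most half a bin width (about 0.004 with these settings).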

This works fine for low-frequency control (5-10 Hz, like RT-2's pick-and-place). But it completely breaks down for high-frequency dexterous tasks (50 Hz, like folding laundry). The paper demonstrates this with a striking experiment: as control frequency increases, naive-tokenized VLAs stop learning entirely — they just copy the first action token over and over.

The failure is fundamental, not a bug: At high frequencies, consecutive actions are nearly identical (the robot barely moves in 20ms). So consecutive tokens are nearly identical too. The autoregressive model learns that "the next token is probably the same as the last one" — and gets stuck in this trivial solution. The marginal information per token approaches zero.

High-Frequency Tokenization Collapse

Drag the frequency slider. At low Hz, naive binning works. At high Hz, consecutive tokens become identical and the model collapses to copying.

Why does naive per-timestep binning fail at high control frequencies?

Consecutive tokens become nearly identical: marginal information per token approaches zero, causing the model to just copy the previous token
The vocabulary becomes too large
The bins are too wide

Chapter 1: Why Binning Fails

To understand the failure deeply, consider the autoregressive training objective. The model is trained to predict the next token T_i given all previous tokens T_{1:i-1}. The learning signal is proportional to the marginal information content of T_i given T_{1:i-1}.

With naive per-timestep binning at high frequencies:

Action at time t: a_t = [0.5000, 0.3000, ...]
Action at time t+1: a_{t+1} = [0.5002, 0.3001, ...]
Both bin to the same token: [128, 77, ...]

The model sees the same token repeated. Predicting "same as before" achieves near-zero loss. The model has no incentive to learn the actual underlying motion — the tokenization has destroyed the signal.

The analogy from language: Imagine tokenizing English text one character at a time, but the text is "aaaaaaaaabbbbbbbbb." The model learns "predict the same letter" and stops. This is exactly what happens with high-frequency action data — the tokenization creates long runs of identical tokens.

The paper demonstrates this with a controlled experiment: train the same model on the same data at different sampling rates (25 Hz to 800 Hz). At low rates, it works. At high rates, it collapses to copying. The underlying signal hasn't changed, only how densely it is sampled, so the failure comes from how the samples are tokenized.
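You can see the effect without training anything. The sketch below is not the paper's experiment: it takes one fixed smooth signal (a made-up 1 Hz sine), bins it at several sampling rates, and counts how often a token simply repeats its predecessor. The signal, the [-1, 1] range and the function names are all assumptions for illustration.

```python
import math


def bin_token(value, n_bins=256):
    """Uniformly bin a value in [-1, 1] into n_bins tokens."""
    frac = (min(max(value, -1.0), 1.0) + 1.0) / 2.0
    return min(int(frac * n_bins), n_bins - 1)


def repeat_fraction(samples_per_second, duration=1.0):
    """Fraction of tokens equal to their predecessor for one fixed smooth signal."""
    n = int(samples_per_second * duration)
    # Same underlying 1 Hz sine every time; only the sampling density changes.
    tokens = [bin_token(0.8 * math.sin(2 * math.pi * i / samples_per_second))
              for i in range(n)]
    repeats = sum(1 for prev, cur in zip(tokens, tokens[1:]) if prev == cur)
    return repeats / (len(tokens) - 1)


for rate in (25, 100, 400, 800):
    print(f"{rate:4d} samples/s: {repeat_fraction(rate):.2f} of tokens repeat the previous one")
```

The repeat fraction climbs as the sampling rate goes up, which is the "copy the last token" shortcut becoming more and more accurate.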

In the paper's controlled experiment, what changed as sampling rate increased from 25 Hz to 800 Hz?

The data became more complex
The model capacity was reduced
Nothing about the data changed, only the sampling rate, proving the failure is purely a tokenization problem

Chapter 2: DCT Compression

The solution comes from signal processing. The Discrete Cosine Transform (DCT) decomposes a signal into a sum of cosine waves at different frequencies. Low-frequency components capture the overall shape; high-frequency components capture sharp details.

Robot actions are smooth signals — the robot's joints don't teleport. This means most of the information is in the low-frequency DCT coefficients. The high-frequency coefficients are near zero and can be dropped.

This is the exact same principle behind JPEG image compression: pixels vary smoothly, so most DCT coefficients are near zero, so images compress well. Robot action chunks suit the same trick: each action dimension is a smooth 1D time series.
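To make the "most energy is in low frequencies" claim concrete, here is a small sketch that computes an orthonormal DCT-II by hand (plain Python, O(n^2), fine for short chunks) on a made-up smooth trajectory. The trajectory and the choice of "first 5 coefficients" are illustrative assumptions.

```python
import math


def dct(signal):
    """Orthonormal DCT-II, written out directly from its definition."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


# A smooth, made-up joint trajectory: 50 timesteps of a slow reach from 0 to 1.
n_steps = 50
trajectory = [0.5 * (1 - math.cos(math.pi * t / (n_steps - 1))) for t in range(n_steps)]

coeffs = dct(trajectory)
energy = [c * c for c in coeffs]
low_freq_share = sum(energy[:5]) / sum(energy)
print(f"share of energy in the first 5 of {n_steps} coefficients: {low_freq_share:.6f}")
```

Because the transform is orthonormal, the total energy of the coefficients equals the total energy of the signal, so the share printed here is a fair measure of how much the first few coefficients capture.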

DCT: From Time Domain to Frequency Domain

A smooth action trajectory has most energy in low frequencies. Drag the cutoff to see how many coefficients you actually need.

Compression ratio: A 50-timestep action chunk across 7 dimensions = 350 values with naive tokenization. After DCT compression, nearly the same information is captured in ~30-50 non-zero coefficients. That's roughly 7-12x compression, and correspondingly fewer tokens for the autoregressive model to predict.
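The trade-off between how many coefficients you keep and how well you reconstruct can be sketched as follows. This keeps the lowest-frequency coefficients of a made-up 50-step signal and inverts the transform; the signal and the cutoffs are assumptions, and real action chunks will give different numbers.

```python
import math


def dct_basis(n):
    """Rows are the orthonormal DCT-II basis vectors for length-n signals."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / n) * math.cos(math.pi * (i + 0.5) * k / n)
             for i in range(n)] for k in range(n)]


def reconstruct(signal, keep):
    """Keep the `keep` lowest-frequency DCT coefficients, zero the rest, invert."""
    n = len(signal)
    basis = dct_basis(n)
    coeffs = [sum(b * x for b, x in zip(row, signal)) for row in basis]
    # The basis is orthonormal, so the inverse transform is its transpose.
    return [sum(coeffs[k] * basis[k][i] for k in range(keep)) for i in range(n)]


def rms_error(signal, keep):
    """Root-mean-square difference between the signal and its truncated reconstruction."""
    approx = reconstruct(signal, keep)
    return math.sqrt(sum((a - s) ** 2 for a, s in zip(approx, signal)) / len(signal))


n_steps = 50
signal = [math.sin(2 * math.pi * t / n_steps) + 0.3 * math.sin(6 * math.pi * t / n_steps)
          for t in range(n_steps)]
for keep in (2, 5, 10, 50):
    print(f"keep {keep:2d} of {n_steps}: {n_steps / keep:4.1f}x fewer values, "
          f"rms error {rms_error(signal, keep):.4f}")
```

Keeping all 50 coefficients reconstructs the signal exactly (up to floating-point error); every coefficient you drop adds its energy to the error.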

Why does DCT work well for compressing robot actions?

Robot actions are smooth signals: most energy is in low-frequency DCT coefficients, so high-frequency coefficients can be dropped with minimal information loss
DCT is faster to compute than FFT
DCT produces integer outputs

Chapter 3: The FAST Algorithm

FAST (Frequency-space Action Sequence Tokenization) turns DCT compression into a tokenization pipeline:

Step 1: Normalize

Scale each action dimension to [-1, 1] using 1st/99th percentile (robust to outliers)

↓

Step 2: DCT

Apply discrete cosine transform to each action dimension separately

↓

Step 3: Scale & Round

Multiply by a scaling factor, round to integers. Most coefficients become 0 (sparse!)

↓

Step 4: Flatten & BPE

Serialize the sparse matrix, apply byte-pair encoding to merge common patterns into tokens

↓

Result

~30-70 discrete tokens (vs 350+ with naive binning) with higher information content each
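Steps 1 to 3 and the flattening half of Step 4 can be sketched in a few lines of plain Python. Treat this as an illustration under stated assumptions, not the reference implementation: the percentiles are computed from the chunk itself for brevity (in practice they would come from a training set), `scale=10` is just an example value, and the low-frequency-first flattening order is one reasonable choice. BPE is left out here.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of one action dimension."""
    n = len(signal)
    return [math.sqrt((1.0 if k == 0 else 2.0) / n)
            * sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
            for k in range(n)]


def normalize(column):
    """Step 1: map the 1st/99th percentiles (nearest rank) of a dimension to [-1, 1]."""
    ordered = sorted(column)
    lo = ordered[round(0.01 * (len(ordered) - 1))]
    hi = ordered[round(0.99 * (len(ordered) - 1))]
    span = (hi - lo) or 1.0  # guard against a constant dimension
    return [2 * (x - lo) / span - 1 for x in column]


def fast_coefficients(chunk, scale=10):
    """Steps 1-3 plus flattening; `chunk` is a list of per-dimension time series."""
    # Steps 2-3: DCT each dimension separately, then scale and round to integers.
    quantized = [[round(scale * c) for c in dct(normalize(dim))] for dim in chunk]
    # Step 4 (first half): flatten lowest frequencies first, interleaving dimensions.
    n_steps = len(chunk[0])
    return [quantized[d][k] for k in range(n_steps) for d in range(len(chunk))]


# Made-up chunk: 7 smooth dimensions, 50 timesteps each.
n_steps, n_dims = 50, 7
chunk = [[math.sin(2 * math.pi * (d + 1) * t / (4 * n_steps)) for t in range(n_steps)]
         for d in range(n_dims)]
flat = fast_coefficients(chunk)
nonzero = sum(1 for v in flat if v != 0)
print(f"{len(flat)} coefficients, {nonzero} non-zero")
```

Putting low frequencies first means the non-zero values cluster at the start of the sequence and the tail is mostly zeros, which is exactly the kind of repetition a BPE pass can squeeze out.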

The key insight in Step 3: after DCT and scaling, the coefficient matrix is sparse — most entries are zero. The non-zero entries capture all the important information. Step 4 uses BPE (the same technique used in language model tokenizers like tiktoken) to further compress common patterns in the coefficient sequence.
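To see what BPE buys you on a sparse integer sequence, here is a toy greedy BPE. It is a teaching sketch, not the tokenizer library used in practice, and the coefficient values are made up. Merged symbols are represented as hypothetical `("merged", i)` tuples.

```python
from collections import Counter


def merge_pair(seq, pair, new_symbol):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_symbol`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(seq, n_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair."""
    merges = []
    for _ in range(n_merges):
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # nothing left that occurs more than once
        new_symbol = ("merged", len(merges))
        merges.append((pair, new_symbol))
        seq = merge_pair(seq, pair, new_symbol)
    return seq, merges


def decode(tokens, merges):
    """Undo the merges in reverse order to recover the original sequence."""
    for pair, new_symbol in reversed(merges):
        out = []
        for t in tokens:
            out.extend(pair if t == new_symbol else [t])
        tokens = out
    return tokens


# Hypothetical flattened coefficients: a few non-zeros followed by runs of zeros.
coeffs = [19, -7, 3, 0, 0, 0, 0, 0, 12, 4, -1, 0, 0, 0, 0, 0]
tokens, merges = train_bpe(coeffs, n_merges=4)
print(len(coeffs), "->", len(tokens))  # 16 -> 8
```

The runs of zeros collapse into a single learned symbol each, and decoding is lossless: all the lossiness in the pipeline comes from the rounding step, not from BPE.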

The scaling factor is the main hyperparameter (the other is the BPE vocabulary size). It controls the tradeoff between lossiness and compression. Larger scaling = more non-zero coefficients = less lossy but more tokens. Smaller scaling = more zeros = more lossy but fewer tokens. The paper finds a sweet spot where reconstruction quality is excellent with 5-10x compression.
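The tradeoff is easy to measure on a toy signal. The sketch below scales and rounds the DCT coefficients of a made-up sine wave at a few example scaling factors, then inverts the transform and reports the number of non-zero coefficients and the reconstruction error. The signal and the scale values are assumptions for illustration.

```python
import math


def dct_basis(n):
    """Rows are the orthonormal DCT-II basis vectors for length-n signals."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / n) * math.cos(math.pi * (i + 0.5) * k / n)
             for i in range(n)] for k in range(n)]


def quantize_roundtrip(signal, scale):
    """Scale-and-round the DCT coefficients, then invert. Returns (non-zeros, rms error)."""
    n = len(signal)
    basis = dct_basis(n)
    quantized = [round(scale * sum(b * x for b, x in zip(row, signal))) for row in basis]
    # Undo the scaling and apply the inverse (transpose) transform.
    recon = [sum(quantized[k] / scale * basis[k][i] for k in range(n)) for i in range(n)]
    rms = math.sqrt(sum((r - s) ** 2 for r, s in zip(recon, signal)) / n)
    return sum(1 for q in quantized if q != 0), rms


signal = [math.sin(2 * math.pi * t / 50) for t in range(50)]
for scale in (1, 10, 100):
    nonzeros, rms = quantize_roundtrip(signal, scale)
    print(f"scale {scale:3d}: {nonzeros:2d} non-zero coefficients, rms error {rms:.4f}")
```

Rounding moves each coefficient by at most 0.5 / scale, so with an orthonormal transform the rms reconstruction error is bounded by the same amount: raising the scale tightens the bound but leaves more coefficients non-zero.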

What makes the DCT coefficient matrix sparse (mostly zeros) after the scale-and-round step?

Smooth action signals have most energy in low-frequency coefficients: high-frequency coefficients are near zero and round to exactly zero
The scaling factor forces all coefficients to zero
BPE removes the coefficients

Chapter 4: FAST+ Universal Tokenizer

FAST with manually-chosen parameters works well, but the authors go further: they train a universal tokenizer called FAST+ on 1 million real robot action trajectories spanning diverse robot types, action spaces, and control frequencies.

FAST+ learns the optimal BPE vocabulary across all these different robots. Once trained, it works as a black-box tokenizer — feed it any robot's action sequence, and it produces efficient tokens without any per-robot tuning.

What FAST+ covers

Dimension	Coverage
Robot types	Single-arm, dual-arm, mobile manipulators, humanoids
Action spaces	Joint positions, joint velocities, Cartesian deltas, gripper commands
Control frequencies	5 Hz to 200 Hz
Training data	1M real trajectories from diverse sources

Like tiktoken for robots: Just as tiktoken is a universal text tokenizer that works for any language or domain, FAST+ is a universal action tokenizer that works for any robot or task. You don't need to design a new tokenizer for each robot — FAST+ handles it.

What is FAST+ and why is it useful?

A universal action tokenizer trained on 1M trajectories that works as a black box for any robot without per-robot tuning
A faster version of the FAST algorithm
A new robot architecture

Chapter 5: Training Speedup

FAST doesn't just enable new tasks — it makes training dramatically faster. The key: fewer tokens per action chunk means fewer autoregressive prediction steps per training example.

With naive binning on a 50-step action chunk across 7 dimensions: 350 tokens per example. With FAST: ~50-70 tokens. That's 5-7x fewer tokens. Since autoregressive training cost grows at least linearly with sequence length, shorter sequences make each training example cheaper; the paper reports training that is up to 5x faster.
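The token arithmetic is simple enough to check directly. This sketch only counts action tokens under the figures quoted above; it says nothing about the rest of the sequence (images, language) or about wall-clock time.

```python
def naive_token_count(n_steps, n_dims):
    """Per-timestep binning emits one token per dimension per timestep."""
    return n_steps * n_dims


naive = naive_token_count(n_steps=50, n_dims=7)
for fast_tokens in (50, 70):  # the range quoted above for FAST
    ratio = naive / fast_tokens
    print(f"{naive} naive tokens vs {fast_tokens} FAST tokens: {ratio:.1f}x fewer")
```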

But the speedup isn't just from fewer tokens. Each FAST token carries more information than a naive bin token. A single FAST token might encode the low-frequency shape of an entire action dimension, while a naive token encodes just one value at one timestep. The autoregressive objective becomes more meaningful — each token prediction contributes real information.

The double win: FAST produces (1) fewer tokens AND (2) more informative tokens. Fewer tokens = faster training. More informative tokens = better learning signal. This is why FAST doesn't just match naive tokenization quality while being faster — it actually achieves BETTER quality on high-frequency tasks where naive tokenization fails entirely.

Why does FAST enable up to 5x faster VLA training?

Fewer tokens per action chunk (50-70 vs 350) means fewer autoregressive steps, and each token carries more information so the learning signal is stronger
FAST uses a smaller model
FAST uses less data

Chapter 6: Results

The paper demonstrates FAST on several challenging benchmarks:

pi-0-FAST matches pi-0 (diffusion)

When FAST tokenization is integrated with the pi-0 VLA architecture, the resulting pi-0-FAST model matches the performance of the original pi-0 (which uses flow matching / diffusion) across dexterous manipulation tasks — while training 5x faster. This is remarkable: FAST makes autoregressive VLAs competitive with diffusion VLAs for the first time on high-frequency tasks.

DROID dataset — first successful training

The DROID dataset is a large-scale "in-the-wild" manipulation dataset. Previous autoregressive VLAs (including OpenVLA) couldn't successfully train on DROID due to its high-frequency control (15 Hz with action chunks). FAST enables the first successful language-conditioned generalist policy trained on DROID that generalizes zero-shot to unseen environments.

FAST vs Naive Binning: Training Efficiency

First of its kind: The FAST-trained policy on DROID is the first language-conditioned generalist manipulation policy that can be evaluated zero-shot in completely unseen environments by simply prompting it in natural language.

What did FAST enable for the first time on the DROID dataset?

The first successful training of a language-conditioned generalist policy that generalizes zero-shot to unseen environments
Faster data collection
Better cameras

Chapter 7: Integration with pi-0

FAST plays a critical role in the pi-0 model family's two-stage training recipe:

Pre-training

Use FAST tokens (discrete, autoregressive) on ALL data sources — robot data, web data, language, subtask prediction

↓

Post-training

Switch to flow matching (continuous) for precise, high-frequency control on task-specific data

FAST enables the pre-training stage to work. Without it, the autoregressive pre-training would fail on high-frequency data. With it, the model can learn from ALL data sources using standard next-token prediction — the same objective that works for language models.

This is the secret ingredient behind pi-0.5's success: FAST tokenization during pre-training lets the model absorb knowledge from diverse data sources, and flow matching during post-training gives it the precision for dexterous tasks.

FAST is the bridge: It connects the world of autoregressive language models (which need discrete tokens) with the world of robot control (which needs continuous precision). Pre-train with FAST tokens for breadth, post-train with flow matching for depth.

Why is FAST used in pre-training but flow matching in post-training?

FAST's discrete tokens enable unified autoregressive training on ALL data types (robot + web + language); flow matching gives continuous precision for the final task
FAST is faster to compute
Flow matching doesn't work during pre-training

Chapter 8: Connections

FAST sits at the intersection of three big ideas:

Domain	Parallel
Language	BPE (byte-pair encoding) compresses text into meaningful subword tokens. FAST compresses actions into meaningful frequency tokens.
Images	JPEG uses DCT to compress images by dropping high-frequency details. FAST does the same for robot actions.
Audio	MP3 uses frequency-domain compression. Robot actions at 50 Hz are essentially a low-bandwidth audio signal.

The paper makes a broader point: tokenization matters as much for robots as it does for language. BPE was a turning point for language models. FAST may be the same for robot foundation models.

"A good choice of tokenization can be critical to the performance of sequence models."

— FAST paper, echoing the wisdom of the NLP tokenization literature
