Physical Intelligence, 2025

RL Token: Bootstrapping Online RL with VLAs

VLAs get you 90% of the way. The last 10% — sub-millimeter precision, speed beyond human teleoperation — requires practice. The RL Token lets a frozen VLA delegate precision to a tiny actor-critic that learns in minutes of real-world practice.

Prerequisites: VLA basics + Actor-critic RL
9 Chapters · 3+ Simulations

Chapter 0: The Last Millimeter

VLAs are impressive generalists. They can pick up objects, open drawers, fold towels. But ask them to insert a USB charger into a port — a task requiring sub-millimeter precision — and they struggle. The motions are slow, tentative, full of retries. They work, sometimes, but they're not reliable.

The problem isn't intelligence: the VLA understands the task. The problem is precision at the critical phase. When you're inserting a screw, the first 95% of the motion (reaching, aligning) is easy. The last 5% (threading the screw into the hole) requires sub-millimeter control that behavioral cloning from demonstrations can't consistently achieve.

You could fine-tune the entire VLA with RL, but that's expensive — billions of parameters to update, risk of catastrophic forgetting, and it takes days. What if you could keep the VLA frozen and just add a tiny, fast-learning module for precision?

The practical constraint: Real-world RL has a tight budget. Every episode takes real time. Hardware wears out. You might have a few hours — not days — to improve a task. Whatever method you use must be sample-efficient enough to show gains in minutes, not months.
Why do VLAs struggle with sub-millimeter precision tasks like screw insertion?

Chapter 1: The Key Idea

The RL Token (RLT) method creates a compact interface between the frozen VLA and a lightweight RL agent. The VLA processes the full observation (images, language, proprioception) and produces a compressed representation — the RL token. A small actor-critic network takes this token and outputs refined actions.

The analogy: Think of the VLA as a senior surgeon who understands the entire operation. The RL token is like the surgeon's verbal instructions to a junior resident who has fast, precise hands. The surgeon (VLA) provides broad understanding; the resident (actor-critic) provides fine motor precision. Together, they're better than either alone.

Two key design choices make this work:

  1. The VLA stays frozen. No risk of catastrophic forgetting. No expensive gradient computation through billions of parameters.
  2. The actor-critic is tiny (~1M parameters vs the VLA's billions). It can learn from a few hundred real-world episodes — minutes to hours of practice.
Why does RLT keep the VLA frozen instead of fine-tuning it with RL?

Chapter 2: The RL Token

The RL token is a compressed representation extracted from the VLA's internal activations. Specifically:

  1. Run the full observation through the frozen VLA
  2. An encoder takes the VLA's internal features and compresses them into a compact vector (the RL token)
  3. A decoder verifies the RL token retains useful information by reconstructing the VLA's action output

The encoder and decoder are trained via a simple autoencoding objective: compress the VLA's features, then reconstruct the VLA's actions. This ensures the RL token captures everything the VLA knows about the current situation — object positions, task progress, spatial relationships — in a compact form the actor-critic can use efficiently.
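The autoencoding objective described above can be sketched in a few lines. This is a minimal illustration, not the actual architecture: it assumes a purely linear encoder and decoder, a mean-squared-error reconstruction loss, and made-up dimensions (FEATURE_DIM, TOKEN_DIM, ACTION_DIM are hypothetical). The source does not specify any of these.

```python
import random

def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a stand-in for a real network initializer."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def reconstruction_loss(encoder_w, decoder_w, vla_features, vla_action):
    """MSE between the decoded action and the frozen VLA's own action."""
    rl_token = linear(encoder_w, vla_features)      # compress features -> RL token
    predicted_action = linear(decoder_w, rl_token)  # reconstruct the VLA's action
    errors = [(p - a) ** 2 for p, a in zip(predicted_action, vla_action)]
    return sum(errors) / len(errors), rl_token

# Hypothetical sizes, chosen only for illustration.
FEATURE_DIM, TOKEN_DIM, ACTION_DIM = 64, 8, 7

rng = random.Random(0)
encoder_w = init_matrix(TOKEN_DIM, FEATURE_DIM, rng)
decoder_w = init_matrix(ACTION_DIM, TOKEN_DIM, rng)

# Random placeholders for one sample of VLA features and the VLA's action.
vla_features = [rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]
vla_action = [rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]

loss, rl_token = reconstruction_loss(encoder_w, decoder_w, vla_features, vla_action)
print(f"token size: {len(rl_token)}, reconstruction loss: {loss:.4f}")
```

Training would minimize this loss over many demonstration frames. The bottleneck (TOKEN_DIM much smaller than FEATURE_DIM) is what forces the token to keep only what is needed to predict the VLA's action.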

[Figure: RL Token Architecture]
Why not just use the VLA's raw features? The VLA's internal representation has thousands of dimensions. Training an actor-critic on such a large input would be slow and sample-inefficient. The RL token compresses this to a much smaller vector that preserves the task-relevant information. This compression is what makes learning from minutes to hours of practice feasible.
What is the RL token?

Chapter 3: The Actor-Critic Head

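On top of the RL token, RLT trains a small actor-critic: an actor that refines the action for precision, and a critic that estimates value for the RL updates.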

A critical detail: the actor's output is anchored to the VLA's action via a regularization term. The actor learns to make small adjustments to the VLA's base action, not to generate actions from scratch.

Anchoring is the secret sauce. Without it, the small actor would learn a completely new policy from scratch — slow, sample-inefficient, and losing the VLA's knowledge. With anchoring, it starts from the VLA's behavior and makes targeted improvements. This is why RLT can learn in minutes.
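One way to picture anchoring is as a penalty added to the actor's loss. The sketch below is an assumption, not the method's published form: the source only says "a regularization term", so the squared-distance penalty, the weight `beta`, and the function name are all hypothetical.

```python
def anchored_actor_loss(q_value, actor_action, vla_action, beta=1.0):
    """Hypothetical anchored loss: -Q(s, a) + beta * ||a - a_vla||^2.

    Minimizing it raises the critic's value estimate while penalizing
    actions that drift far from the frozen VLA's proposal.
    """
    deviation = sum((a - b) ** 2 for a, b in zip(actor_action, vla_action))
    return -q_value + beta * deviation

vla_action = [0.10, -0.20, 0.05]
small_edit = [0.11, -0.19, 0.05]   # a slight refinement of the VLA's action
large_edit = [0.60, 0.40, -0.50]   # a very different action

# With equal critic estimates, staying close to the VLA gives the lower loss.
loss_small = anchored_actor_loss(1.0, small_edit, vla_action)
loss_large = anchored_actor_loss(1.0, large_edit, vla_action)
print(loss_small < loss_large)
```

Under this formulation a large deviation is only worthwhile when the critic predicts a correspondingly large gain, which is what keeps early exploration close to the VLA's behavior.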
Why is the actor anchored to the VLA's action output?

Chapter 4: Division of Labor

RLT creates an elegant division of labor:

| Component | Size | Role | Trainable? |
| --- | --- | --- | --- |
| VLA (frozen) | ~3B params | Perception, language understanding, coarse manipulation | No |
| Encoder | ~100K params | Compress VLA features → RL token | Yes (pre-trained) |
| Actor | ~500K params | Refine actions for precision | Yes (online RL) |
| Critic | ~500K params | Estimate value for RL updates | Yes (online RL) |

The frozen VLA handles the hard part — understanding what to do from images and language. The tiny actor-critic handles the precise part — exactly how to move at critical moments. Total trainable parameters during RL: ~1M. Compare this to fine-tuning the full 3B VLA.
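The size gap is easy to check with the approximate figures from the table above. The parameter counts are the rounded values given in this lesson, not exact model sizes.

```python
# Approximate sizes taken from the component table in this chapter.
VLA_PARAMS = 3_000_000_000
ACTOR_PARAMS = 500_000
CRITIC_PARAMS = 500_000

# Only the actor and critic are updated during online RL.
rl_trainable = ACTOR_PARAMS + CRITIC_PARAMS
fraction = rl_trainable / VLA_PARAMS

print(f"{rl_trainable:,} trainable vs {VLA_PARAMS:,} frozen")
print(f"fraction: {fraction:.4%} (about 1 in {VLA_PARAMS // rl_trainable:,})")
```

So the part updated on the robot is roughly one three-thousandth of the full model.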

What is the ratio of trainable RL parameters to frozen VLA parameters in RLT?

Chapter 5: Training

RLT training has two phases:

Phase 1: RL token extraction (offline, ~1 hour)

Train the encoder-decoder on existing demonstrations. Run each demo through the frozen VLA, extract internal features, train the encoder to compress them and the decoder to reconstruct the VLA's actions. After this phase, you have a working RL token that captures the VLA's knowledge.

Phase 2: Online RL (on-robot, minutes to hours)

Deploy the system on the real robot. The VLA processes observations, the encoder produces RL tokens, and the actor generates refined actions. After each episode, update the actor and critic using a sample-efficient RL algorithm (RLPD — RL with prior data).
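The data flow in a single control step can be sketched as follows. All three components are toy stand-ins (the real ones are neural networks). Passing the VLA's base action into the actor is an assumption made here to illustrate refinement; the source does not spell out the actor's exact inputs.

```python
def control_step(observation, vla, encoder, actor):
    """One control step: frozen VLA -> RL token -> refined action."""
    features, base_action = vla(observation)  # frozen, never updated
    rl_token = encoder(features)              # trained offline in Phase 1
    return actor(rl_token, base_action)       # updated by online RL

# Toy stand-ins so the sketch runs end to end.
def toy_vla(observation):
    features = [float(v) for v in observation]
    base_action = [0.5 * v for v in features[:2]]
    return features, base_action

def toy_encoder(features):
    # Compress the features to a one-dimensional "token".
    return [sum(features) / len(features)]

def toy_actor(rl_token, base_action):
    # Apply a small correction around the VLA's proposed action.
    return [a + 0.01 * rl_token[0] for a in base_action]

action = control_step([1.0, 2.0, 3.0], toy_vla, toy_encoder, toy_actor)
print(action)
```

Only the actor (and the critic, not shown) would receive gradient updates between episodes; the VLA and encoder are called purely for inference.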

The reward is sparse and simple: +1 if the task succeeded, 0 otherwise. No reward engineering needed.
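The sparse reward and the "prior data" idea can be sketched together. RLPD as published draws each training batch half from prior data and half from online experience (symmetric sampling). Whether RLT uses exactly that ratio is an assumption here, and the transition tuples below are simplified placeholders.

```python
import random

def sparse_reward(task_succeeded):
    """+1 if the task succeeded, 0 otherwise."""
    return 1.0 if task_succeeded else 0.0

def sample_mixed_batch(prior_data, online_data, batch_size, rng):
    """Draw half the batch from prior data and the rest from online experience."""
    half = batch_size // 2
    batch = [rng.choice(prior_data) for _ in range(half)]
    batch += [rng.choice(online_data) for _ in range(batch_size - half)]
    return batch

rng = random.Random(0)

# Placeholder transitions: (source, index, reward). Real ones would hold
# RL tokens, actions, and next-step tokens.
prior_data = [("demo", i, sparse_reward(True)) for i in range(100)]
online_data = [("online", i, sparse_reward(i % 3 == 0)) for i in range(20)]

batch = sample_mixed_batch(prior_data, online_data, batch_size=8, rng=rng)
print(sum(1 for t in batch if t[0] == "demo"), "of", len(batch), "from prior data")
```

Mixing in prior data means the critic sees successful outcomes from the first update, which matters when the online reward is 0 for most early attempts.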

Sample efficiency matters: On the screw insertion task, RLT shows significant improvement after just ~30 minutes of real-world practice. On charger insertion, meaningful gains appear in under an hour. This is practical enough for real deployment scenarios where you fine-tune for a specific setup.
How long does online RL fine-tuning with RLT typically take to show significant improvement?

Chapter 6: Results

RLT is evaluated on four tasks that all require sub-millimeter precision:

| Task | Challenge | Speedup | Success Improvement |
| --- | --- | --- | --- |
| Screw installation | Thread screw into hole | 3x faster | 20% → 65% |
| Zip tie fastening | Thread through tight slot | 2x faster | Significant |
| Charger insertion | Align and insert USB-C | 1.5x faster | Significant |
| Ethernet insertion | Click RJ45 into port | 2x faster | Significant |
[Figure: Critical Phase Speedup]
The gains are concentrated where they matter: RLT doesn't change the reaching or alignment phases much — the VLA already handles those well. The biggest improvements are in the critical insertion/threading phases where precision determines success or failure.
Where does RLT produce the largest improvements?

Chapter 7: Beyond Human Speed

The most striking result: on some of the most dexterous phases, RLT-trained policies surpass expert human teleoperation speed while maintaining reliability.

This is remarkable because imitation learning is bounded by its demonstrations: a policy trained only to copy the demonstrator has no signal telling it how to do better. RL does. The robot discovers faster, more efficient motion strategies that the human teleoperator didn't use, precisely because it optimizes for speed while maintaining success rather than copying a human's cautious approach.

Why RL surpasses humans here: Human teleoperators are cautious — they slow down at critical phases to avoid failure. The robot, through practice, discovers that certain faster motions are actually MORE reliable than slow, cautious ones (e.g., a quick, decisive insertion rather than a slow, shaky approach). RL finds these strategies because it optimizes for the outcome, not for mimicking human behavior.
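One standard way a success-only reward can still favor speed is discounting. The source does not say how speed enters RLT's objective, so the discount factor and step counts below are purely illustrative assumptions.

```python
def discounted_success_return(steps_to_success, gamma=0.99):
    """Return at t=0 when the only reward is +1 at the step the task succeeds."""
    return gamma ** steps_to_success

# Hypothetical episode lengths for a cautious and a decisive insertion.
slow = discounted_success_return(300)
fast = discounted_success_return(100)
print(f"slow: {slow:.3f}, fast: {fast:.3f}")
```

With any discount factor below 1, a policy that succeeds sooner collects a higher return than one that succeeds later. The optimizer is therefore pushed toward faster motions as long as they do not reduce the success rate.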
How can an RL-trained robot surpass human teleoperation speed?

Chapter 8: Connections

RLT occupies a specific niche in the VLA improvement landscape:

| Method | What It Improves | Sample Efficiency |
| --- | --- | --- |
| RECAP (pi*0.6) | Overall task success via offline RL + advantage conditioning | Hours to days |
| RLT | Precision at critical phases via online RL + lightweight actor-critic | Minutes to hours |
| Full VLA fine-tune | Everything, but risks forgetting | Days to weeks |

RLT and RECAP are complementary: RECAP improves the VLA's overall behavior through offline RL on diverse experience. RLT adds surgical precision improvements through online RL on specific critical phases. A production system might use both.

Related lessons: pi-0 · pi*0.6 · Gleams: RL Algorithms