Physical Intelligence, 2025

RL Token: Bootstrapping Online RL with VLAs

VLAs get you 90% of the way. The last 10% — sub-millimeter precision, speed beyond human teleoperation — requires practice. The RL Token lets a frozen VLA delegate precision to a tiny actor-critic that learns in minutes of real-world practice.

Prerequisites: VLA basics + Actor-critic RL

Chapter 0: The Last Millimeter

Vision-language-action models (VLAs) are impressive generalists. They can pick up objects, open drawers, fold towels. But ask one to insert a USB charger into a port, a task requiring sub-millimeter precision, and it struggles. The motions are slow, tentative, and full of retries. They sometimes succeed, but not reliably.

The problem isn't understanding: the VLA generally knows what the task requires. The problem is precision during the critical phase. When you're inserting a screw, the first 95% of the motion (reaching, aligning) is easy. The last 5% (threading the screw into the hole) requires sub-millimeter control that behavioral cloning from demonstrations can't consistently achieve.

You could fine-tune the entire VLA with RL, but that's expensive — billions of parameters to update, risk of catastrophic forgetting, and it takes days. What if you could keep the VLA frozen and just add a tiny, fast-learning module for precision?

The practical constraint: Real-world RL has a tight budget. Every episode takes real time. Hardware wears out. You might have a few hours — not days — to improve a task. Whatever method you use must be sample-efficient enough to show gains in minutes, not months.

Why do VLAs struggle with sub-millimeter precision tasks like screw insertion?

Behavioral cloning from demonstrations can't reliably capture the precise, critical-phase motions; small errors at these stages compound into failure
VLAs can't process high-resolution images
The robot hardware isn't precise enough

Chapter 1: The Key Idea

The RL Token (RLT) method creates a compact interface between the frozen VLA and a lightweight RL agent. The VLA processes the full observation (images, language, proprioception) and produces a compressed representation — the RL token. A small actor-critic network takes this token and outputs refined actions.

The analogy: Think of the VLA as a senior surgeon who understands the entire operation. The RL token is like the surgeon's verbal instructions to a junior resident who has fast, precise hands. The surgeon (VLA) provides broad understanding; the resident (actor-critic) provides fine motor precision. Together, they're better than either alone.

Two key design choices make this work:

The VLA stays frozen. No risk of catastrophic forgetting. No expensive gradient computation through billions of parameters.
The actor-critic is tiny (~1M parameters vs the VLA's billions). It can learn from a few hundred real-world episodes — minutes to hours of practice.

Why does RLT keep the VLA frozen instead of fine-tuning it with RL?

Freezing preserves the VLA's generalization, avoids catastrophic forgetting, and eliminates expensive gradient computation through billions of parameters
The VLA is already perfect
RL can't be applied to transformers

Chapter 2: The RL Token

The RL token is a compressed representation extracted from the VLA's internal activations. Specifically:

Run the full observation through the frozen VLA
An encoder takes the VLA's internal features and compresses them into a compact vector (the RL token)
A decoder verifies the RL token retains useful information by reconstructing the VLA's action output

The encoder and decoder are trained via a simple autoencoding objective: compress the VLA's features, then reconstruct the VLA's actions. This pushes the RL token to retain what the VLA knows about the current situation that bears on its action choice (object positions, task progress, spatial relationships) in a compact form the actor-critic can use efficiently.
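One way to write this objective down, as a sketch rather than the method's exact loss: the symbols here are assumptions introduced for illustration. Let h_t be the frozen VLA's internal features at step t, E_φ the encoder, D_ψ the decoder, z_t the RL token, and a_t^VLA the action the VLA itself outputs. A squared-error reconstruction loss is assumed; the lesson does not specify which distance is used.

```latex
z_t = E_\phi(h_t), \qquad
\mathcal{L}_{\text{token}}(\phi, \psi)
  = \mathbb{E}_{t}\left[ \left\lVert D_\psi(z_t) - a_t^{\text{VLA}} \right\rVert_2^2 \right]
```

Only φ and ψ receive gradients; the VLA that produces h_t and a_t^VLA stays frozen.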

[Figure: RL Token Architecture]

Why not just use the VLA's raw features? The VLA's internal representation has thousands of dimensions. Training an actor-critic on such a large input would be slow and sample-inefficient. The RL token compresses this to a much smaller vector that keeps the task-relevant information. This compression is what enables learning in minutes to hours rather than days.

What is the RL token?

A compact representation extracted from the VLA's internal features via a trained encoder, preserving task-relevant knowledge in a form efficient for online RL
A special vocabulary token added to the VLA
A reward signal

Chapter 3: The Actor-Critic Head

On top of the RL token, RLT trains a small actor-critic:

Actor: takes the RL token → outputs a refined action (adjusting the VLA's suggested action)
Critic: takes the RL token + action → estimates the expected return (Q-value)

A critical detail: the actor's output is anchored to the VLA's action via a regularization term. The actor learns to make small adjustments to the VLA's base action, not to generate actions from scratch. This means:

Early in training, the policy behaves almost identically to the VLA (safe baseline)
As training progresses, the actor learns to deviate where it matters — at critical precision points
The VLA's generalization is preserved for non-critical phases

Anchoring is the secret sauce. Without it, the small actor would learn a completely new policy from scratch — slow, sample-inefficient, and losing the VLA's knowledge. With anchoring, it starts from the VLA's behavior and makes targeted improvements. This is why RLT can learn in minutes.
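A minimal sketch of what anchoring could look like in code. The lesson only says the actor is tied to the VLA's action by a regularization term, so everything specific here is an assumption made for illustration: the residual form (VLA action plus a correction), the tanh bound, and the hypothetical names `anchored_action`, `actor_loss`, `max_delta`, and `beta`.

```python
import math


def anchored_action(vla_action, raw_delta, max_delta=0.01):
    """Return the VLA's action plus a small, bounded correction.

    raw_delta is the unbounded actor output; tanh squashes each component
    into (-max_delta, max_delta). The residual form and the bound are
    illustrative assumptions, not details taken from the lesson.
    """
    return [a + max_delta * math.tanh(d) for a, d in zip(vla_action, raw_delta)]


def actor_loss(q_value, action, vla_action, beta=1.0):
    """Loss to minimize: negative critic value plus an anchoring penalty.

    The squared distance to the VLA action plays the role of the
    regularization term; beta (hypothetical) sets how strongly the
    actor is pulled back toward the VLA.
    """
    penalty = sum((a - b) ** 2 for a, b in zip(action, vla_action))
    return -q_value + beta * penalty


# With a zero correction the policy reproduces the VLA action exactly,
# which is the "safe baseline" behavior early in training.
base = [0.10, -0.20, 0.05]
print(anchored_action(base, [0.0, 0.0, 0.0]))
```

In this sketch a larger `beta` keeps the policy closer to the VLA, while a smaller one lets the critic's preferences dominate at the precision-critical steps.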

Why is the actor anchored to the VLA's action output?

So it starts from the VLA's already-good behavior and makes targeted refinements, rather than learning from scratch, preserving the VLA's knowledge and enabling learning in minutes
To make the actor output match the VLA exactly
To reduce the actor's parameter count

Chapter 4: Division of Labor

RLT creates an elegant division of labor:

Component	Size	Role	Trainable?
VLA (frozen)	~3B params	Perception, language understanding, coarse manipulation	No
Encoder	~100K params	Compress VLA features → RL token	Yes (trained offline in Phase 1, fixed during online RL)
Actor	~500K params	Refine actions for precision	Yes (online RL)
Critic	~500K params	Estimate value for RL updates	Yes (online RL)

The frozen VLA handles the hard part — understanding what to do from images and language. The tiny actor-critic handles the precise part — exactly how to move at critical moments. Total trainable parameters during RL: ~1M. Compare this to fine-tuning the full 3B VLA.
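The arithmetic behind that comparison, using the approximate sizes from the table. The function name `trainable_fraction` is just for illustration, and the encoder is counted as not updated during online RL, which is how the ~1M total above reads.

```python
def trainable_fraction(actor=500_000, critic=500_000,
                       encoder=100_000, vla=3_000_000_000):
    """Fraction of all parameters updated during online RL.

    Only the actor and critic are counted as trainable in this phase;
    the sizes are the approximate figures from the table.
    """
    trainable = actor + critic
    total = trainable + encoder + vla
    return trainable / total


print(f"{trainable_fraction():.4%}")  # prints 0.0333%
```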

What is the ratio of trainable RL parameters to frozen VLA parameters in RLT?

~1M trainable (actor-critic) vs ~3B frozen (VLA): about 0.03% of the total model is updated during RL
About 50/50
Everything is trainable

Chapter 5: Training

RLT training has two phases:

Phase 1: RL token extraction (offline, ~1 hour)

Train the encoder-decoder on existing demonstrations. Run each demo through the frozen VLA, extract internal features, train the encoder to compress them and the decoder to reconstruct the VLA's actions. After this phase, you have a working RL token that captures the VLA's knowledge.
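A toy version of Phase 1, to make the data flow concrete. This is a sketch under heavy assumptions: the encoder and decoder are single linear layers trained with plain per-sample gradient descent, the "VLA features" and "VLA actions" are synthetic numbers rather than real model outputs, and the names `matvec` and `train_rl_token` are hypothetical.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def train_rl_token(features, vla_actions, token_dim, lr=0.05, epochs=200, seed=0):
    """Fit a linear encoder (features -> token) and decoder (token -> action).

    Minimizes 0.5 * ||decode(encode(h)) - a_vla||^2 per sample.
    Returns both weight matrices and the mean loss of each epoch.
    """
    rng = random.Random(seed)
    feat_dim, act_dim = len(features[0]), len(vla_actions[0])
    W_enc = [[rng.uniform(-0.5, 0.5) for _ in range(feat_dim)]
             for _ in range(token_dim)]
    W_dec = [[rng.uniform(-0.5, 0.5) for _ in range(token_dim)]
             for _ in range(act_dim)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for h, a in zip(features, vla_actions):
            z = matvec(W_enc, h)          # the RL token
            a_hat = matvec(W_dec, z)      # reconstructed VLA action
            err = [p - t for p, t in zip(a_hat, a)]
            total += 0.5 * sum(e * e for e in err)
            # Gradient w.r.t. the token, computed before the decoder changes.
            grad_z = [sum(W_dec[i][k] * err[i] for i in range(act_dim))
                      for k in range(token_dim)]
            for i in range(act_dim):
                for k in range(token_dim):
                    W_dec[i][k] -= lr * err[i] * z[k]
            for k in range(token_dim):
                for j in range(feat_dim):
                    W_enc[k][j] -= lr * grad_z[k] * h[j]
        losses.append(total / len(features))
    return W_enc, W_dec, losses


# Synthetic stand-in data: 4-dim "features", 2-dim "actions" from a fixed map.
data_rng = random.Random(1)
true_map = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
features = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(32)]
vla_actions = [matvec(true_map, h) for h in features]

W_enc, W_dec, losses = train_rl_token(features, vla_actions, token_dim=2)
print(f"first-epoch loss {losses[0]:.4f}, last-epoch loss {losses[-1]:.6f}")
```

After training, only the encoder is needed at run time: `matvec(W_enc, h)` plays the role of the RL token handed to the actor-critic.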

Phase 2: Online RL (on-robot, minutes to hours)

Deploy the system on the real robot. The VLA processes observations, the encoder produces RL tokens, and the actor generates refined actions. After each episode, update the actor and critic using a sample-efficient RL algorithm (RLPD — RL with prior data).

The reward is sparse and simple: +1 if the task succeeded, 0 otherwise. No reward engineering needed.
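A sketch of the two ingredients just described: the sparse success reward, and batches that mix prior data with fresh on-robot experience. RLPD as published draws each batch half from prior data and half from the online replay buffer; whether RLT uses exactly that ratio is not stated here, so treat the 50/50 split as an assumption. The function names and the tuple layout are hypothetical.

```python
import random


def sparse_reward(success):
    """+1 if the task succeeded, 0 otherwise."""
    return 1.0 if success else 0.0


def sample_symmetric_batch(prior_data, online_data, batch_size, rng):
    """Draw half the batch from prior data and the rest from online data.

    Sampling is with replacement, so it still works early in training
    when the online buffer holds only a few transitions.
    """
    half = batch_size // 2
    batch = rng.choices(prior_data, k=half)
    batch += rng.choices(online_data, k=batch_size - half)
    rng.shuffle(batch)
    return batch


# Hypothetical transitions, tagged by source so the mix is visible.
rng = random.Random(0)
prior = [("prior", i, sparse_reward(True)) for i in range(10)]
online = [("online", i, sparse_reward(i % 2 == 0)) for i in range(3)]
batch = sample_symmetric_batch(prior, online, batch_size=8, rng=rng)
print(sum(1 for t in batch if t[0] == "prior"), "of", len(batch), "from prior data")
```

Mixing in prior data means the critic sees successful trajectories from the first update, before the robot has collected many of its own.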

Sample efficiency matters: On the screw insertion task, RLT shows significant improvement after just ~30 minutes of real-world practice. On charger insertion, meaningful gains appear in under an hour. This is practical enough for real deployment scenarios where you fine-tune for a specific setup.

How long does online RL fine-tuning with RLT typically take to show significant improvement?

Minutes to a few hours of real-world robot practice, which is practical for real deployment
Days of training
Weeks of simulation

Chapter 6: Results

RLT is evaluated on four tasks that all require sub-millimeter precision:

Task	Challenge	Speedup	Success Improvement
Screw installation	Thread screw into hole	3x faster	20% → 65%
Zip tie fastening	Thread through tight slot	2x faster	Significant
Charger insertion	Align and insert USB-C	1.5x faster	Significant
Ethernet insertion	Click RJ45 into port	2x faster	Significant

[Figure: Critical Phase Speedup]

The gains are concentrated where they matter: RLT doesn't change the reaching or alignment phases much — the VLA already handles those well. The biggest improvements are in the critical insertion/threading phases where precision determines success or failure.

Where does RLT produce the largest improvements?

In the critical precision phases (insertion, threading), the parts where the VLA struggles most and small errors cause failure
In the reaching phase
Uniformly across all phases

Chapter 7: Beyond Human Speed

The most striking result: on some of the most dexterous phases, RLT-trained policies surpass expert human teleoperation speed while maintaining reliability.

This is remarkable because pure imitation learning is tied to its demonstrations and typically does not exceed the demonstrator. RL can. The robot discovers faster, more efficient motion strategies that the human teleoperator didn't use, because it optimizes for the task outcome rather than copying a human's cautious approach.

Why RL surpasses humans here: Human teleoperators are cautious — they slow down at critical phases to avoid failure. The robot, through practice, discovers that certain faster motions are actually MORE reliable than slow, cautious ones (e.g., a quick, decisive insertion rather than a slow, shaky approach). RL finds these strategies because it optimizes for the outcome, not for mimicking human behavior.

How can an RL-trained robot surpass human teleoperation speed?

RL optimizes for task success, not human mimicry; it discovers faster strategies (like decisive insertions) that humans' cautious teleoperation wouldn't attempt
The robot has faster motors
The robot uses a bigger model

Chapter 8: Connections

RLT occupies a specific niche in the VLA improvement landscape:

Method	What It Improves	Sample Efficiency
RECAP (pi*0.6)	Overall task success via offline RL + advantage conditioning	Hours to days
RLT	Precision at critical phases via online RL + lightweight actor-critic	Minutes to hours
Full VLA fine-tune	Everything, but risks forgetting	Days to weeks

RLT and RECAP are complementary: RECAP improves the VLA's overall behavior through offline RL on diverse experience. RLT adds surgical precision improvements through online RL on specific critical phases. A production system might use both.

Related lessons: pi-0 • pi*0.6 • Gleams: RL Algorithms
