Vision-language-action models (VLAs) get you 90% of the way. The last 10%, which covers sub-millimeter precision and speed beyond human teleoperation, requires practice. The RL Token lets a frozen VLA delegate precision to a tiny actor-critic that learns from minutes to hours of real-world practice.
VLAs are impressive generalists. They can pick up objects, open drawers, fold towels. But ask them to insert a USB charger into a port — a task requiring sub-millimeter precision — and they struggle. The motions are slow, tentative, full of retries. They work, sometimes, but they're not reliable.
The problem isn't understanding: the VLA generally knows what the task is. The problem is precision at the critical phase. When you're installing a screw, the first 95% of the motion (reaching, aligning) is easy. The last 5% (threading the screw into the hole) requires sub-millimeter control that behavioral cloning from demonstrations can't consistently achieve.
You could fine-tune the entire VLA with RL, but that's expensive: there are billions of parameters to update, there's a risk of catastrophic forgetting, and training takes days to weeks. What if you could keep the VLA frozen and just add a tiny, fast-learning module for precision?
The RL Token (RLT) method creates a compact interface between the frozen VLA and a lightweight RL agent. The VLA processes the full observation (images, language, proprioception) and produces a compressed representation — the RL token. A small actor-critic network takes this token and outputs refined actions.
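To make the data flow concrete, here is a minimal shape-level sketch in Python. Everything in it is a stand-in: the dimensions, the random weight matrices, and the function names `frozen_vla`, `encode_rl_token`, and `actor` are hypothetical and only illustrate how an observation might pass through the frozen VLA, get compressed into an RL token, and reach the small actor.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical, chosen only to keep the sketch small.
OBS_DIM, FEATURE_DIM, TOKEN_DIM, ACTION_DIM = 32, 64, 8, 7

# Fixed random weights stand in for trained networks.
W_feat = rng.normal(size=(OBS_DIM, FEATURE_DIM)) / np.sqrt(OBS_DIM)
W_base = rng.normal(size=(FEATURE_DIM, ACTION_DIM)) / np.sqrt(FEATURE_DIM)
W_enc = rng.normal(size=(FEATURE_DIM, TOKEN_DIM)) / np.sqrt(FEATURE_DIM)
W_actor = rng.normal(size=(TOKEN_DIM, ACTION_DIM)) / np.sqrt(TOKEN_DIM)


def frozen_vla(observation):
    """Stand-in for the frozen VLA: internal features plus its own action."""
    features = np.tanh(observation @ W_feat)
    base_action = np.tanh(features @ W_base)
    return features, base_action


def encode_rl_token(features):
    """Compress the VLA features into a much smaller RL token."""
    return features @ W_enc


def actor(rl_token):
    """Small policy head that maps the RL token to a refined action."""
    return np.tanh(rl_token @ W_actor)


# Placeholder for the real observation (images, language, proprioception).
observation = rng.normal(size=OBS_DIM)
features, base_action = frozen_vla(observation)
rl_token = encode_rl_token(features)
refined_action = actor(rl_token)

print(features.shape, rl_token.shape, refined_action.shape)  # (64,) (8,) (7,)
```

The point of the sketch is the bottleneck: the actor never sees the raw observation or the full feature vector, only the compact token.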
Two key design choices make this work:
The RL token is a compressed representation extracted from the VLA's internal activations. Specifically:
The encoder and decoder are trained via a simple autoencoding objective: compress the VLA's features, then reconstruct the VLA's actions. This encourages the RL token to retain what the VLA knows about the current situation that matters for acting (object positions, task progress, spatial relationships) in a compact form the actor-critic can use efficiently.
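The objective described above can be sketched with a toy linear encoder and decoder. This is an illustration under stated assumptions, not the method's actual architecture: the features and actions are synthetic, the sizes are hypothetical, and plain gradient descent on a mean-squared error stands in for whatever optimizer and loss are really used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; synthetic data stands in for real VLA features and actions.
N, FEATURE_DIM, TOKEN_DIM, ACTION_DIM = 256, 32, 8, 7
vla_features = rng.normal(size=(N, FEATURE_DIM))
true_map = rng.normal(size=(FEATURE_DIM, ACTION_DIM)) / np.sqrt(FEATURE_DIM)
vla_actions = vla_features @ true_map  # pretend these are the VLA's actions

W_enc = 0.1 * rng.normal(size=(FEATURE_DIM, TOKEN_DIM))  # encoder weights
W_dec = 0.1 * rng.normal(size=(TOKEN_DIM, ACTION_DIM))   # decoder weights


def reconstruction_loss(W_enc, W_dec):
    tokens = vla_features @ W_enc      # compress features into RL tokens
    predicted = tokens @ W_dec         # decode tokens back into actions
    return np.mean((predicted - vla_actions) ** 2)


initial_loss = reconstruction_loss(W_enc, W_dec)
learning_rate = 0.2
for _ in range(500):
    tokens = vla_features @ W_enc
    error = tokens @ W_dec - vla_actions
    grad_out = 2.0 * error / error.size            # gradient of the mean squared error
    grad_dec = tokens.T @ grad_out                 # backprop into the decoder
    grad_enc = vla_features.T @ (grad_out @ W_dec.T)  # backprop into the encoder
    W_enc -= learning_rate * grad_enc
    W_dec -= learning_rate * grad_dec

final_loss = reconstruction_loss(W_enc, W_dec)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Because the target is the action rather than the features themselves, the bottleneck is pushed to keep whatever in the features predicts what the VLA would do.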
On top of the RL token, RLT trains a small actor-critic:
A critical detail: the actor's output is anchored to the VLA's action via a regularization term. The actor learns to make small adjustments to the VLA's base action, not to generate actions from scratch. This means:
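In code, the anchoring might look like the sketch below. It assumes a squared-distance penalty weighted by a hypothetical coefficient `beta`, and it uses a toy quadratic critic (`toy_q`) purely for illustration; the source does not specify the exact form of the regularization term.

```python
import numpy as np


def anchored_actor_loss(action, vla_action, q_fn, beta):
    """Negative critic value plus a penalty for straying from the VLA's action."""
    anchor_penalty = np.sum((action - vla_action) ** 2)
    return -q_fn(action) + beta * anchor_penalty


# Toy critic (an assumption for illustration): value peaks at `best_action`.
best_action = np.array([0.30, -0.10])
vla_action = np.array([0.20, 0.00])


def toy_q(action):
    return -np.sum((action - best_action) ** 2)


def minimize_loss(beta, steps=500, lr=0.02):
    """Plain gradient descent on the anchored loss, starting at the VLA action."""
    action = vla_action.copy()
    for _ in range(steps):
        # Analytic gradient of anchored_actor_loss for the toy critic above.
        grad = 2.0 * (action - best_action) + 2.0 * beta * (action - vla_action)
        action -= lr * grad
    return action


for beta in (0.0, 1.0, 10.0):
    action = minimize_loss(beta)
    loss = anchored_actor_loss(action, vla_action, toy_q, beta)
    shift = np.linalg.norm(action - vla_action)
    print(f"beta={beta:>4}: shift from VLA={shift:.3f}, loss={loss:.4f}")
```

With `beta = 0` the actor moves all the way to the critic's preferred action. Larger values keep it closer to the VLA's base action, so the learned correction stays small.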
RLT creates an elegant division of labor:
| Component | Size | Role | Trainable? |
|---|---|---|---|
| VLA (frozen) | ~3B params | Perception, language understanding, coarse manipulation | No |
| Encoder | ~100K params | Compress VLA features → RL token | Yes (pre-trained) |
| Actor | ~500K params | Refine actions for precision | Yes (online RL) |
| Critic | ~500K params | Estimate value for RL updates | Yes (online RL) |
The frozen VLA handles the hard part: understanding what to do from images and language. The tiny actor-critic handles the precise part: exactly how to move at critical moments. Total trainable parameters during RL: ~1M, roughly 3,000 times fewer than fine-tuning the full ~3B VLA.
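The arithmetic behind that comparison is short enough to check directly. The figures are the approximate sizes from the table above; the variable names are just for illustration.

```python
# Approximate parameter counts taken from the table above.
vla_params = 3_000_000_000
actor_params = 500_000
critic_params = 500_000

# Only the actor and critic are updated during online RL.
rl_trainable = actor_params + critic_params
fraction = rl_trainable / vla_params

print(f"{rl_trainable:,} trainable params, {fraction:.4%} of the VLA")
# 1,000,000 trainable params, 0.0333% of the VLA
```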
RLT training has two phases:
Train the encoder-decoder on existing demonstrations. Run each demo through the frozen VLA, extract internal features, train the encoder to compress them and the decoder to reconstruct the VLA's actions. After this phase, you have a working RL token that captures the VLA's knowledge.
Deploy the system on the real robot. The VLA processes observations, the encoder produces RL tokens, and the actor generates refined actions. After each episode, update the actor and critic using a sample-efficient RL algorithm (RLPD — RL with prior data).
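One ingredient of RLPD is symmetric sampling: each training batch takes half its samples from prior data and half from the online replay buffer. The sketch below shows that idea only. Treating the demonstrations as the prior data is an assumption here, and the function name and the tagged tuples are hypothetical.

```python
import numpy as np


def sample_symmetric_batch(prior_data, online_buffer, batch_size, rng):
    """Draw half the batch from prior data and half from online experience."""
    half = batch_size // 2
    prior_idx = rng.integers(0, len(prior_data), size=half)
    online_idx = rng.integers(0, len(online_buffer), size=batch_size - half)
    return [prior_data[i] for i in prior_idx] + [online_buffer[i] for i in online_idx]


rng = np.random.default_rng(0)

# Hypothetical transitions, tagged by source so the mix is easy to inspect.
prior_data = [("demo", i) for i in range(1000)]
online_buffer = [("online", i) for i in range(50)]

batch = sample_symmetric_batch(prior_data, online_buffer, batch_size=64, rng=rng)
num_online = sum(1 for source, _ in batch if source == "online")
print(f"{num_online} of {len(batch)} samples come from online practice")
```

Sampling this way keeps fresh on-robot experience well represented in every update even while the online buffer is still far smaller than the prior dataset.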
The reward is sparse and simple: +1 if the task succeeded, 0 otherwise. No reward engineering needed.
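A sparse success reward can still favor fast execution once returns are discounted. The sketch below assumes a discount factor `gamma = 0.99` and a standard one-step TD target; neither is stated in the source, and the function names are hypothetical.

```python
def sparse_reward(task_succeeded):
    """+1 on success, 0 otherwise."""
    return 1.0 if task_succeeded else 0.0


def td_target(reward, next_value, done, gamma=0.99):
    """One-step TD target; no bootstrapping past the end of the episode."""
    return reward + gamma * (0.0 if done else next_value)


def value_at_start(steps_to_success, gamma=0.99):
    """Back up the final success reward through an episode of the given length."""
    value = td_target(sparse_reward(True), next_value=0.0, done=True, gamma=gamma)
    for _ in range(steps_to_success - 1):
        # Every earlier step earns zero reward and bootstraps from the next step.
        value = td_target(0.0, next_value=value, done=False, gamma=gamma)
    return value


fast = value_at_start(steps_to_success=100)
slow = value_at_start(steps_to_success=300)
print(f"value at start: fast episode {fast:.3f}, slow episode {slow:.3f}")
```

Under these assumptions a successful episode that finishes sooner has a higher value at its first step, so the critic can prefer faster strategies without any hand-written speed term in the reward.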
RLT is evaluated on four tasks that all require sub-millimeter precision:
| Task | Challenge | Speedup | Success Improvement |
|---|---|---|---|
| Screw installation | Thread screw into hole | 3x faster | 20% → 65% |
| Zip tie fastening | Thread through tight slot | 2x faster | Significant |
| Charger insertion | Align and insert USB-C | 1.5x faster | Significant |
| Ethernet insertion | Click RJ45 into port | 2x faster | Significant |
The most striking result: on some of the most dexterous phases, RLT-trained policies surpass expert human teleoperation speed while maintaining reliability.
This is remarkable because imitation learning on its own is generally bounded by the demonstrator's performance, while RL is not. The robot discovers faster, more efficient motion strategies that the human teleoperator didn't use, precisely because it optimizes for speed while maintaining success rather than copying a human's cautious approach.
RLT occupies a specific niche in the VLA improvement landscape:
| Method | What It Improves | Sample Efficiency |
|---|---|---|
| RECAP (π*0.6) | Overall task success via offline RL + advantage conditioning | Hours to days |
| RLT | Precision at critical phases via online RL + lightweight actor-critic | Minutes to hours |
| Full VLA fine-tune | Everything, but risks forgetting | Days to weeks |
RLT and RECAP are complementary: RECAP improves the VLA's overall behavior through offline RL on diverse experience. RLT adds surgical precision improvements through online RL on specific critical phases. A production system might use both.