Physical Intelligence, 2025

Knowledge Insulating VLAs

Train fast, run fast, generalize better. The secret: stop gradients from the action expert corrupting the VLM's web-scale knowledge. Two losses, one model, zero interference.

Prerequisites: pi-0 architecture + Flow matching
9 Chapters, 3+ Simulations

Chapter 0: The Problem

When you fine-tune a VLM (like PaliGemma) for robot control, you add new components, chiefly an action expert that outputs continuous actions via flow matching. These new components have randomly initialized weights. During training, the gradients of their loss flow backward through the VLM backbone, and early on those gradients carry little useful signal.

This is catastrophic. The VLM learned to understand images and language from web-scale data at enormous compute cost. Now, noisy gradients from a randomly initialized action head are overwriting that knowledge. The VLM backbone "forgets" what a kitchen looks like, what "pick up the plate" means, how spatial relationships work.

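The result: the VLA trains slower (fighting to relearn what the VLM already knew) and generalizes worse (the corrupted VLM features don't transfer to new scenes). The obvious alternative, dropping the action expert and decoding discrete action tokens autoregressively, runs slower at inference, because tokens must be generated one at a time.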

The paradox: You add an action expert to make the model BETTER at robot control, but the process of training that expert makes the model WORSE at understanding the world. The cure is worse than the disease — unless you insulate the backbone from the action expert's gradients.

Chapter 1: The Key Insight

Stop the gradients. Don't let the action expert's flow matching loss propagate back into the VLM backbone. Instead, give the backbone its OWN training signal: a standard next-token prediction loss on discretized actions (just like training a language model).

This creates two parallel training objectives in one model:

VLM Backbone
Trained with next-token prediction on discrete action tokens + VLM data. Gradients from action expert BLOCKED.
↓ features flow down (no gradients flow back up)
Action Expert
Trained with flow matching on continuous actions. Uses backbone features as input, but doesn't corrupt them.

The backbone learns good representations for robot control via discrete token prediction (which is a natural fit for transformer training). The action expert learns precise continuous control via flow matching (which gives smooth, high-frequency actions). Neither interferes with the other.

Why this works: Discrete token prediction is the SAME objective the VLM was pre-trained with (next-token prediction on text). So the backbone keeps learning in its native mode — it never sees the alien gradients from flow matching. The action expert gets the backbone's rich features as input but can't damage them.
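
To make "discretized actions" concrete, here is a minimal sketch of turning a continuous action vector into integer tokens by uniform binning. The bin count (256), the [-1, 1] range, and the function names are illustrative assumptions; the table in Chapter 8 points to FAST, not simple per-dimension binning, as the tokenizer used in this model family.

```python
def discretize(action, low=-1.0, high=1.0, num_bins=256):
    """Map each continuous action dimension to an integer token in [0, num_bins - 1]."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)        # clip to the assumed range
        frac = (a - low) / (high - low)   # normalize to [0, 1]
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def undiscretize(tokens, low=-1.0, high=1.0, num_bins=256):
    """Map tokens back to bin centers; the round trip loses at most half a bin width."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


action = [-1.0, 0.0, 0.5, 1.0]   # hypothetical 4-dimensional action
tokens = discretize(action)
recovered = undiscretize(tokens)
```

The tokens can be predicted with ordinary cross-entropy, which is what keeps the backbone in its native mode. The small round-trip error in `recovered` is one reason a continuous action head is still worth having.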

Chapter 2: Gradient Flow

In a standard VLA (like the original pi-0), gradients flow freely between all components. The flow matching loss on the action expert sends gradients backward through the backbone — and these gradients are initially random and noisy because the action expert starts from random weights.

In knowledge-insulated training, a stop-gradient operation blocks this backward flow:

[Interactive figure: Gradient Flow, Standard vs Insulated]

The stop-gradient is trivially easy to implement — one line of code in PyTorch (features.detach()). But its effect is profound: it completely prevents the action expert from corrupting the backbone's representations.
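
The effect of that one line can be checked directly. The sketch below uses two small linear layers as hypothetical stand-ins for the backbone and the action expert (the real components are transformers), and a plain squared-error loss in place of flow matching; only the `.detach()` call is taken from the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

backbone = nn.Linear(4, 8)       # hypothetical stand-in for the VLM backbone
action_expert = nn.Linear(8, 2)  # hypothetical stand-in for the action expert

obs = torch.randn(3, 4)
target = torch.randn(3, 2)


def backbone_receives_gradient(insulated):
    """Run one backward pass and report whether the backbone got a gradient."""
    backbone.zero_grad(set_to_none=True)
    action_expert.zero_grad(set_to_none=True)
    features = backbone(obs)
    if insulated:
        features = features.detach()  # stop-gradient: same values, no backward path
    loss = ((action_expert(features) - target) ** 2).mean()
    loss.backward()
    return backbone.weight.grad is not None


standard = backbone_receives_gradient(insulated=False)   # True: backbone is updated
insulated = backbone_receives_gradient(insulated=True)   # False: backbone untouched
expert_still_trains = action_expert.weight.grad is not None
```

In both cases the action expert sees identical feature values in the forward pass; the only difference is whether its loss is allowed to change the backbone.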


Chapter 3: Knowledge Insulation

The name "knowledge insulation" comes from electrical insulation — preventing unwanted current flow. Here, we prevent unwanted gradient flow from damaging the VLM's pre-trained knowledge.

But insulation alone isn't enough. If you just block gradients and only train the backbone with discrete tokens, you lose the ability to output continuous actions. The trick is the dual architecture:

| Component | Training Signal | Purpose |
|---|---|---|
| VLM Backbone | Next-token prediction (discrete) | Learn representations, understand scenes, follow instructions |
| Action Expert | Flow matching (continuous) | Output precise, smooth motor commands at high frequency |

At inference time, the action expert is what produces the actions. The backbone encodes the observation and instruction once, and the action expert turns those features into continuous actions directly, without autoregressive token decoding. This means inference is fast (no sequential token generation) and precise (continuous, not discretized).
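
A rough sketch of why this avoids token-by-token decoding: flow matching inference starts from noise and refines the whole action chunk together over a fixed number of integration steps. The velocity function below is a toy stand-in for a trained action expert (it is the exact straight-line velocity toward a known target), and the step count of 10, the function names, and the example values are all assumptions.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for the action expert: straight-line velocity toward a known target."""
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]


def sample_actions(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (action) with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # all dimensions updated at once
    return x


target_chunk = [0.2, -0.4, 0.9]  # hypothetical action chunk
actions = sample_actions(target_chunk)
```

The loop length depends only on `num_steps`, not on how many action dimensions or timesteps are in the chunk, whereas autoregressive decoding needs one forward pass per generated token.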

The triple win: (1) Train FAST because the backbone uses next-token prediction, which is stable and efficient. (2) Run FAST because actions come from the action expert in a fixed number of flow matching steps, with no token-by-token decoding. (3) Generalize BETTER because the backbone's VLM knowledge is preserved, not corrupted.

Chapter 4: Dual Losses

The total training loss has two components that train different parts of the model:

$$L_{\text{total}} = L_{\text{NTP}}(\text{backbone}) + L_{\text{FM}}(\text{action expert})$$

$L_{\text{NTP}}$ (Next Token Prediction): standard cross-entropy loss on discrete action tokens and VLM co-training data. This trains the backbone to produce good features.

$L_{\text{FM}}$ (Flow Matching): the flow matching denoising loss on continuous actions. This trains the action expert. The backbone features it receives are detached with .detach(), so no gradients flow back.

A critical finding: having both losses is essential. Removing either one significantly degrades performance. The discrete loss gives the backbone a stable learning signal. The continuous loss gives the action expert precision. You need both.
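
The two-term loss can be sketched in a few lines of PyTorch. Everything here is a toy under stated assumptions: linear layers stand in for the transformer components, the action expert receives backbone features by concatenation, not attention, and the flow matching convention (interpolate from noise to action, regress the velocity `action - noise`) is one common choice that may differ from the paper's. The check at the end confirms that the backbone's gradient under the total loss equals its gradient under the token loss alone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, FEAT, ACT = 16, 8, 2  # toy sizes, chosen for illustration

backbone = nn.Linear(4, FEAT)                    # stand-in for the VLM backbone
token_head = nn.Linear(FEAT, VOCAB)              # predicts discrete action tokens
action_expert = nn.Linear(FEAT + ACT + 1, ACT)   # predicts a velocity

obs = torch.randn(5, 4)
action_tokens = torch.randint(0, VOCAB, (5,))
actions = torch.randn(5, ACT)


def losses():
    features = backbone(obs)
    # L_NTP: cross-entropy on discrete action tokens; gradients reach the backbone.
    loss_ntp = F.cross_entropy(token_head(features), action_tokens)
    # L_FM (assumed convention): x_t = (1 - t) * noise + t * action, target = action - noise.
    noise = torch.randn_like(actions)
    t = torch.rand(5, 1)
    x_t = (1 - t) * noise + t * actions
    expert_in = torch.cat([features.detach(), x_t, t], dim=-1)  # insulated features
    loss_fm = F.mse_loss(action_expert(expert_in), actions - noise)
    return loss_ntp, loss_fm


loss_ntp, loss_fm = losses()
total = loss_ntp + loss_fm
grad_backbone_total, grad_expert_total = torch.autograd.grad(
    total, (backbone.weight, action_expert.weight), retain_graph=True
)
(grad_backbone_ntp,) = torch.autograd.grad(loss_ntp, (backbone.weight,))
```

Because of the `.detach()`, summing the two losses is safe: each term updates only its own part of the model, even though a single backward pass handles both.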

VLM co-training bonus: Because the backbone uses standard next-token prediction, you can co-train it on general VLM data (image captioning, question answering) alongside robot data. This further preserves and even enhances the backbone's general understanding — the same trick that helps pi-0.5 generalize to new homes.

Chapter 5: Training Speed

Knowledge insulation doesn't just improve quality — it dramatically accelerates training. The paper reports up to 5x faster convergence compared to standard VLA training.

Why? Two reasons:

  1. Stable backbone learning: The discrete token objective is clean and well-conditioned — no random gradients from flow matching disrupting the optimization landscape. The backbone converges smoothly.
  2. Better initial features for the action expert: Because the backbone learns good representations quickly (via its native NTP objective), the action expert receives useful features from the start. It doesn't waste compute trying to learn from corrupted features.

Chapter 6: Results

The paper evaluates on complex real-world tasks (mobile bimanual manipulation) and standard benchmarks (DROID, LIBERO):

[Interactive figure: Knowledge Insulation vs Standard Training]

Key findings:

The ablation is decisive: When you remove knowledge insulation (let gradients flow freely), every metric gets worse — training speed, task success, and generalization. The insulation isn't optional; it's essential.

Chapter 7: VLM Co-training

A bonus benefit of knowledge insulation: because the backbone uses standard next-token prediction, you can seamlessly mix in general VLM training data such as image captioning, visual question answering, and interleaved image-text. That data contains no actions, so a flow matching loss has nothing to train on there; the token-prediction objective is what lets the backbone learn from it.

VLM co-training further strengthens the backbone's representations and prevents forgetting. The model stays sharp on visual understanding while learning robot control — the best of both worlds.
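
One way to picture the mixing is a batch sampler that draws from both sources and records which losses apply to each example. This is a minimal sketch: the record layout, the function names, and the 25% VLM fraction are hypothetical and not taken from the paper.

```python
import random


def make_batch(robot_data, vlm_data, batch_size=8, vlm_fraction=0.25, seed=0):
    """Sample a mixed batch; vlm_fraction is an assumed, illustrative mixing ratio."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        if rng.random() < vlm_fraction:
            batch.append(rng.choice(vlm_data))
        else:
            batch.append(rng.choice(robot_data))
    return batch


def applicable_losses(example):
    """Every example has target tokens; only robot examples have continuous actions."""
    names = ["ntp"]
    if example.get("actions") is not None:
        names.append("flow_matching")
    return names


robot_data = [{"tokens": [3, 7, 1], "actions": [0.1, -0.2]}]  # hypothetical records
vlm_data = [{"tokens": [5, 9, 2, 4], "actions": None}]

batch = make_batch(robot_data, vlm_data)
plan = [applicable_losses(ex) for ex in batch]
```

Every example contributes to the next-token loss, so VLM data slots into the same training step as robot data; the flow matching term is simply skipped where there are no actions.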


Chapter 8: Connections

Knowledge insulation is now a core component of the pi-0 model family's training recipe:

| Paper | What It Contributes |
|---|---|
| pi-0 | VLM backbone + flow matching action expert |
| FAST | Efficient discrete action tokenization |
| KI-VLA | Stop-gradient insulation between backbone and action expert |
| pi-0.5 | All three combined → open-world generalization |

The progression is clear: FAST enables efficient discrete tokens for the backbone. KI-VLA prevents the action expert from corrupting those representations. Together, they make the two-stage training recipe (discrete pre-training → flow matching post-training) work.

Related: pi-0, FAST, pi-0.5