Physical Intelligence, 2025

Knowledge Insulating VLAs

Train fast, run fast, generalize better. The secret: stop the action expert's gradients from corrupting the VLM's web-scale knowledge. Two losses, one model, zero interference.

Prerequisites: pi-0 architecture + Flow matching

Chapter 0: The Problem

When you fine-tune a VLM (like PaLiGemma) for robot control, you add new components — an action expert that outputs continuous actions via flow matching. These new components have randomly initialized weights. During training, gradients from these random weights flow backward through the VLM backbone.

This is catastrophic. The VLM's pre-training consumed an enormous amount of compute to learn images and language from the web. Now, noisy gradients from a randomly initialized action head are overwriting that knowledge. The VLM backbone "forgets" what a kitchen looks like, what "pick up the plate" means, and how spatial relationships work.

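The result is a three-way cost. The VLA trains slower, because it must relearn what the VLM already knew. It generalizes worse, because the corrupted VLM features don't transfer to new scenes. And the obvious workaround, dropping the action expert and decoding actions autoregressively as discrete tokens, runs slower at inference, because tokens are generated one at a time.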

The paradox: You add an action expert to make the model BETTER at robot control, but the process of training that expert makes the model WORSE at understanding the world. The cure is worse than the disease — unless you insulate the backbone from the action expert's gradients.

What happens when gradients from a randomly-initialized action expert flow back into a pre-trained VLM?

The random gradients corrupt the VLM's pre-trained representations: it "forgets" visual and language understanding, leading to worse generalization
The model trains faster
Nothing: the VLM is robust to random gradients

Chapter 1: The Key Insight

Stop the gradients. Don't let the action expert's flow matching loss propagate back into the VLM backbone. Instead, give the backbone its OWN training signal: a standard next-token prediction loss on discretized actions (just like training a language model).

This creates two parallel training objectives in one model:

VLM Backbone

Trained with next-token prediction on discrete action tokens + VLM data. Gradients from action expert BLOCKED.

↓ features flow down (no gradients flow back up)

Action Expert

Trained with flow matching on continuous actions. Uses backbone features as input, but doesn't corrupt them.

The backbone learns good representations for robot control via discrete token prediction (which is a natural fit for transformer training). The action expert learns precise continuous control via flow matching (which gives smooth, high-frequency actions). Neither interferes with the other.

Why this works: Discrete token prediction is the SAME objective the VLM was pre-trained with (next-token prediction on text). So the backbone keeps learning in its native mode — it never sees the alien gradients from flow matching. The action expert gets the backbone's rich features as input but can't damage them.
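To make "discretized actions" concrete, here is a minimal sketch of turning a continuous action vector into integer tokens that a next-token-prediction loss can target. It assumes plain uniform binning over a normalized range of [-1, 1] with 256 bins; the function names, the bin count, and the range are illustrative choices, not details from the paper, and a learned tokenizer such as FAST would replace this step in practice.

```python
from typing import List


def discretize_action(action: List[float], num_bins: int = 256,
                      low: float = -1.0, high: float = 1.0) -> List[int]:
    """Map each continuous action dimension to an integer bin index (a token id).

    Uniform binning is an assumption made for illustration only.
    """
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)        # keep the value inside [low, high]
        fraction = (clipped - low) / (high - low)   # rescale to [0, 1]
        # The top edge (fraction == 1.0) is folded into the last bin.
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens


def undiscretize_tokens(tokens: List[int], num_bins: int = 256,
                        low: float = -1.0, high: float = 1.0) -> List[float]:
    """Map bin indices back to bin centers; precision is limited by the bin width."""
    width = (high - low) / num_bins
    return [low + (token + 0.5) * width for token in tokens]


action = [-1.0, 0.0, 0.5, 1.0]
tokens = discretize_action(action)
recovered = undiscretize_tokens(tokens)
print(tokens)
print(recovered)
```

The round trip loses up to half a bin width per dimension. That quantization error is one reason the continuous action expert is kept for the final motor commands, while the tokens serve as the backbone's training signal.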

What trains the VLM backbone in the knowledge-insulated architecture?

Next-token prediction on discretized actions + VLM data: the same objective it was pre-trained with, keeping it in its native learning mode
Flow matching loss from the action expert
No training: the backbone is completely frozen

Chapter 2: Gradient Flow

In a standard VLA (like the original pi-0), gradients flow freely between all components. The flow matching loss on the action expert sends gradients backward through the backbone — and these gradients are initially random and noisy because the action expert starts from random weights.

In knowledge-insulated training, a stop-gradient operation blocks this backward flow:

[Interactive figure: Gradient Flow: Standard vs Insulated]

The stop-gradient is trivially easy to implement — one line of code in PyTorch (features.detach()). But its effect is profound: it completely prevents the action expert from corrupting the backbone's representations.
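The effect of the stop-gradient can be checked by hand on a toy model. The sketch below is not the paper's implementation: it assumes a one-weight "backbone" and a one-weight "action expert", uses a squared-error loss as a stand-in for flow matching, and writes out the chain rule explicitly instead of relying on an autograd library. The `insulate` flag plays the role that `features.detach()` plays in PyTorch.

```python
from typing import Tuple


def expert_loss_grads(w_backbone: float, w_expert: float, x: float,
                      target: float, insulate: bool) -> Tuple[float, float, float]:
    """Hand-derived gradients of a toy action-expert loss.

    feature = w_backbone * x        (toy "backbone")
    pred    = w_expert * feature    (toy "action expert")
    loss    = (pred - target) ** 2  (stand-in for the flow matching loss)
    """
    feature = w_backbone * x
    pred = w_expert * feature
    loss = (pred - target) ** 2

    dloss_dpred = 2.0 * (pred - target)
    grad_expert = dloss_dpred * feature       # d loss / d w_expert
    dloss_dfeature = dloss_dpred * w_expert   # chain rule through the feature
    if insulate:
        # Stop-gradient: the feature is treated as a constant in the backward
        # pass, so nothing reaches the backbone weight.
        dloss_dfeature = 0.0
    grad_backbone = dloss_dfeature * x        # d loss / d w_backbone
    return loss, grad_expert, grad_backbone


# A badly initialized expert weight (-3.0) stands in for random initialization.
standard = expert_loss_grads(1.0, -3.0, 2.0, 1.0, insulate=False)
insulated = expert_loss_grads(1.0, -3.0, 2.0, 1.0, insulate=True)
print("standard :", standard)
print("insulated:", insulated)
```

The forward pass, the loss, and the expert's own gradient are identical in both runs. Only the backbone gradient differs: it is large in the standard case because the poorly initialized expert weight is multiplied into it, and exactly zero when insulated.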

How is knowledge insulation implemented in practice?

A stop-gradient operation (detach) prevents the action expert's flow matching loss from sending gradients back into the VLM backbone
The backbone and action expert are trained on separate GPUs
The backbone weights are saved and restored after each step

Chapter 3: Knowledge Insulation

The name "knowledge insulation" comes from electrical insulation — preventing unwanted current flow. Here, we prevent unwanted gradient flow from damaging the VLM's pre-trained knowledge.

But insulation alone isn't enough. If you just block gradients and only train the backbone with discrete tokens, you lose the ability to output continuous actions. The trick is the dual architecture:

Component	Training Signal	Purpose
VLM Backbone	Next-token prediction (discrete)	Learn representations, understand scenes, follow instructions
Action Expert	Flow matching (continuous)	Output precise, smooth motor commands at high frequency

At inference time, the backbone runs a single forward pass to produce features, and the action expert turns those features into continuous actions directly, with no autoregressive token decoding. This makes inference fast (no sequential token generation) and precise (continuous, not discretized).

The triple win: (1) Train FAST because the backbone uses next-token prediction, which is stable and efficient. (2) Run FAST because actions come from the smaller action expert rather than from token-by-token decoding. (3) Generalize BETTER because the backbone's VLM knowledge is preserved, not corrupted.
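The inference loop of a flow matching policy can be sketched as a fixed number of Euler steps that carry a noise sample to an action. The example below is a toy under stated assumptions: it uses the convention that t=0 is pure noise and t=1 is the action, a step count of 10 chosen for illustration, and a hypothetical closed-form `toy_velocity` in place of a trained action expert (which would also take the backbone's features as input).

```python
import random
from typing import Callable, List

Velocity = Callable[[List[float], float], List[float]]


def sample_action(velocity: Velocity, action_dim: int,
                  num_steps: int = 10, seed: int = 0) -> List[float]:
    """Integrate da/dt = velocity(a, t) from t=0 (noise) to t=1 (action) with Euler steps."""
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step / num_steps
        v = velocity(action, t)
        action = [a + dt * v_i for a, v_i in zip(action, v)]
    return action


# Hypothetical stand-in for a trained action expert: the exact velocity field
# of straight-line paths that all end at one fixed target action.
TARGET = [0.25, -0.5, 0.75]


def toy_velocity(action: List[float], t: float) -> List[float]:
    return [(goal - a) / (1.0 - t) for goal, a in zip(TARGET, action)]


result = sample_action(toy_velocity, action_dim=len(TARGET))
print(result)
```

The number of network evaluations is set by `num_steps`, not by the number of action dimensions or the length of the action chunk, which is the contrast with decoding one token at a time.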

At inference time, which component generates the actual robot actions?

The VLM backbone via autoregressive token decoding
Only the action expert via flow matching: fast and continuous, no sequential decoding needed
Both components equally

Chapter 4: Dual Losses

The total training loss has two components that train different parts of the model:

L_total = L_NTP(backbone) + L_FM(action expert)

L_NTP (Next Token Prediction): standard cross-entropy loss on discrete action tokens and VLM co-training data. This trains the backbone to produce good features.

L_FM (Flow Matching): the flow matching denoising loss on continuous actions. This trains the action expert. The backbone features it receives are detached with .detach(), so no gradients flow back.

A critical finding: having both losses is essential. Removing either one significantly degrades performance. The discrete loss gives the backbone a stable learning signal. The continuous loss gives the action expert precision. You need both.
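The two loss terms can be written out for a single training example. This is a minimal sketch, not the paper's code: it assumes one token position for L_NTP, the straight-line flow matching convention (noise at t=0, action at t=1, target velocity equal to action minus noise), and an unweighted sum as in the formula above. All function names are illustrative.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target_token: int) -> float:
    """L_NTP for one position: negative log-softmax of the target token."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[target_token]


def noisy_action(action: List[float], noise: List[float], t: float) -> List[float]:
    """Point on the straight line from noise (t=0) to action (t=1); the expert's input."""
    return [(1.0 - t) * n + t * a for a, n in zip(action, noise)]


def flow_matching_loss(predicted_velocity: List[float],
                       action: List[float], noise: List[float]) -> float:
    """L_FM: mean squared error against the target velocity (action - noise)."""
    target = [a - n for a, n in zip(action, noise)]
    return sum((p - u) ** 2 for p, u in zip(predicted_velocity, target)) / len(target)


def total_loss(logits: List[float], target_token: int,
               predicted_velocity: List[float],
               action: List[float], noise: List[float]) -> float:
    """Unweighted sum of the two terms; any weighting would be an extra assumption."""
    return cross_entropy(logits, target_token) + flow_matching_loss(
        predicted_velocity, action, noise)


loss = total_loss(logits=[0.0, 0.0], target_token=0,
                  predicted_velocity=[1.0, 1.0],
                  action=[1.0, 2.0], noise=[0.0, 0.0])
print(loss)
```

Adding the two numbers does not by itself decide which parameters each term updates. That routing comes from the stop-gradient on the backbone features: with it in place, L_FM only reaches the action expert, and the backbone is shaped by L_NTP alone.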

VLM co-training bonus: Because the backbone uses standard next-token prediction, you can co-train it on general VLM data (image captioning, question answering) alongside robot data. This further preserves and even enhances the backbone's general understanding — the same trick that helps pi-0.5 generalize to new homes.

What happens if you remove the discrete token loss and only use flow matching?

Performance degrades significantly: the backbone gets corrupted by random gradients from the uninitialized action expert, losing its pre-trained knowledge
Nothing changes: flow matching is sufficient
Training becomes faster

Chapter 5: Training Speed

Knowledge insulation doesn't just improve quality — it dramatically accelerates training. The paper reports up to 5x faster convergence compared to standard VLA training.

Why? Two reasons:

Stable backbone learning: The discrete token objective is clean and well-conditioned — no random gradients from flow matching disrupting the optimization landscape. The backbone converges smoothly.
Better initial features for the action expert: Because the backbone learns good representations quickly (via its native NTP objective), the action expert receives useful features from the start. It doesn't waste compute trying to learn from corrupted features.

Why does knowledge insulation speed up training by up to 5x?

The backbone converges faster with stable NTP gradients, and the action expert benefits from better features sooner: no time wasted fighting corrupted representations
It uses fewer parameters
It uses a faster optimizer

Chapter 6: Results

The paper evaluates on complex real-world tasks (mobile bimanual manipulation) and standard benchmarks (DROID, LIBERO):

[Interactive figure: Knowledge Insulation vs Standard Training]

Key findings:

Train fast: Up to 5x faster convergence to the same performance level
Run fast: The action expert produces continuous actions directly, with no autoregressive decoding
Generalize better: Preserved VLM knowledge enables better transfer to new scenes, new objects, and new instructions

The ablation is decisive: When you remove knowledge insulation (let gradients flow freely), every metric gets worse — training speed, task success, and generalization. The insulation isn't optional; it's essential.

Which THREE properties does knowledge insulation improve simultaneously?

Training speed, inference speed, and generalization: the rare triple win
Only training speed
Only generalization

Chapter 7: VLM Co-training

A bonus benefit of knowledge insulation: because the backbone uses standard next-token prediction, you can seamlessly mix in general VLM training data such as image captioning, visual question answering, and interleaved image-text. This is far less natural when the backbone's only training signal is the flow matching gradient, because that data has no actions to denoise.

VLM co-training further strengthens the backbone's representations and prevents forgetting. The model stays sharp on visual understanding while learning robot control — the best of both worlds.
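Because every data source reduces to token sequences under the same next-token loss, co-training can be as simple as sampling each batch from a weighted mixture. The sketch below illustrates that idea only: the source names, the example strings, and the mixture weights are hypothetical placeholders, not values from the paper.

```python
import random
from typing import Dict, Iterator, List, Tuple


def mixed_batches(sources: Dict[str, List[str]], weights: Dict[str, float],
                  batch_size: int, seed: int = 0) -> Iterator[List[Tuple[str, str]]]:
    """Yield batches whose examples are drawn from each source in proportion to its weight."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    while True:
        chosen = rng.choices(names, weights=probs, k=batch_size)
        # Each example is returned with its source name so the mix can be inspected.
        yield [(name, rng.choice(sources[name])) for name in chosen]


# Hypothetical placeholder data: every entry stands for an already tokenized sequence.
SOURCES = {
    "robot_action_tokens": ["<img> pick up the plate <act_17> <act_203>"],
    "captioning": ["<img> a kitchen counter with two plates"],
    "vqa": ["<img> Q: how many cups? A: three"],
}
WEIGHTS = {"robot_action_tokens": 0.5, "captioning": 0.25, "vqa": 0.25}  # assumed mix

batch = next(mixed_batches(SOURCES, WEIGHTS, batch_size=8))
for source_name, example in batch:
    print(source_name, "|", example)
```

Every element of the batch feeds the same cross-entropy loss on the backbone; only the robot examples additionally provide continuous actions for the action expert's flow matching loss.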

Why does knowledge insulation enable VLM co-training?

Because the backbone uses next-token prediction, the same objective as VLM training, so general VLM data can be mixed in naturally
Because the action expert is removed during co-training
VLM co-training is always possible regardless of architecture

Chapter 8: Connections

Knowledge insulation is now a core component of the pi-0 model family's training recipe:

Paper	What It Contributes
pi-0	VLM backbone + flow matching action expert
FAST	Efficient discrete action tokenization
KI-VLA	Stop-gradient insulation between backbone and action expert
pi-0.5	All three combined → open-world generalization

The progression is clear: FAST enables efficient discrete tokens for the backbone. KI-VLA prevents the action expert from corrupting those representations. Together, they make the two-stage training recipe (discrete pre-training → flow matching post-training) work.

Related: pi-0 • FAST • pi-0.5
