The robot foundation model that folds laundry, busses tables, and packs groceries — by combining a vision-language model with flow matching to generate smooth, continuous robot actions at 50 Hz.
Robot learning has a versatility problem. We can train a robot to pick up a specific cup in a specific lab. We can even train it on many cups in many labs. But ask it to fold a shirt — a task that requires understanding fabric dynamics, planning a sequence of folds, adapting to arbitrary initial configurations — and it falls apart.
The issue isn't intelligence. Vision-language models like GPT-4V can describe how to fold a shirt in perfect detail. The issue is that no amount of language understanding translates into the precise, high-frequency motor commands needed to actually manipulate fabric at 50 Hz.
Previous VLAs (like RT-2) solved this by discretizing actions into tokens — binning each action dimension into 256 values. This works for coarse pick-and-place, but discretization destroys the smoothness needed for dexterous manipulation. Try folding a shirt with only 256 possible positions per joint per timestep. You can't.
Industrial robots (in factories) run at >1000 Hz with sub-millimeter precision. They don't need VLMs because their environment is perfectly known — the same part arrives at the same position every time. The challenge VLAs solve is the opposite: handling unpredictable environments where the robot must understand what it sees to decide what to do. This requires a VLM. But the VLM is huge (billions of parameters), which means slower inference, which means lower control frequency.
pi-0 operates at the intersection: fast enough for smooth manipulation (50 Hz), smart enough to understand language and novel objects (VLM backbone). This dual requirement is what makes the engineering so constrained.
A robot arm with 7 joints needs a new position command for every joint, every 20 milliseconds. That's 7 floating-point numbers x 50 times per second = 350 continuous values per second. Each value needs sub-degree precision — the difference between a clean fold and a crumpled mess is often less than 2 degrees of wrist rotation.
With RT-2's 256-bin discretization, the minimum resolution per joint is approximately 360 degrees / 256 = 1.4 degrees per bin. Sounds fine in isolation. But errors compound across joints and timesteps. Over a 10-second fold sequence (500 timesteps x 7 joints), the accumulated quantization error makes smooth trajectories impossible.
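To make the resolution argument concrete, here is a minimal sketch that quantizes a slow, smooth joint ramp into 256 bins. The 360-degree joint range comes from the text; the ramp itself (10 to 12 degrees over one second) and the `quantize` helper are illustrative assumptions, not anything from RT-2's actual tokenizer.

```python
NUM_BINS = 256            # RT-2-style bins per action dimension
JOINT_RANGE_DEG = 360.0   # assumed full joint range, as in the text

bin_width = JOINT_RANGE_DEG / NUM_BINS  # degrees covered by one bin


def quantize(angle_deg: float) -> float:
    """Snap an angle in [0, 360) to the centre of its bin."""
    index = min(int(angle_deg / bin_width), NUM_BINS - 1)
    return (index + 0.5) * bin_width


# Hypothetical smooth motion: a 2-degree ramp sampled at 50 Hz for 1 second.
smooth = [10.0 + 2.0 * t / 49 for t in range(50)]
stepped = [quantize(a) for a in smooth]

distinct_levels = len(set(stepped))
max_error = max(abs(a - q) for a, q in zip(smooth, stepped))

print(f"bin width: {bin_width:.4f} deg")
print(f"distinct commanded values over 50 steps: {distinct_levels}")
print(f"worst-case error: {max_error:.3f} deg")
```

The 50 distinct targets of the smooth ramp collapse onto a handful of bin centres, so the commanded trajectory becomes a staircase rather than a ramp.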
[Interactive figure: a smooth joint trajectory quantized at an adjustable bin resolution. At 256 bins the quantized curve is coarse and jerky; flow matching produces the smooth blue curve.]
The damage isn't just "slightly less smooth." Discretization causes three specific failure modes in dexterous manipulation:
pi-0's insight is to treat the robot foundation model exactly like a language foundation model — but with a crucial architectural twist for actions.
In language, the recipe is: (1) pre-train a large model on diverse internet text, (2) post-train (fine-tune) on carefully curated data for the desired behavior. GPT-4 follows this recipe. So does Claude.
pi-0 follows the same recipe for robots: (1) start with a pre-trained VLM (PaliGemma) that already understands images and text, (2) pre-train on diverse robot data from 7 robot types and 68 tasks, (3) post-train on high-quality data for specific dexterous tasks.
The twist: instead of predicting the next discrete token (like language models do), pi-0 uses flow matching to generate continuous action distributions. This gives it the precision to control robots at 50 Hz for tasks like folding laundry — something no discrete-token VLA can do.
Let's trace a single inference step through pi-0. The robot has two wrist cameras and one base camera. Here's exactly what happens:
Both diffusion (DDPM) and flow matching can generate continuous outputs. But there's a critical engineering difference: DDPM needs 50-1000 denoising steps because its paths are curved (the model must follow a complex SDE). Flow matching uses straight-line paths (optimal transport) from noise to data, requiring only 5-10 steps for comparable quality.
At 50 Hz control: if each denoising step takes ~2ms of GPU time, then 10 steps = 20ms per action chunk (fits in the 20ms budget). 100 steps = 200ms — already too slow for real-time. This is why pi-0 uses flow matching. It's not a minor preference; it's a hard engineering constraint.
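The budget arithmetic above is simple enough to check in a few lines. This sketch just restates the text's numbers; the ~2 ms per denoising step is the text's assumption, and `chunk_latency_ms` is a hypothetical helper name.

```python
CONTROL_HZ = 50
STEP_MS = 2.0  # assumed GPU time per denoising step, taken from the text

budget_ms = 1000.0 / CONTROL_HZ  # time available per control tick


def chunk_latency_ms(num_steps: int) -> float:
    """Latency of generating one action chunk with the given number of steps."""
    return num_steps * STEP_MS


for steps in (10, 100):
    latency = chunk_latency_ms(steps)
    verdict = "fits" if latency <= budget_ms else "exceeds"
    print(f"{steps:>3} steps -> {latency:.0f} ms ({verdict} the {budget_ms:.0f} ms budget)")
```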
pi-0 is built on PaliGemma, a vision-language model from Google that combines a SigLIP image encoder with a Gemma language model. PaliGemma was pre-trained on billions of image-text pairs from the web, so it already understands visual concepts, spatial relationships, and natural language.
Why not train from scratch? Because a VLM pre-trained on web data gives you an enormous head start. It already knows what a "cup" looks like, that "left of the plate" means a spatial relationship, and how to follow instructions like "pick up the red one." Without this foundation, a robot policy would need to learn all of this from relatively scarce robot demonstration data.
PaliGemma uses late fusion: the image encoder processes images independently, then the resulting visual tokens are concatenated with text tokens and fed to the language model transformer. pi-0 extends this by adding a third modality, robot proprioception and actions, alongside vision and language.
There are two ways to combine vision and language in a multimodal model:
PaliGemma uses late fusion because it allows the image encoder to be pre-trained independently (SigLIP on image-text contrastive learning) and the language model to be pre-trained independently (Gemma on text). You get the best of both specialists. The downside: the image encoder can't attend to language, so it processes images context-free. For robot tasks, this is usually fine because the image content (what's on the table) doesn't depend on the language instruction (what to do with it).
For pi-0, late fusion has a bonus advantage: the SigLIP encoder can be kept frozen during post-training (no gradients needed), saving memory and preventing visual representation degradation. In the post-training setup described in the training table later on, the whole VLM is frozen and only the action expert receives gradients.
Let's be precise about what flows through the network:
| Component | Architecture | Output Shape |
|---|---|---|
| SigLIP encoder | ViT-So400m/14, 400M params | 256 tokens x 1152 dims per image |
| Gemma backbone | 2B params, 18 layers, 2048 hidden | Contextualized embeddings |
| Action tokens | Learned embeddings | 50 tokens x 2048 dims (action chunk) |
| Action head MLP | 2-layer MLP per token | 7 or 14 dims (joint angles + gripper) |
Total model: ~3B parameters. The VLM backbone is ~2.4B, the action expert adds ~600M.
The camera produces 640x480 RGB images. Before reaching SigLIP, each image is:
Why 224x224? This is SigLIP's native resolution from pre-training. Higher resolution (448x448) would give 1024 patches per image — better for small objects but 4x the compute cost. The paper found 224x224 sufficient for manipulation-scale objects.
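The patch counts quoted above follow directly from the patch size. A quick sketch, assuming the "/14" in ViT-So400m/14 means 14x14-pixel patches and that images are square (`num_patches` is a hypothetical helper):

```python
PATCH_SIZE = 14  # from the "/14" in ViT-So400m/14


def num_patches(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of non-overlapping square patches in a square image."""
    side = resolution // patch_size
    return side * side


low = num_patches(224)
high = num_patches(448)
print(low, high, high / low)
```

Doubling the resolution quadruples the number of visual tokens per image, which is where the 4x figure comes from.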
The VLM backbone provides:
What it does NOT provide (and must be learned during robot pre-training):
This is the core technical contribution. Instead of discretizing actions into bins (like RT-2's 256 tokens per dimension), pi-0 uses conditional flow matching to model the continuous distribution of actions.
Flow matching learns a velocity field that transports random noise into a desired distribution, in this case the distribution of correct robot actions. During inference, you start with random noise and follow the velocity field to produce a clean action:

$$\frac{dx_\tau}{d\tau} = v_\theta(x_\tau, o_t)$$

where $x_\tau$ is the noisy action at flow time $\tau$, $o_t$ is the observation (images + language + proprioception), and $v_\theta$ is the learned velocity field.
You might wonder: why not just predict the action chunk directly via regression (MSE loss)? The answer is multi-modality. Consider "pick up the cup" — there are multiple valid grasps (top-down, side grasp, pinch grasp). A regression model would predict the mean of these grasps: a halfway-between grasp that doesn't work for any approach. Flow matching (like diffusion) can sample from a multi-modal distribution — on each inference call it might produce a top-down grasp OR a side grasp, both valid, but never the mean.
This matters less for simple tasks (there's usually one best approach to "move left"). But for dexterous manipulation with multiple valid strategies, it's the difference between a model that works and one that produces garbage. In the paper's ablation, replacing flow matching with direct regression drops success on multi-strategy tasks by ~25 pp while barely affecting single-strategy tasks.
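The "regression predicts the mean" failure is easy to reproduce with toy numbers. In this sketch the two grasp strategies are encoded as a hypothetical one-dimensional approach angle of -1.0 or +1.0; the data are synthetic and only illustrate the averaging effect, not the paper's ablation.

```python
import random

random.seed(0)

# Synthetic demonstrations: half use one grasp (-1.0), half the other (+1.0),
# each with a little noise. Units are arbitrary.
demos = [random.choice([-1.0, 1.0]) + random.gauss(0.0, 0.05) for _ in range(1000)]


def mse(prediction: float) -> float:
    """Mean squared error of a single constant prediction against all demos."""
    return sum((prediction - d) ** 2 for d in demos) / len(demos)


# The constant that minimises MSE is the sample mean, which sits between the modes.
regression_output = sum(demos) / len(demos)
distance_to_nearest_mode = min(abs(regression_output - 1.0), abs(regression_output + 1.0))

# A generative model samples from the data distribution instead; drawing a demo
# at random stands in for that here.
sampled_action = random.choice(demos)

print(f"regression output: {regression_output:+.3f}")
print(f"distance to nearest valid grasp: {distance_to_nearest_mode:.3f}")
print(f"one sampled action: {sampled_action:+.3f}")
```

The regression output has the lowest squared error of any single prediction, yet it lands far from both valid grasps, while every sample stays close to one of them.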
During training, the process is simple. Let's walk through one training step with actual numbers:
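As a rough illustration, one training step of this kind can be sketched as follows. This assumes the common straight-line (rectified-flow) convention where flow time 0 is pure noise and 1 is clean data, and it samples the flow time uniformly; the paper's exact sign convention and time-sampling distribution may differ. The demonstration chunk is synthetic and the "model" is a zero-prediction placeholder.

```python
import random

random.seed(0)
H, D = 50, 7  # chunk length and action dimension, as in the text


def flow_matching_targets(action, noise, tau):
    """Return the interpolated input x_tau and the velocity the model should predict."""
    x_tau = [[(1 - tau) * n + tau * a for n, a in zip(n_row, a_row)]
             for n_row, a_row in zip(noise, action)]
    target_v = [[a - n for n, a in zip(n_row, a_row)]
                for n_row, a_row in zip(noise, action)]
    return x_tau, target_v


def mse(pred, target):
    """Mean squared error over every element of two [H, D] nested lists."""
    sq = [(p - t) ** 2 for p_row, t_row in zip(pred, target) for p, t in zip(p_row, t_row)]
    return sum(sq) / len(sq)


# Stand-ins for one demonstration chunk and one noise draw.
action = [[0.01 * t for _ in range(D)] for t in range(H)]
noise = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(H)]
tau = random.random()  # flow time in [0, 1)

x_tau, target_v = flow_matching_targets(action, noise, tau)

# A real model would compute v_theta(x_tau, observation). A zero prediction is
# used here only so that a loss value can be printed.
predicted_v = [[0.0] * D for _ in range(H)]
loss = mse(predicted_v, target_v)
print(f"tau={tau:.3f}  loss={loss:.3f}")
```

The regression target is the same at every flow time for a given (action, noise) pair, which is what makes the path from noise to data a straight line.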
At inference time, we start with pure noise $x_0 \sim \mathcal{N}(0, I)$ of shape [50, 7] and take K = 10 Euler steps:

$$x_{k+1} = x_k + \frac{1}{K}\, v_\theta(x_k, o_t), \qquad k = 0, \dots, K-1$$

After 10 steps, $x_{10}$ is a clean action chunk. Total time: ~2ms per step x 10 steps = 20ms, which just fits the 20ms budget for 50 Hz control.
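The sampling loop can be sketched in a few lines. Since there is no trained network here, the velocity field is a toy stand-in that points straight at a known, hypothetical target chunk; `toy_velocity`, `sample_chunk`, and `target` are illustrative names, and a real model would condition on images, language, and proprioception instead.

```python
import random

random.seed(0)
H, D, K = 50, 7, 10  # chunk length, action dims, Euler steps (from the text)

# Hypothetical clean chunk that the toy field transports noise toward.
target = [[0.02 * t for _ in range(D)] for t in range(H)]


def toy_velocity(x, tau):
    """Exact straight-line field toward `target`: (target - x) / (1 - tau)."""
    return [[(g - v) / (1.0 - tau) for v, g in zip(x_row, g_row)]
            for x_row, g_row in zip(x, target)]


def sample_chunk(velocity_fn, num_steps=K):
    """Integrate the velocity field from noise (tau=0) to data (tau=1) with Euler steps."""
    x = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(H)]  # x_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        tau = k * dt  # never reaches 1.0, so the toy field stays finite
        v = velocity_fn(x, tau)
        x = [[xi + dt * vi for xi, vi in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
    return x


chunk = sample_chunk(toy_velocity)
max_err = max(abs(c - g) for c_row, g_row in zip(chunk, target) for c, g in zip(c_row, g_row))
print(f"max deviation from target after {K} steps: {max_err:.2e}")
```

Because the toy paths are perfectly straight, ten Euler steps land on the target up to floating-point error. A learned field is only approximately straight, which is why a handful of steps is needed rather than one.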
Let's trace one denoising step for a single joint (say, the shoulder pitch). We want the shoulder to move from its current position (0.5 rad) to a target (1.2 rad) over 50 timesteps:
The key insight: the velocity field doesn't predict the final trajectory directly. It predicts the direction to move in at each intermediate noise level. This is easier to learn (local corrections vs. global prediction) and produces better results with fewer steps.
[Interactive animation: noise is transported to a clean action trajectory by the learned velocity field.]
| Property | Flow Matching | DDPM (Diffusion) |
|---|---|---|
| Path shape | Straight lines (OT) | Curved (SDE) |
| Steps needed | 5-10 | 50-1000 |
| Inference latency | ~20ms | ~100-2000ms |
| Real-time capable? | Yes (50 Hz) | Marginal (5-10 Hz) |
| Training stability | Good (no noise schedule) | Sensitive to schedule |
Here's a subtle but important design choice. pi-0 doesn't just run everything through the same transformer weights. It uses two sets of weights — inspired by Mixture of Experts architectures.
The VLM backbone processes image and language tokens using the original PaliGemma weights. The action expert is a separate set of transformer weights that processes proprioceptive state and action tokens.
Both experts share the same attention mechanism — action tokens attend to image and language tokens, and vice versa. But the feedforward layers (the "thinking" computation) use different weights for different token types.
The VLM backbone has been pre-trained on billions of image-text pairs. Its weights encode visual and linguistic knowledge. If you force action tokens through these same weights, you either degrade the VLM's knowledge (catastrophic forgetting) or constrain the action representation to fit a space designed for language.
The action expert lets the model develop action-specific representations without corrupting the VLM's pre-trained knowledge. It's like having a bilingual translator — one brain for understanding the scene (VLM), a separate specialist for generating motor commands (action expert).
The attention pattern is carefully designed:
This asymmetric mask is crucial. The VLM processes the scene exactly as it would without any action tokens — no interference. The action tokens then "read" the VLM's understanding and use it to generate coordinated motion across all 50 timesteps.
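One way to picture the asymmetric mask is as a boolean matrix over the token sequence. This is a minimal sketch under stated assumptions: image and language tokens form a "prefix" block that attends only to itself, action tokens attend to everything, and attention inside each block is bidirectional. The token counts and `build_mask` are hypothetical, and the proprioceptive state token is left out for brevity.

```python
NUM_PREFIX = 2 * 256 + 16  # e.g. 2 images x 256 visual tokens + 16 text tokens (hypothetical)
NUM_ACTION = 50            # one token per action in the chunk


def build_mask(num_prefix: int, num_action: int):
    """Return mask[q][k], True when query token q may attend to key token k."""
    total = num_prefix + num_action
    mask = []
    for q in range(total):
        if q < num_prefix:
            # Prefix (VLM) tokens never look at action tokens.
            row = [k < num_prefix for k in range(total)]
        else:
            # Action tokens read the whole prefix and each other.
            row = [True] * total
        mask.append(row)
    return mask


mask = build_mask(NUM_PREFIX, NUM_ACTION)
prefix_sees_action = any(mask[0][NUM_PREFIX:])
action_sees_prefix = all(mask[NUM_PREFIX][:NUM_PREFIX])
print(prefix_sees_action, action_sees_prefix)
```

A practical side effect of this layout is that the prefix computation does not depend on the noisy actions, so it could in principle be computed once and reused across all denoising steps.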
The choice of chunk size H=50 at 50 Hz = exactly 1 second of motion is deliberate:
In practice, pi-0 uses a sliding execution window: generate 50 actions, execute the first 25 (0.5 seconds), then re-plan with fresh camera images. This gives the smoothness of open-loop chunks with the responsiveness of closed-loop control.
Action chunking creates a tension between two control paradigms:
pi-0's sliding window is a hybrid: execute half the chunk open-loop (smooth), then re-plan with fresh observations (reactive). This works because most manipulation tasks are semi-static — objects don't move during a 0.5-second reach. For truly dynamic tasks (human handover, moving conveyor), the chunk size would need to be smaller.
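The sliding execution window is essentially a receding-horizon loop. The sketch below uses hypothetical stand-ins (`fake_observe`, `fake_policy`) so it runs without a robot, and it is sequential for clarity; a real deployment would overlap computing the next chunk with executing the current one.

```python
CHUNK_SIZE = 50      # actions generated per inference call
EXECUTE_STEPS = 25   # actions executed before re-planning


def run_sliding_window(policy, observe, send_command, num_replans):
    """Generate a chunk, execute its first half, then re-plan from a fresh observation."""
    for _ in range(num_replans):
        chunk = policy(observe())
        assert len(chunk) == CHUNK_SIZE
        for action in chunk[:EXECUTE_STEPS]:
            send_command(action)


# Hypothetical stand-ins: the "observation" is just the current step count,
# and the "policy" plans the next 50 step indices.
observations_taken = []
commands_sent = []


def fake_observe():
    observations_taken.append(len(commands_sent))
    return {"step": len(commands_sent)}


def fake_policy(obs):
    return [obs["step"] + i for i in range(CHUNK_SIZE)]


run_sliding_window(fake_policy, fake_observe, commands_sent.append, num_replans=4)
print(observations_taken)
print(len(commands_sent))
```

The second half of every chunk is planned but discarded, which is the price paid for re-planning on fresh images twice per second.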
The output of the action expert is a tensor of shape [50, D_action] where D_action depends on the robot. For a 7-DOF arm + gripper: D_action = 8. The values are delta joint angles (how much to move each joint from its current position), normalized to [-1, 1] and then scaled by robot-specific action limits. The gripper dimension is binary (open/close) but represented as a continuous value thresholded at 0.5.
pi-0 is pre-trained on data from 7 different robot configurations spanning 68 tasks. These include single-arm robots (UR5e, Franka), dual-arm systems, and mobile manipulators — each with different joint configurations, gripper types, and action spaces.
Additionally, pi-0 incorporates the entire Open X-Embodiment (OXE) dataset, which adds data from 22 more robot types. This gives the model exposure to an enormous variety of manipulation scenarios.
Different robots have different action dimensions (a 6-DOF arm vs a 7-DOF arm vs a mobile base + arm). pi-0 handles this by using robot-specific action tokenizers. The proprioceptive state and action dimensions are padded or projected to a common size, and the model learns which dimensions are relevant for which robot.
Concretely, the action space is standardized to a fixed maximum dimensionality (14 dimensions — enough for a bimanual system). Robots with fewer DOF simply zero-pad the unused dimensions. A robot identifier token is prepended to the language instruction so the model knows which dimensions are active.
| Robot Type | Config | Tasks | Hours |
|---|---|---|---|
| UR5e (single arm) | 6-DOF + gripper | Bussing, grocery, table setting | ~100 |
| Franka (single arm) | 7-DOF + gripper | Drawer packing, stacking | ~80 |
| ARX dual-arm | 2 x 6-DOF + 2 grippers | Laundry folding | ~60 |
| Mobile manipulator | Arm + 2D base | Mobile bussing, multi-room | ~50 |
| ALOHA bimanual | 2 x 6-DOF + 2 grippers | Cooking, cleaning | ~40 |
| Kuka arm | 7-DOF + gripper | Bin picking | ~30 |
| Sawyer | 7-DOF + gripper | Object placement | ~20 |
| + OXE dataset | 22 more types | Diverse manipulation | ~500 |
Total pre-training data: more than 10,000 hours of robot manipulation across all sources.
Let's trace how the same training batch handles two different robots:
The language prefix looks like: "[robot=ur5e_single] pick up the cup" vs "[robot=arx_bimanual] fold the towel". The model learns that for ur5e, only dimensions 1-7 are meaningful; for ARX, all 14 are active. Predicted values in padded dimensions are ignored.
This is elegant but has a limitation: the padded zeros are still processed by the action expert, wasting compute. Future work could use sparse attention to skip inactive dimensions.
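Here is a small sketch of the padding scheme. The 14-dimensional shared action space comes from the text; `pad_action` and `masked_mse` are hypothetical helpers, and masking the loss is an assumption about how "padded dimensions are ignored" might be realised, not a confirmed detail of pi-0.

```python
MAX_ACTION_DIM = 14  # shared action width from the text (enough for a bimanual robot)


def pad_action(action, max_dim=MAX_ACTION_DIM):
    """Zero-pad one action vector and return it with a 1/0 mask of active dims."""
    if len(action) > max_dim:
        raise ValueError("action has more dimensions than the shared action space")
    padding = max_dim - len(action)
    return list(action) + [0.0] * padding, [1.0] * len(action) + [0.0] * padding


def masked_mse(predicted, target, mask):
    """Mean squared error over active dimensions only."""
    total = sum(m * (p - t) ** 2 for p, t, m in zip(predicted, target, mask))
    return total / sum(mask)


ur5e_action = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0]  # 6 joints + gripper
padded, mask = pad_action(ur5e_action)

# Arbitrary values in the padded slots do not change the loss.
prediction = padded[:7] + [9.9] * 7
print(len(padded), sum(mask), masked_mse(prediction, padded, mask))
```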
Cross-embodiment training transfers:
What does NOT transfer and must be learned per-embodiment:
Like language models, the training recipe is arguably more important than the architecture. pi-0 uses a two-stage recipe that mirrors the pre-training/post-training split in LLMs:
Train on the full diverse data mixture: all 7 robot types, all 68 tasks, plus OXE data. Use both coarse task-level language labels ("fold the towel") and fine-grained segment annotations (~2-second snippets like "grasp the corner"). This stage runs for 700K gradient steps.
The goal: a base model with broad capabilities and generalization, but not necessarily expert at any one task.
Fine-tune on high-quality curated data for specific downstream tasks. For complex tasks like laundry folding, this uses larger carefully collected datasets. For simpler tasks, even small amounts of post-training data suffice.
The goal: specialize the base model into an expert at a specific dexterous task.
| Parameter | Pre-training | Post-training |
|---|---|---|
| Hardware | 64 TPU v5e pods | 16 TPU v5e pods |
| Batch size | 2048 | 256 |
| Learning rate | 1e-4 (cosine decay) | 1e-5 (constant) |
| Steps | 700K | 50K-100K per task |
| Duration | ~7 days | ~1 day per task |
| VLM backbone | Unfrozen (full fine-tune) | Frozen (only action expert trains) |
| Loss | Flow matching + language modeling | Flow matching only |
During pre-training, both the VLM and action expert are trained jointly — the VLM adapts to robot observations while the action expert learns to generate actions. During post-training, the VLM is frozen to prevent catastrophic forgetting, and only the action expert's weights are updated.
Pre-training uses a combined loss:

$$L = L_{\text{flow}} + \lambda\, L_{\text{language}}$$

where $L_{\text{flow}}$ is the flow matching loss (predict the velocity field) and $L_{\text{language}}$ is the standard next-token prediction loss on the language tokens. The language loss keeps the VLM's language understanding sharp while it is trained on robot data. $\lambda = 0.1$ in practice, so robot learning dominates but language isn't forgotten.
Without freezing during post-training, the model exhibits catastrophic forgetting within 10K steps: it becomes excellent at the fine-tuned task but loses the ability to generalize, follow novel instructions, or recover from errors. Freezing the VLM means post-training specializes the motor system without degrading the understanding system.
pi-0 uses delta joint angles as its primary action representation — each action is "move joint 1 by +0.02 radians, joint 2 by -0.01 radians, ..." This is a deliberate choice over end-effector (Cartesian) control:
The delta values are normalized to [-1, 1] where -1 and +1 correspond to the maximum per-step joint velocity (typically 0.1-0.5 radians per step, depending on the joint). The normalization is robot-specific and handled by the action tokenizer.
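Turning a normalized model output back into joint targets can be sketched like this. The per-joint limits are hypothetical values picked from the 0.1-0.5 radian range mentioned above, and `denormalize` and `apply_deltas` are illustrative helper names rather than pi-0's actual interface.

```python
# Hypothetical per-joint limits in radians per control step.
MAX_DELTA_RAD = [0.5, 0.5, 0.4, 0.4, 0.2, 0.2, 0.1]


def denormalize(normalized_deltas, limits=MAX_DELTA_RAD):
    """Clip each value to [-1, 1], then scale it by that joint's per-step limit."""
    return [max(-1.0, min(1.0, n)) * lim for n, lim in zip(normalized_deltas, limits)]


def apply_deltas(joint_positions, deltas):
    """Add the per-step deltas to the current joint positions."""
    return [q + d for q, d in zip(joint_positions, deltas)]


current = [0.0, 0.5, -0.3, 1.0, 0.0, 0.2, 0.0]
model_output = [0.1, -0.04, 0.0, 1.5, -1.0, 0.0, 0.5]  # 1.5 is out of range and gets clipped

deltas = denormalize(model_output)
next_target = apply_deltas(current, deltas)
print(deltas)
print(next_target)
```

Clipping before scaling means an out-of-range prediction saturates at the joint's limit instead of commanding an unsafe jump.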
To understand pi-0's contribution, compare it to models available at the same time:
| Property | RT-2 (Google) | OpenVLA (Stanford) | Octo (Berkeley) | pi-0 |
|---|---|---|---|---|
| Params | 55B | 7B | 93M | 3B |
| Action type | Discrete (256 bins) | Discrete (256 bins) | Diffusion | Flow matching |
| Control freq | 3 Hz | 5 Hz | 10 Hz | 50 Hz |
| Dexterous tasks | No | No | Limited | Yes |
| Cross-embodiment | 1 robot | 1 robot | 9 robots | 7+ robots |
| Open source | No | Yes | Yes | No |
pi-0 is the smallest model that achieves 50 Hz control AND dexterous manipulation. RT-2 is too large for real-time inference. OpenVLA runs faster but can't do dexterity (discrete tokens). Octo is small and fast but limited in task complexity. pi-0 threads the needle: small enough for 50 Hz, expressive enough for dexterity.
This is where pi-0 shines — tasks that no previous VLA could handle. The paper demonstrates three showcase tasks that each require 10+ minutes of continuous manipulation:
The robot fetches laundry from a dryer, packs it into a hamper, brings the hamper to a folding table, then folds each article of clothing. This requires handling deformable objects (fabric) in arbitrary initial configurations — a fundamentally different challenge from rigid object manipulation. The robot must plan a sequence of folds based on the garment type (shirt vs pants vs towel) and adapt to how the fabric lands after each fold.
Concrete data flow during folding: The wrist cameras see the fabric at 224x224. The VLM identifies the garment type and current configuration ("shirt, spread flat, arms extended"). The action expert generates a 1-second chunk of 50 actions, each 14-dimensional (bimanual: 2 arms x (6 joints + gripper)), that initiates the first fold. After execution, fresh images trigger re-planning. A full fold sequence: ~20-40 action chunks = 20-40 seconds per garment.
The robot must clear a table, sorting items into the correct bins: dishes go in a bus bin, trash goes in a trash bin. This requires recognizing novel objects and deciding their category — is this a used napkin (trash) or a plate (dish)?
Why VLM pre-training is essential here: The model must classify objects it has never manipulated before (a particular brand of takeout container, a specific type of utensil). The VLM's web pre-training provides this classification ability — it has seen millions of images of plates, cups, and napkins. The action expert just needs to execute "pick up and place in bin A vs bin B."
The robot assembles a cardboard box from a flat template — folding flaps, tucking tabs, creating a 3D structure from a 2D object. This requires precise bimanual coordination and understanding of the geometric constraints of folding.
Less dramatic than folding but equally important: the robot packs groceries into bags. This tests a different skill profile — rapid object classification (is this fragile? heavy? cold?) combined with sequential bin-packing planning (heavy items on the bottom, eggs on top). The VLM's web pre-training is essential here: it knows that eggs are fragile and canned goods are heavy without being told, because this knowledge is embedded in the billions of image-text pairs it was trained on.
Grocery packing also reveals a subtlety about action representation: the robot must vary its grasp force based on object fragility (gentle for bread, firm for cans), but the action space doesn't include explicit force dimensions. Instead, the model learns to use slow, careful motions as a proxy for gentle handling — slower approach = softer contact. This is an emergent behavior, not an engineered feature.
Through deployment experience, pi-0 reveals clear degradation patterns:
| Condition | Effect | Severity |
|---|---|---|
| Novel object (similar shape) | Works — VLM generalizes | Minimal |
| Novel object (very different shape) | Grasps fail — action expert hasn't seen this geometry | Moderate |
| Changed lighting | Minor performance drop — SigLIP is robust to illumination | Low |
| Changed camera angle | Significant degradation — spatial mapping breaks | High |
| Ambiguous instruction | Model chooses one valid interpretation — usually reasonable | Low-moderate |
| Completely new environment layout | Fails — workspace mapping is environment-specific | Critical |
The last failure mode — new environments — is exactly what pi-0.5 was designed to solve.
During a laundry-folding episode (bimanual, ~2 minutes per garment), the inference pipeline runs continuously:
| Step | Time | Frequency |
|---|---|---|
| 2 wrist cameras capture | 5ms | Every 20ms (50 Hz) |
| SigLIP: 2 images → 512 tokens | 6ms | Every 500ms (2 Hz re-plan) |
| Gemma forward pass | 8ms | Every 500ms |
| Flow matching: 10 steps | 20ms | Every 500ms |
| Execute 25 bimanual actions | 500ms | At 50 Hz |
| Total per re-plan cycle | ~540ms | |
The robot executes the first 25 actions open-loop while the next chunk is being computed. No idle time. For a 2-minute folding sequence, this means ~240 re-plan cycles and ~6000 individual motor commands.
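The cycle counts above can be checked with a few lines of arithmetic. The timings are copied from the table; the variable names are illustrative.

```python
CONTROL_HZ = 50
EXECUTED_PER_REPLAN = 25
EPISODE_SECONDS = 120  # the 2-minute folding sequence from the text

replan_period_s = EXECUTED_PER_REPLAN / CONTROL_HZ
replan_cycles = int(EPISODE_SECONDS / replan_period_s)
motor_commands = EPISODE_SECONDS * CONTROL_HZ

# Per-cycle compute, in milliseconds, from the table above.
compute_ms = {
    "camera capture": 5,
    "SigLIP": 6,
    "Gemma forward pass": 8,
    "flow matching (10 steps)": 20,
}
total_compute_ms = sum(compute_ms.values())
compute_hidden = total_compute_ms <= replan_period_s * 1000

print(replan_cycles, motor_commands, total_compute_ms, compute_hidden)
```

The compute for the next chunk takes well under the 500 ms spent executing the current one, which is why it can be overlapped with execution.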
pi-0 is evaluated in three settings: out-of-the-box (zero-shot), language-conditioned, and fine-tuned to new tasks.
Without any task-specific fine-tuning, pi-0 outperforms all baselines (OpenVLA, Octo, pi-0-small) across every task. Even a version trained for only 160K steps (matching the baselines' training budget) still wins, showing the advantage is architectural, not just from more training.
What actually happens on the robot at runtime:
Total latency from observation to first action: ~40ms. Effective control loop: closed at 2 Hz (re-plan every 500ms), but actions execute at 50 Hz within each chunk. This pipelined approach means the robot never waits for computation — there's always an action chunk ready to execute.
When given intermediate language commands from a human expert ("pick up the red plate, put it in the bin"), pi-0 significantly outperforms pi-0-small (which lacks VLM pre-training). This confirms that the VLM backbone's language understanding directly translates to better instruction following.
pi-0 can be efficiently fine-tuned to entirely new tasks not seen during pre-training. Even "hard" tasks (like paper towel replacement — no similar task in pre-training) achieve strong performance with moderate amounts of fine-tuning data.
How much data does post-training actually need? The paper shows a clear data-efficiency curve:
| Post-training Data | Task | Success Rate |
|---|---|---|
| 0 demos (zero-shot) | Simple pick-and-place | ~65% (from pre-training alone) |
| 50 demos (~2 hours) | Simple pick-and-place | ~85% |
| 200 demos (~8 hours) | Laundry folding | ~60% |
| 500 demos (~20 hours) | Laundry folding | ~80% |
| 1000 demos (~40 hours) | Laundry folding | ~85% |
The key insight: pre-training makes post-training far more data-efficient. Without pre-training, 200 folding demos give ~20% success. With pre-training, the same 200 demos give ~60%. Returns then diminish as more demos are added (~80% at 500, ~85% at 1000). The pre-trained model already knows what fabric looks like, how to plan grasps, and how to maintain smooth trajectories. Post-training just needs to teach the specific fold sequences.
pi-0 is the foundation that spawned an entire family of models from Physical Intelligence:
To understand why pi-0 works, it helps to ask what happens if you remove or change each component:
| Change | What Happens | Why |
|---|---|---|
| Remove VLM pre-training | Language following breaks. Novel object recognition fails. Task success drops ~30 pp. | No visual-semantic grounding. Model must learn "what is a cup" from robot data alone. |
| Replace flow matching with DDPM | Inference becomes 10x slower. Must reduce to 5 Hz control. Dexterous tasks degrade. | DDPM needs 50-100 steps vs flow matching's 10. Can't meet 50 Hz budget. |
| Remove action expert (single model) | VLM's language understanding degrades after 50K steps. Catastrophic forgetting. | Action-specific gradients corrupt VLM weights optimized for language. |
| Single-step prediction (H=1) | Jerky motions. Fabric folding impossible. Temporal coherence lost. | No ability to plan smooth trajectories longer than 20ms. |
| Remove cross-embodiment data | Performance on trained tasks unchanged. But fine-tuning to new robots requires more data. | Cross-embodiment provides general manipulation priors, not task-specific skills. |
| Replace PaliGemma 3B with 7B VLM | Better language understanding, but inference too slow for 50 Hz. Must drop to 10 Hz. | Larger model = more FLOPs per forward pass. Real-time constraint is hard. |
This table reveals that pi-0's design is tightly constrained. Every choice is load-bearing. The 50 Hz real-time requirement alone eliminates most alternatives.
Running pi-0 on a robot requires careful memory management:
| Component | GPU Memory | Notes |
|---|---|---|
| SigLIP encoder (400M params) | ~1.6 GB (FP16) | Frozen, loaded once |
| Gemma backbone (2.4B params) | ~4.8 GB (FP16) | KV-cache adds ~0.5 GB |
| Action expert (600M params) | ~1.2 GB (FP16) | Active during denoising |
| Action tokens (50 x 2048) | ~0.4 GB | 10 copies for denoising steps |
| Total | ~8.5 GB | Fits on A100 (80 GB) with room |
The model comfortably fits on a single A100. For edge deployment, INT8 quantization halves the memory to ~4.3 GB, fitting on an NVIDIA Jetson Orin (64 GB shared). However, quantization slightly degrades action precision — acceptable for coarse tasks but problematic for fine dexterity. The paper doesn't quantize for dexterous task evaluations.
With hindsight, pi-0 had several limitations that became clear in deployment:
Looking back, the key decisions that made pi-0 work — and that subsequent models inherited or modified:
| Decision | Alternative Considered | Why pi-0's Choice Won |
|---|---|---|
| Flow matching (not DDPM) | Diffusion Policy uses DDPM | 10x fewer steps = real-time capable |
| Action expert (separate weights) | Single unified model | Prevents catastrophic forgetting |
| Action chunks (H=50) | Single-step prediction | Temporal coherence for smooth motion |
| Joint angles (not end-effector) | Cartesian control | Full expressivity for dexterous tasks |
| PaliGemma 3B (not 7B+) | Larger VLMs | Real-time inference budget |
| Freeze VLM in post-training | Full fine-tuning | Retains generalization |
The real-time requirement of 50 Hz control (20ms per inference cycle) is the single most constraining design decision in pi-0. Here's what it rules out at inference time:
These constraints are why pi-0's architecture is so carefully optimized. Every millisecond matters. The architecture isn't just a good design; among the alternatives compared here, it's the one that meets the real-time budget while maintaining VLM-level understanding.
| Model | Action Repr. | Key Difference from pi-0 |
|---|---|---|
| RT-2 | Discrete tokens (256 bins) | Coarse actions, no dexterity |
| OpenVLA | Discrete tokens | Open-source but limited precision |
| Octo | Diffusion | Diffusion head but smaller model |
| pi-0 | Flow matching | VLM + flow matching + action expert |