Physical Intelligence + Stanford, 2025

MEM: Multi-Scale Embodied Memory

Vision-language-action models (VLAs) are reactive: they see the current frame and act, with no memory of what happened before. MEM adds two memory systems: short-term video memory for recent context and long-term language memory for semantic history. Together they enable 15-minute tasks and in-context adaptation.

Prerequisites: VLA basics + Transformer concepts
9 chapters, 3+ simulations

Chapter 0: The Memory Problem

Current VLAs are amnesiac. They take in the current camera frame, the task instruction, and proprioceptive state. They output the next action. Then they forget everything. The next timestep, they start from scratch with a new frame.

This works for simple pick-and-place tasks where all information is visible in the current frame. But consider a real kitchen task: "Make coffee and bring it to the desk." The robot needs to remember which drawer it already checked for coffee pods (short-term), and that the human mentioned "the blue mug, not the red one" two minutes ago (long-term).

Without memory, the robot is like a person with complete anterograde amnesia performing a 15-minute task. Every few hundred milliseconds, they lose all context and must re-derive everything from what they can currently see. Simple tasks survive this handicap. Complex, multi-step tasks become impossible.

The observation window problem: Current VLAs process 1-2 frames (0.04-0.08 seconds of history at 25 Hz). A 15-minute task spans 22,500 frames at 25 Hz, so the model sees less than 0.01% of the task history at any moment (about 0.004% with one frame, 0.009% with two). It's like reading a novel one word at a time while forgetting all previous words.
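The arithmetic behind the observation window can be checked directly. This is a minimal sketch that assumes the 25 Hz rate and 15-minute task length quoted above; the function name `history_fraction` is illustrative only.

```python
def history_fraction(frames_seen: int, task_minutes: float, hz: float) -> float:
    """Fraction of a task's frames that the model sees at one timestep."""
    total_frames = task_minutes * 60 * hz
    return frames_seen / total_frames


total = 15 * 60 * 25  # 22,500 frames in a 15-minute task at 25 Hz
for n in (1, 2):
    pct = 100 * history_fraction(n, task_minutes=15, hz=25)
    print(f"{n} frame(s) out of {total}: {pct:.4f}% of the task history")
```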
Simulation: Memory vs No Memory

A robot searches cabinets for an object. Without memory, it re-checks the same cabinets.
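The cabinet-search scenario is easy to reproduce as a toy experiment. The sketch below is not taken from MEM; it assumes the robot picks uniformly at random among the cabinets it is willing to check, and the function `search` and its parameters are hypothetical.

```python
import random


def search(num_cabinets: int, target: int, use_memory: bool, seed: int = 0) -> int:
    """Return how many cabinet checks it takes to find the target cabinet."""
    if not 0 <= target < num_cabinets:
        raise ValueError("target must be a valid cabinet index")
    rng = random.Random(seed)
    checked = set()  # only consulted when use_memory is True
    checks = 0
    while True:
        # With memory, already-checked cabinets are excluded from the choice.
        candidates = [c for c in range(num_cabinets)
                      if not (use_memory and c in checked)]
        choice = rng.choice(candidates)
        checks += 1
        if choice == target:
            return checks
        checked.add(choice)


trials = 500
for use_memory in (True, False):
    avg = sum(search(8, 3, use_memory, seed=s) for s in range(trials)) / trials
    print(f"use_memory={use_memory}: {avg:.2f} checks on average")
```

With memory, the number of checks can never exceed the number of cabinets. Without it, the robot may revisit the same cabinet any number of times.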

Why do current VLAs fail at long-horizon tasks?

Chapter 1: Short-Term Memory

MEM's first memory system captures recent visual context. Instead of processing only the current frame, the model compresses the last few seconds of video into a compact representation.

How? Through a video encoder. MEM takes the most recent N frames (e.g., the last 5-10 seconds of footage at a subsampled rate) and processes them through a temporal video encoder. This encoder is designed to compress video efficiently -- it captures motion patterns, object movements, and scene changes that a single frame would miss.
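One simple way to hold "the last 5-10 seconds at a subsampled rate" is a bounded buffer. The sketch below is an assumption about how such a buffer could look, not MEM's implementation; the class `FrameBuffer` and the 5 Hz keep rate are hypothetical.

```python
from collections import deque


class FrameBuffer:
    """Keep the most recent `window_s` seconds of frames at a reduced rate."""

    def __init__(self, camera_hz: int = 25, keep_hz: int = 5, window_s: int = 10):
        self.stride = max(1, camera_hz // keep_hz)   # keep every `stride`-th frame
        self.frames = deque(maxlen=keep_hz * window_s)
        self.count = 0

    def push(self, frame) -> None:
        if self.count % self.stride == 0:
            self.frames.append(frame)                # the oldest frame drops out automatically
        self.count += 1

    def clip(self) -> list:
        """Frames to hand to the video encoder, oldest first."""
        return list(self.frames)


buf = FrameBuffer()
for i in range(1000):       # 40 seconds of a 25 Hz stream; integers stand in for images
    buf.push(i)
print(len(buf.clip()), buf.clip()[0], buf.clip()[-1])
```

The buffer never grows past `keep_hz * window_s` entries, however long the robot has been running.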

The key design choice: the video encoder produces a fixed-size output regardless of how many frames it processes. 10 frames or 100 frames get compressed to the same number of tokens. This keeps the number of memory tokens the policy must attend to constant as the video history grows, even though the encoder itself does more work on longer clips.
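The fixed-size property can be illustrated with a deliberately simple stand-in. The source does not say how MEM's encoder achieves it, so the sketch below uses temporal average pooling into a fixed number of bins purely as an assumption; `pool_to_fixed_tokens` is a hypothetical name.

```python
def pool_to_fixed_tokens(frame_feats: list, num_tokens: int) -> list:
    """Average-pool T per-frame feature vectors into `num_tokens` temporal bins."""
    t = len(frame_feats)
    if t < num_tokens:
        raise ValueError("need at least one frame per output token")
    dim = len(frame_feats[0])
    pooled = []
    for k in range(num_tokens):
        # Bin k covers frames [start, end); bins are non-empty because t >= num_tokens.
        start, end = k * t // num_tokens, (k + 1) * t // num_tokens
        chunk = frame_feats[start:end]
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


for num_frames in (10, 100):
    feats = [[float(i), float(-i)] for i in range(num_frames)]  # toy 2-D features
    tokens = pool_to_fixed_tokens(feats, num_tokens=5)
    print(num_frames, "frames ->", len(tokens), "tokens")
```

A learned encoder would replace the averaging with something far more expressive, but the interface is the same: variable-length clip in, fixed token count out.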

What short-term memory captures: Object trajectories (where was the cup moving?), manipulation progress (how far along is the fold?), immediate dynamics (is the object slipping?), and scene changes (did someone place a new object on the table?). All of these are invisible in a single frame but obvious in a short video clip.

Think of short-term memory as peripheral vision through time. You're not consciously cataloging every detail, but you have a rich sense of what just happened -- enough to maintain continuity and react to ongoing dynamics.

Why does MEM's video encoder produce a fixed-size output regardless of input length?

Chapter 2: Long-Term Memory

Short-term video memory captures the last few seconds. But a 15-minute task needs context from minutes ago. You can't process 15 minutes of video: at 25 Hz that's 22,500 frames, far more than a transformer can attend over at every control step.

MEM's solution: language summaries. Periodically (every 10-30 seconds), a separate VLM summarizes what just happened into a natural language description: "Opened the left cabinet. Found coffee pods on the second shelf. Took one pod. Closed cabinet." These summaries are accumulated over the task lifetime.

Why language? Because language is an extraordinarily efficient compression of visual experience. The sentence "Opened the left cabinet and found coffee pods" compresses thousands of video frames into a handful of tokens. And the VLA already understands language -- it was pre-trained on billions of text tokens. So language summaries slot naturally into the model's existing input format.

The compression is staggering: 30 seconds of video at 25 Hz = 750 frames. As raw pixel tokens, that might be 750 x 256 = 192,000 tokens. As a language summary, it's ~20 tokens. That's a compression ratio of roughly 10,000x (192,000 / 20 = 9,600). Of course, you lose fine-grained detail. But for long-term context, you don't need pixel-level detail; you need semantic facts like "the cup is in the left cabinet."
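The ratio depends on two rough guesses, tokens per frame and tokens per summary, so it is worth seeing how they enter the calculation. A minimal sketch using the figures quoted above; `compression_ratio` is an illustrative helper.

```python
def compression_ratio(seconds: float, hz: float,
                      tokens_per_frame: int, summary_tokens: int) -> float:
    """Video tokens divided by summary tokens for one summarized segment."""
    video_tokens = seconds * hz * tokens_per_frame
    return video_tokens / summary_tokens


# 30 s at 25 Hz, 256 tokens per frame, ~20-token summary
print(compression_ratio(30, 25, 256, 20))  # 9600.0
```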

The summaries are generated by a frozen VLM (not the robot's own policy). This VLM watches the video stream and periodically produces summaries. The summaries are prepended to the robot's instruction prompt, creating a growing narrative of the task history.
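The accumulate-and-prepend pattern can be sketched in a few lines. The frozen VLM is not modeled here: the strings passed to `add` stand in for its output, and the `History:` / `Task:` prompt layout is an assumption rather than MEM's actual format.

```python
from dataclasses import dataclass, field


@dataclass
class LongTermMemory:
    """Accumulates text summaries and prepends them to the task instruction."""
    summaries: list = field(default_factory=list)

    def add(self, summary: str) -> None:
        self.summaries.append(summary.strip())

    def build_prompt(self, instruction: str) -> str:
        history = " ".join(self.summaries)
        if not history:
            return f"Task: {instruction}"
        return f"History: {history}\nTask: {instruction}"


memory = LongTermMemory()
memory.add("Opened the left cabinet. Found coffee pods on the second shelf.")
memory.add("Took one pod. Closed cabinet.")
print(memory.build_prompt("Make coffee and bring it to the desk."))
```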

Why does MEM use language for long-term memory instead of compressed video features?

Chapter 3: Multi-Scale Architecture

MEM combines both memory types into a single architecture. The input to the VLA at each timestep becomes:

Long-term memory: Language summaries of past events (minutes ago). ~20-100 tokens covering the entire task history.
Task instruction: "Make coffee and bring it to the desk." The original user command.
Short-term memory: Video encoder output from last 5-10 seconds. Fixed-size compressed tokens.
Current observation: Current camera frame + proprioceptive state. Standard VLA input.
Action output: Next action chunk (flow matching or autoregressive).
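The input layout above can be written down as a small data structure. This is a sketch under stated assumptions: the class `PolicyInput`, its field names, and the use of word counts as a crude stand-in for token counts are all hypothetical, and the token sizes in the example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PolicyInput:
    summaries: list      # long-term memory: text summaries
    instruction: str     # original user command
    video_tokens: list   # short-term memory: fixed-size encoder output
    frame_tokens: list   # current camera frame
    proprio: list        # proprioceptive state

    def segments(self) -> list:
        """Order and approximate size of each block in the policy's input."""
        text_len = sum(len(s.split()) for s in self.summaries)  # words as a token proxy
        return [
            ("long_term", text_len),
            ("instruction", len(self.instruction.split())),
            ("short_term", len(self.video_tokens)),
            ("current_frame", len(self.frame_tokens)),
            ("proprio", len(self.proprio)),
        ]


inp = PolicyInput(
    summaries=["Opened the left cabinet.", "Took one pod."],
    instruction="Make coffee and bring it to the desk.",
    video_tokens=[0] * 64,     # illustrative size only
    frame_tokens=[0] * 256,    # illustrative size only
    proprio=[0.0] * 7,
)
for name, size in inp.segments():
    print(f"{name:>13}: {size}")
```

Only the `long_term` entry changes size as the task proceeds; the other blocks stay the same length at every timestep.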

The key insight is that these memories operate at different time scales:

| Memory type | Time scale | Resolution | Format |
| --- | --- | --- | --- |
| Current frame | ~40 ms | Pixel-level | Image tokens |
| Short-term (video) | 5-10 sec | Motion-level | Compressed video tokens |
| Long-term (language) | Minutes | Semantic-level | Language tokens |

The multi-scale design mirrors human memory: You see the current scene in sharp detail (iconic memory, ~ms). You have a rich sense of the last few seconds (working memory, ~seconds). And you remember key facts from earlier (episodic memory, ~minutes). Each scale trades resolution for temporal reach -- exactly what MEM does.
Simulation: Memory at Multiple Time Scales

A time slider shows what each memory system "sees" at different points in a 15-minute task.

Why does MEM use different memory formats at different time scales?

Chapter 4: Handling Partial Observability

Many real tasks are partially observable -- the robot can't see everything it needs from its current viewpoint. A cup might be hidden behind a box. The contents of a closed drawer are invisible. The human's instruction from 30 seconds ago is no longer on screen.

Without memory, partial observability is devastating. The robot has no way to know what's inside a drawer it opened and closed 10 seconds ago. It must either keep the drawer open (impractical) or re-open it every time it needs to know the contents (wasteful and slow).

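MEM's dual memory system handles both forms of partial observability: short-term video memory covers what was visible moments ago (the cup that just moved behind a box), and long-term language memory covers facts observed earlier in the task (what the closed drawer contains, what the human said).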

This is exactly the POMDP problem: In a partially observable Markov decision process, the agent maintains a belief state that integrates evidence over time. MEM's memory plays the role of that belief state, as a learned summary of history rather than an explicit probability distribution: short-term video memory is the recent evidence, long-term language memory is the accumulated belief. The VLA's policy reads from this memory instead of relying on the raw observation alone.

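For readers new to POMDPs, the textbook belief update looks like the sketch below. MEM does not compute explicit probabilities, so this only illustrates the idea the analogy refers to. The function `update_belief`, the `miss_rate` parameter, and the assumption of no false positives are all choices made for this example.

```python
def update_belief(belief: dict, looked_in: str, found: bool,
                  miss_rate: float = 0.1) -> dict:
    """One Bayes update of P(object location) after looking in one place.

    miss_rate is the chance of not seeing the object even though it is there.
    False positives are assumed impossible.
    """
    posterior = {}
    for place, prior in belief.items():
        if place == looked_in:
            likelihood = (1 - miss_rate) if found else miss_rate
        else:
            likelihood = 0.0 if found else 1.0
        posterior[place] = prior * likelihood
    total = sum(posterior.values())
    if total == 0:
        raise ValueError("observation is impossible under the prior")
    return {place: p / total for place, p in posterior.items()}


belief = {"left": 1 / 3, "middle": 1 / 3, "right": 1 / 3}
belief = update_belief(belief, looked_in="left", found=False)
print({place: round(p, 3) for place, p in belief.items()})
```

After one unsuccessful look in the left drawer, most of the probability mass moves to the other two. A memoryless policy has no way to carry this shift forward to the next timestep.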
How does MEM's long-term memory help with partial observability?

Chapter 5: 15-Minute Tasks

The headline result: MEM enables robots to perform tasks lasting up to 15 minutes with dozens of subtasks. Previous VLAs maxed out at 1-2 minutes on structured multi-step tasks that require tracking progress, so MEM extends that horizon by roughly an order of magnitude.

Consider the full task "Clean the kitchen counter": (1) pick up dirty dishes, (2) load the dishwasher, (3) wipe the counter with a cloth, (4) put away leftover food in the fridge, (5) organize remaining items. Each subtask is a manipulation challenge in itself. But the sequencing -- knowing what's done, what's next, and what to skip -- requires memory that spans the entire 15 minutes.

Without memory, a VLA might complete subtask 1, then start subtask 3, then redo subtask 1 because it forgot it already did it. With MEM's long-term memory, the language summary accumulates: "Loaded 3 dishes into dishwasher. Wiped left half of counter." The robot reads this narrative and knows exactly where it left off.

Previous longest VLA tasks: pi-0 demonstrated 10-minute tasks (laundry folding), but these were single continuous manipulation sequences, not multi-step plans requiring memory of completed subtasks. MEM enables truly compositional long-horizon behavior where the robot must track progress across many distinct subtasks.
What specific capability does long-term memory add for 15-minute multi-step tasks?

Chapter 6: Results

MEM is evaluated on tasks that specifically require memory -- tasks where the information needed to act correctly is not available in the current frame.

Memory-dependent manipulation

Tasks where the robot must remember the location of objects it previously saw but that are now hidden. Example: "Put the red block in the box" where the red block was visible 30 seconds ago but is now behind an occluder. Without memory: random search. With MEM: direct retrieval from the remembered location.

Multi-step sequential tasks

Tasks with 5-10 sequential subtasks spanning 10-15 minutes. MEM significantly outperforms memoryless baselines, with the gap widening as task length increases. On 2-minute tasks, the memoryless baseline is only slightly worse. On 15-minute tasks, it fails almost completely while MEM maintains reasonable success rates.

Ablations: both memory types matter

| Configuration | Short tasks (2 min) | Long tasks (15 min) |
| --- | --- | --- |
| No memory (baseline) | Good | Fails |
| Short-term only | Better | Poor (forgets early steps) |
| Long-term only | Good (less needed) | Moderate (misses dynamics) |
| Both (MEM) | Best | Best by far |

The ablation tells the story: Short-term memory alone helps with dynamic tracking but loses context over minutes. Long-term memory alone captures the big picture but misses recent dynamics. Together, they cover the full temporal range, and the combination is substantially better than either alone, especially on long tasks.
What does the ablation study reveal about the two memory types?

Chapter 7: In-Context Adaptation

Memory enables something unexpected: in-context adaptation. If the robot can remember what happened earlier in the episode, it can learn from its own mistakes within a single task execution.

Consider a robot trying to open a sticky drawer. On the first attempt, it pulls with normal force and fails -- the drawer doesn't open. Without memory, it tries the same thing again (and again, and again). With MEM, the long-term memory records "attempted to open drawer with normal force -- failed." On the next attempt, the VLA reads this context and can adjust its strategy -- pulling harder or trying a different grip.

This is not explicit learning or optimization. The model doesn't update its weights. Instead, it's in-context learning -- the same phenomenon that lets language models improve their answers when given examples in the prompt. The memory acts as additional prompt context that informs future decisions.
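The "no weight update" point can be made concrete with a toy decision rule whose code never changes; only the memory passed to it does. In MEM the VLA does this implicitly by conditioning on the summary text, so the hand-written rule, the function `choose_strategy`, and the `": failed"` log format below are all stand-ins invented for illustration.

```python
def choose_strategy(strategies: list, memory: list) -> str:
    """Pick the first strategy that memory does not record as failed."""
    suffix = ": failed"
    failed = {line[:-len(suffix)] for line in memory if line.endswith(suffix)}
    for strategy in strategies:
        if strategy not in failed:
            return strategy
    return strategies[-1]  # everything has failed before; fall back to the last option


strategies = ["pull with normal force", "pull harder", "try a different grip"]
memory = []

first = choose_strategy(strategies, memory)
memory.append(f"{first}: failed")          # the summary records the failed attempt
second = choose_strategy(strategies, memory)
print(first, "->", second)
```

With an empty memory list, every call returns the same first strategy. That is the repeated-failure loop of the memoryless robot.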

The parallel to few-shot prompting: In language models, providing examples in the prompt dramatically improves performance on novel tasks. MEM's memory summaries act as "examples from the robot's own experience" -- context that helps the VLA make better decisions without any weight updates. This is few-shot learning from self-experience.

The paper shows that MEM-equipped robots exhibit qualitatively different behavior on repeated attempts: they try alternative approaches, avoid previously failed strategies, and adapt to unexpected object properties. Memoryless robots show no such adaptation.

How does MEM enable in-context adaptation without updating model weights?

Chapter 8: Connections

MEM sits at the intersection of robot learning and cognitive science. The dual memory system mirrors a long tradition of research on human memory architectures.

| MEM component | Cognitive parallel | ML parallel |
| --- | --- | --- |
| Current frame | Iconic memory (~250ms) | Standard VLA input |
| Short-term video | Working memory (~seconds) | RNN hidden state / frame stacking |
| Long-term language | Episodic memory (~hours) | Retrieval-augmented generation (RAG) |

In ML, the closest analogue is retrieval-augmented generation (RAG). RAG stores documents in a database and retrieves relevant ones at query time. MEM stores experience summaries and retrieves all of them at action time. The difference: RAG uses embedding-based retrieval, while MEM simply concatenates all summaries (the total is short enough to fit in context).
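The contrast between the two strategies fits in a few lines. In this sketch, word overlap stands in for the embedding similarity a real RAG system would use, and both function names are hypothetical.

```python
def words(text: str) -> set:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip(".,!?").lower() for w in text.split()}


def concat_all(summaries: list) -> str:
    """Concatenation as described above: every summary goes into the context."""
    return " ".join(summaries)


def retrieve_top_k(summaries: list, query: str, k: int = 1) -> list:
    """RAG-style stand-in: rank summaries by word overlap with the query."""
    q = words(query)
    return sorted(summaries, key=lambda s: len(q & words(s)), reverse=True)[:k]


summaries = [
    "Opened the left cabinet and found coffee pods.",
    "Wiped the left half of the counter.",
]
print(concat_all(summaries))
print(retrieve_top_k(summaries, "where are the coffee pods", k=1))
```

Concatenation never drops a relevant summary, but its context grows with every summary added. Retrieval keeps the context bounded at the cost of possibly missing something.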

Simulation: Memory Timeline Visualization

A 15-minute task timeline showing what each memory layer contains at any moment. Green events are remembered, red are forgotten.

"The palest ink is better than the best memory."
— Chinese proverb, capturing why explicit memory representations outperform implicit ones