Physical Intelligence + Stanford, 2025

MEM: Multi-Scale Embodied Memory

VLAs are reactive -- they see the current frame and act, with no memory of what happened before. MEM adds two memory systems: short-term video memory for recent context and long-term language memory for semantic history. Together they enable 15-minute tasks and in-context adaptation.

Prerequisites: VLA basics + Transformer concepts

Chapter 0: The Memory Problem

Current VLAs are amnesiac. They take in the current camera frame, the task instruction, and proprioceptive state. They output the next action. Then they forget everything. At the next timestep, they start from scratch with a new frame.

This works for simple pick-and-place tasks where all information is visible in the current frame. But consider a real kitchen task: "Make coffee and bring it to the desk." The robot needs to remember which drawer it already checked for coffee pods (short-term), and that the human mentioned "the blue mug, not the red one" two minutes ago (long-term).

Without memory, the robot is like a person with complete anterograde amnesia performing a 15-minute task. Every few hundred milliseconds, they lose all context and must re-derive everything from what they can currently see. Simple tasks survive this handicap. Complex, multi-step tasks become impossible.

The observation window problem: Current VLAs process 1-2 frames (0.04-0.08 seconds of history). A 15-minute task spans 22,500 frames at 25 Hz. The model sees less than 0.01% of the task history at any moment (about 0.004% with a single frame). It's like reading a novel one word at a time while forgetting all previous words.
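The figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the 25 Hz rate and 15-minute duration come from the text, and the helper name `window_stats` is made up for illustration.

```python
# Back-of-the-envelope check of the observation-window numbers in the text.
CONTROL_HZ = 25      # frame rate quoted in the text
TASK_MINUTES = 15    # task duration quoted in the text

total_frames = TASK_MINUTES * 60 * CONTROL_HZ  # 22,500 frames


def window_stats(frames_seen):
    """Return (seconds of history, percent of the task) for a frame window."""
    seconds = frames_seen / CONTROL_HZ
    percent = 100.0 * frames_seen / total_frames
    return seconds, percent


for n in (1, 2):
    seconds, percent = window_stats(n)
    print(f"{n} frame(s): {seconds:.2f} s of history, {percent:.4f}% of the task")
```

A single frame covers about 0.0044% of the task and two frames about 0.0089%, so the visible fraction stays below one hundredth of a percent either way.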

[Interactive simulation: Memory vs No Memory. A robot searches cabinets for an object; without memory, it re-checks the same cabinets.]

Why do current VLAs fail at long-horizon tasks?

They have no memory -- each timestep only sees the current frame, so they can't track what happened before, which steps are complete, or what information was gathered earlier
Their action space is too small
The transformer is too slow

Chapter 1: Short-Term Memory

MEM's first memory system captures recent visual context. Instead of processing only the current frame, the model compresses the last few seconds of video into a compact representation.

How? Through a video encoder. MEM takes the most recent N frames (e.g., the last 5-10 seconds of footage at a subsampled rate) and processes them through a temporal video encoder. This encoder is designed to compress video efficiently -- it captures motion patterns, object movements, and scene changes that a single frame would miss.

The key design choice: the video encoder produces a fixed-size output regardless of how many frames it processes. 10 frames or 100 frames get compressed to the same number of tokens. This keeps the VLA's context length, and therefore the policy's compute per step, constant as the video history grows.
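The text does not say how the encoder achieves a fixed-size output. One common way to get that property is cross-attention pooling, where a fixed set of K query vectors attends over however many frame features are available. The sketch below assumes that mechanism purely for illustration; `compress_video`, the sizes K and D, and the random features are all hypothetical and are not taken from MEM.

```python
import math
import random


def compress_video(frame_features, queries):
    """Cross-attention pooling over a variable number of frames.

    frame_features: list of T feature vectors (T may vary), each of length D.
    queries:        list of K query vectors of length D (learned in a real model).
    Returns K vectors of length D, so the output size depends on K, not on T.
    """
    dim = len(queries[0])
    pooled = []
    for q in queries:
        # Scaled dot-product score between this query and every frame.
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim)
                  for f in frame_features]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # softmax over frames
        # Weighted average of the frame features under those weights.
        pooled.append([sum(w * f[i] for w, f in zip(weights, frame_features))
                       for i in range(dim)])
    return pooled


random.seed(0)
K, D = 4, 8  # hypothetical sizes, chosen only for the demo
queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]


def random_clip(num_frames):
    """Stand-in for per-frame features from a vision backbone."""
    return [[random.gauss(0, 1) for _ in range(D)] for _ in range(num_frames)]


short = compress_video(random_clip(10), queries)
long = compress_video(random_clip(100), queries)
print(len(short), len(short[0]))  # K vectors of length D
print(len(long), len(long[0]))    # same shape despite 10x more frames
```

In this sketch the pooling step itself still touches every frame, so its cost grows with the clip length; what stays constant is the number of tokens handed to the downstream policy.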

What short-term memory captures: Object trajectories (where was the cup moving?), manipulation progress (how far along is the fold?), immediate dynamics (is the object slipping?), and scene changes (did someone place a new object on the table?). All of these are invisible in a single frame but obvious in a short video clip.

Think of short-term memory as peripheral vision through time. You're not consciously cataloging every detail, but you have a rich sense of what just happened -- enough to maintain continuity and react to ongoing dynamics.

Why does MEM's video encoder produce a fixed-size output regardless of input length?

To simplify training data
To keep computational cost constant as the history window grows -- the VLA's context length doesn't explode with longer video histories
Fixed outputs are easier to store

Chapter 2: Long-Term Memory

Short-term video memory captures the last few seconds. But a 15-minute task needs context from minutes ago. You can't process 15 minutes of video directly -- that's 22,500 frames at 25 Hz, far more than a transformer can attend over at every control step.

MEM's solution: language summaries. Periodically (every 10-30 seconds), a separate VLM summarizes what just happened into a natural language description: "Opened the left cabinet. Found coffee pods on the second shelf. Took one pod. Closed cabinet." These summaries are accumulated over the task lifetime.

Why language? Because language is an extraordinarily efficient compression of visual experience. The sentence "Opened the left cabinet and found coffee pods" compresses thousands of video frames into a handful of tokens. And the VLA already understands language -- it was pre-trained on billions of text tokens. So language summaries slot naturally into the model's existing input format.

The compression is staggering: 30 seconds of video at 25 Hz = 750 frames. As raw pixel tokens, that might be 750 x 256 = 192,000 tokens. As a language summary, it's ~20 tokens. That's a 10,000x compression ratio. Of course, you lose fine-grained detail. But for long-term context, you don't need pixel-level detail -- you need semantic facts like "the cup is in the left cabinet."
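The arithmetic behind that ratio, using the text's illustrative figures (256 tokens per frame and a ~20-token summary are the text's round numbers, not measured values):

```python
# Reproduce the compression estimate from the text.
FPS = 25                 # frames per second
WINDOW_SECONDS = 30      # length of the summarized window
TOKENS_PER_FRAME = 256   # illustrative figure from the text
SUMMARY_TOKENS = 20      # approximate summary length from the text

frames = FPS * WINDOW_SECONDS
video_tokens = frames * TOKENS_PER_FRAME
ratio = video_tokens / SUMMARY_TOKENS

print(f"{frames} frames -> {video_tokens:,} video tokens")
print(f"compression ratio: about {ratio:,.0f}x")
```

The exact quotient is 9,600, which the text rounds to 10,000x.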

The summaries are generated by a frozen VLM (not the robot's own policy). This VLM watches the video stream and periodically produces summaries. The summaries are prepended to the robot's instruction prompt, creating a growing narrative of the task history.
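A minimal sketch of that loop, assuming a fixed summarization interval and a summarizer that can be called on a buffer of frames. The class `LongTermMemory`, the `summarize` callable, and the string stand-ins for frames are hypothetical; in a real system the callable would wrap the frozen VLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LongTermMemory:
    summarize: Callable[[List[str]], str]  # stand-in for the frozen VLM
    interval_s: float = 20.0               # assumed; text says every 10-30 s
    summaries: List[str] = field(default_factory=list)
    _buffer: List[str] = field(default_factory=list)
    _last_summary_t: float = 0.0

    def observe(self, t: float, frame: str) -> None:
        """Buffer a frame; emit a summary once the interval has elapsed."""
        self._buffer.append(frame)
        if t - self._last_summary_t >= self.interval_s:
            self.summaries.append(self.summarize(self._buffer))
            self._buffer = []
            self._last_summary_t = t

    def build_prompt(self, instruction: str) -> str:
        """Prepend the accumulated summaries to the task instruction."""
        history = " ".join(self.summaries)
        return f"{history} {instruction}".strip()


def fake_vlm(frames: List[str]) -> str:
    # Placeholder summarizer: a real VLM would describe what happened.
    return f"Summary of {len(frames)} frames."


memory = LongTermMemory(summarize=fake_vlm)
for t in range(1, 61):  # one minute of frames at 1 Hz, for brevity
    memory.observe(float(t), f"frame@{t}s")

instruction = "Make coffee and bring it to the desk."
prompt = memory.build_prompt(instruction)
print(prompt)
```

The policy never sees the buffered frames here, only the text that the summarizer produced from them.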

Why does MEM use language for long-term memory instead of compressed video features?

Language achieves ~10,000x compression over raw video, captures semantic facts that matter for planning, and slots into the VLA's existing language understanding without architectural changes
Language models are faster than video models
Video features are too noisy

Chapter 3: Multi-Scale Architecture

MEM combines both memory types into a single architecture. The input to the VLA at each timestep becomes:

Long-term memory

Language summaries of past events (minutes ago). ~20-100 tokens covering the entire task history.

↓

Task instruction

"Make coffee and bring it to the desk." The original user command.

↓

Short-term memory

Video encoder output from last 5-10 seconds. Fixed-size compressed tokens.

↓

Current observation

Current camera frame + proprioceptive state. Standard VLA input.

↓

Action output

Next action chunk (flow matching or autoregressive).
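The ordering of the four input segments can be sketched as a simple concatenation. Everything here is a simplification for illustration: whitespace splitting stands in for a real tokenizer, the `<vid…>`, `<img…>` and `<proprio>` strings are placeholder token IDs, and `assemble_context` is a hypothetical helper rather than part of MEM.

```python
def assemble_context(summaries, instruction, video_tokens, obs_tokens):
    """Concatenate the four input segments in the order shown above.

    Returns the flat token sequence and a per-segment token count.
    """
    segments = [
        ("long_term", " ".join(summaries).split()),  # language summaries
        ("instruction", instruction.split()),        # original user command
        ("short_term", list(video_tokens)),          # compressed video tokens
        ("observation", list(obs_tokens)),           # current frame + proprio
    ]
    sequence = [tok for _, toks in segments for tok in toks]
    layout = {name: len(toks) for name, toks in segments}
    return sequence, layout


sequence, layout = assemble_context(
    summaries=["Opened the left cabinet.", "Found coffee pods."],
    instruction="Make coffee and bring it to the desk.",
    video_tokens=[f"<vid{i}>" for i in range(4)],
    obs_tokens=["<img0>", "<img1>", "<proprio>"],
)
print(layout)
print(len(sequence), "tokens in total")
```

Only the long-term segment grows over the course of a task in this layout; the short-term and observation segments keep a fixed size.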

The key insight is that these memories operate at different time scales:

Memory type	Time scale	Resolution	Format
Current frame	~40 ms	Pixel-level	Image tokens
Short-term (video)	5-10 sec	Motion-level	Compressed video tokens
Long-term (language)	Minutes	Semantic-level	Language tokens

The multi-scale design mirrors human memory: You see the current scene in sharp detail (iconic memory, ~ms). You have a rich sense of the last few seconds (working memory, ~seconds). And you remember key facts from earlier (episodic memory, ~minutes). Each scale trades resolution for temporal reach -- exactly what MEM does.

[Interactive simulation: Memory at Multiple Time Scales. A time slider shows what each memory system "sees" at different points in a 15-minute task.]

Why does MEM use different memory formats at different time scales?

Each scale trades resolution for temporal reach: pixel-level detail for the present, motion-level for recent seconds, semantic-level for minutes -- matching the information needs at each scale
Different formats are easier to train
The VLA can only process one format at a time

Chapter 4: Handling Partial Observability

Many real tasks are partially observable -- the robot can't see everything it needs from its current viewpoint. A cup might be hidden behind a box. The contents of a closed drawer are invisible. The human's instruction from 30 seconds ago is no longer on screen.

Without memory, partial observability is devastating. The robot has no way to know what's inside a drawer it opened and closed 10 seconds ago. It must either keep the drawer open (impractical) or re-open it every time it needs to know the contents (wasteful and slow).

MEM's dual memory system handles both forms of partial observability:

Temporarily occluded objects: Short-term video memory tracks objects that were visible a few seconds ago but are currently hidden. "The ball rolled behind the box 2 seconds ago" is captured by the video encoder even though the current frame shows only the box.
Previously gathered information: Long-term language memory stores facts from earlier in the task. "The coffee pods are in the left cabinet, second shelf" persists in the language summary even after the cabinet is closed.

This is the POMDP setting: In a partially observable Markov decision process, the agent maintains a belief state that integrates evidence over time. MEM's memory plays the role of that belief state, in learned and approximate form -- short-term video memory holds the recent evidence, long-term language memory holds the accumulated facts. The VLA's policy conditions on this memory in addition to the raw current observation.
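For reference, the textbook belief update in a POMDP is given below. The notation is the standard one and is not taken from the MEM paper: $T$ is the transition model, $O$ the observation model, $b_t$ the belief at time $t$, and $\eta$ a normalizing constant.

```latex
b_{t+1}(s') \;=\; \eta \, O(o_{t+1} \mid s', a_t) \sum_{s} T(s' \mid s, a_t)\, b_t(s),
\qquad
\eta^{-1} \;=\; \sum_{s'} O(o_{t+1} \mid s', a_t) \sum_{s} T(s' \mid s, a_t)\, b_t(s)
```

MEM, as described here, does not compute this update explicitly. The video tokens and language summaries act as a compact record of past observations that the policy can condition on in place of an exact belief.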

How does MEM's long-term memory help with partial observability?

Language summaries store facts about previously observed but currently invisible information, like the contents of a closed drawer, so the robot doesn't need to re-observe to know
It makes the camera resolution higher
It adds more cameras to the robot

Chapter 5: 15-Minute Tasks

The headline result: MEM enables robots to perform tasks lasting up to 15 minutes with dozens of subtasks. Previous VLAs maxed out at 1-2 minutes on structured multi-step tasks of this kind. MEM extends that horizon by roughly an order of magnitude.

Consider the full task "Clean the kitchen counter": (1) pick up dirty dishes, (2) load the dishwasher, (3) wipe the counter with a cloth, (4) put away leftover food in the fridge, (5) organize remaining items. Each subtask is a manipulation challenge in itself. But the sequencing -- knowing what's done, what's next, and what to skip -- requires memory that spans the entire 15 minutes.

Without memory, a VLA might complete subtask 1, then start subtask 3, then redo subtask 1 because it forgot it already did it. With MEM's long-term memory, the language summary accumulates: "Loaded 3 dishes into dishwasher. Wiped left half of counter." The robot reads this narrative and knows exactly where it left off.

Previous longest VLA tasks: pi-0 demonstrated 10-minute tasks (laundry folding), but these were single continuous manipulation sequences, not multi-step plans requiring memory of completed subtasks. MEM enables truly compositional long-horizon behavior where the robot must track progress across many distinct subtasks.

What specific capability does long-term memory add for 15-minute multi-step tasks?

Faster action generation
Better camera resolution
Tracking which subtasks are complete and which remain, so the robot doesn't repeat or skip steps -- the language summaries serve as a running progress log

Chapter 6: Results

MEM is evaluated on tasks that specifically require memory -- tasks where the information needed to act correctly is not available in the current frame.

Memory-dependent manipulation

Tasks where the robot must remember the location of objects it previously saw but that are now hidden. Example: "Put the red block in the box" where the red block was visible 30 seconds ago but is now behind an occluder. Without memory: random search. With MEM: direct retrieval from the remembered location.

Multi-step sequential tasks

Tasks with 5-10 sequential subtasks spanning 10-15 minutes. MEM significantly outperforms memoryless baselines, with the gap widening as task length increases. On 2-minute tasks, the memoryless baseline is only slightly worse. On 15-minute tasks, it fails almost completely while MEM maintains reasonable success rates.

Ablations: both memory types matter

Configuration	Short tasks (2 min)	Long tasks (15 min)
No memory (baseline)	Good	Fails
Short-term only	Better	Poor (forgets early steps)
Long-term only	Good (less needed)	Moderate (misses dynamics)
Both (MEM)	Best	Best by far

The ablation tells the story: Short-term memory alone helps with dynamic tracking but loses context over minutes. Long-term memory alone captures the big picture but misses recent dynamics. Together, they cover the full temporal range, and the combination is substantially better than either alone, especially on long tasks.

What does the ablation study reveal about the two memory types?

Both are necessary -- short-term captures recent dynamics while long-term preserves facts over minutes. The gap between "both" and "either alone" grows with task length
Short-term memory is sufficient for all tasks
Long-term memory is more important in all cases

Chapter 7: In-Context Adaptation

Memory enables something unexpected: in-context adaptation. If the robot can remember what happened earlier in the episode, it can learn from its own mistakes within a single task execution.

Consider a robot trying to open a sticky drawer. On the first attempt, it pulls with normal force and fails -- the drawer doesn't open. Without memory, it tries the same thing again (and again, and again). With MEM, the long-term memory records "attempted to open drawer with normal force -- failed." On the next attempt, the VLA reads this context and can adjust its strategy -- pulling harder or trying a different grip.

This is not explicit learning or optimization. The model doesn't update its weights. Instead, it's in-context learning -- the same phenomenon that lets language models improve their answers when given examples in the prompt. The memory acts as additional prompt context that informs future decisions.
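A toy illustration of the distinction: the parameters stay frozen and only the context changes between attempts. The scoring rule below is an invented stand-in for a learned policy, and the names `FROZEN_WEIGHTS` and `choose_strategy` are hypothetical. A real VLA would produce motor actions, not pick from a list of named strategies.

```python
# Fixed "parameters": a prior preference over strategies. Never modified below.
FROZEN_WEIGHTS = {"pull": 1.0, "pull harder": 0.8, "change grip": 0.6}


def choose_strategy(weights, context):
    """Toy policy: same weights every call; behavior depends on the context."""
    def score(name):
        # Discount any strategy that the memory text reports as failed.
        penalty = 1.0 if f"'{name}' failed" in context else 0.0
        return weights[name] - penalty
    return max(weights, key=score)


context = ""  # long-term memory is empty at the start of the episode
first = choose_strategy(FROZEN_WEIGHTS, context)

# The attempt fails, and the failure is written into memory as text.
context += f"Attempted to open drawer: '{first}' failed. "
second = choose_strategy(FROZEN_WEIGHTS, context)

print(first, "->", second)
```

The second call returns a different strategy even though `FROZEN_WEIGHTS` is untouched, which is the sense in which the adaptation happens in context.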

The parallel to few-shot prompting: In language models, providing examples in the prompt dramatically improves performance on novel tasks. MEM's memory summaries act as "examples from the robot's own experience" -- context that helps the VLA make better decisions without any weight updates. This is few-shot learning from self-experience.

The paper shows that MEM-equipped robots exhibit qualitatively different behavior on repeated attempts: they try alternative approaches, avoid previously failed strategies, and adapt to unexpected object properties. Memoryless robots show no such adaptation.

How does MEM enable in-context adaptation without updating model weights?

Memory summaries of past attempts serve as in-context examples -- the VLA reads "tried X, it failed" and adjusts its strategy, just like a language model improves with few-shot examples in the prompt
The model fine-tunes itself during the episode
A separate planning module selects alternative strategies

Chapter 8: Connections

MEM sits at the intersection of robot learning and cognitive science. The dual memory system mirrors a long tradition of research on human memory architectures.

MEM component	Cognitive parallel	ML parallel
Current frame	Iconic memory (~250ms)	Standard VLA input
Short-term video	Working memory (~seconds)	RNN hidden state / frame stacking
Long-term language	Episodic memory (~hours)	Retrieval-augmented generation (RAG)

In ML, the closest analogue is retrieval-augmented generation (RAG). RAG stores documents in a database and retrieves relevant ones at query time. MEM stores experience summaries and retrieves all of them at action time. The difference: RAG uses embedding-based retrieval, while MEM simply concatenates all summaries (the total is short enough to fit in context).
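The contrast can be sketched with a toy store of summaries. The two-dimensional embeddings and the query vector are made up for the example, and `rag_context` and `concat_context` are hypothetical helper names; a real RAG system would use a learned embedding model and a vector index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rag_context(query_vec, store, k):
    """RAG-style: return the k summaries most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def concat_context(store):
    """MEM-style, as described in the text: keep every summary, in order."""
    return [text for text, _ in store]


# (summary text, toy embedding) pairs in chronological order.
store = [
    ("Opened the left cabinet.", [1.0, 0.0]),
    ("Found coffee pods on the second shelf.", [0.8, 0.6]),
    ("Wiped the counter.", [0.0, 1.0]),
]
query = [1.0, 0.1]  # toy embedding of a cabinet-related query

print(rag_context(query, store, k=2))
print(concat_context(store))
```

Concatenation needs no retrieval step and cannot miss a relevant summary, at the price of a context that grows with the number of summaries.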

[Interactive visualization: Memory Timeline. A 15-minute task timeline showing what each memory layer contains at any moment; green events are remembered, red are forgotten.]

"The palest ink is better than the best memory."

— Chinese proverb, capturing why explicit memory representations outperform implicit ones
