VLAs are reactive -- they see the current frame and act, with no memory of what happened before. MEM adds two memory systems: short-term video memory for recent context and long-term language memory for semantic history. Together they enable 15-minute tasks and in-context adaptation.
Current VLAs are amnesiac. They take in the current camera frame, the task instruction, and proprioceptive state. They output the next action. Then they forget everything. The next timestep, they start from scratch with a new frame.
This works for simple pick-and-place tasks where all information is visible in the current frame. But consider a real kitchen task: "Make coffee and bring it to the desk." The robot needs to remember which drawer it already checked for coffee pods (short-term), and that the human mentioned "the blue mug, not the red one" two minutes ago (long-term).
Without memory, the robot is like a person with complete anterograde amnesia performing a 15-minute task. Every few hundred milliseconds, they lose all context and must re-derive everything from what they can currently see. Simple tasks survive this handicap. Complex, multi-step tasks become impossible.
Interactive demo: a robot searches cabinets for an object. Without memory, it re-checks the same cabinets.
MEM's first memory system captures recent visual context. Instead of processing only the current frame, the model compresses the last few seconds of video into a compact representation.
How? Through a video encoder. MEM takes the most recent N frames (e.g., the last 5-10 seconds of footage at a subsampled rate) and processes them through a temporal video encoder. This encoder is designed to compress video efficiently -- it captures motion patterns, object movements, and scene changes that a single frame would miss.
The key design choice: the video encoder produces a fixed-size output regardless of how many frames it processes. 10 frames or 100 frames get compressed to the same number of tokens. This keeps the number of memory tokens the VLA attends to, and therefore the policy's per-step cost, constant as the video history grows.
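The fixed-size property can be illustrated with a minimal sketch. This is not MEM's learned video encoder: `compress_frames` is a hypothetical stand-in that averages per-frame feature vectors over contiguous time segments, and the feature dimension and token count are arbitrary choices for illustration.

```python
from statistics import fmean


def compress_frames(frame_features, num_tokens=8):
    """Pool T per-frame feature vectors into exactly num_tokens vectors.

    Hypothetical stand-in for a learned temporal encoder: the time axis is
    split into num_tokens contiguous segments and each segment is averaged.
    """
    num_frames = len(frame_features)
    if num_frames < num_tokens:
        raise ValueError("need at least num_tokens frames")
    tokens = []
    for i in range(num_tokens):
        # Integer arithmetic keeps every segment non-empty.
        start = i * num_frames // num_tokens
        stop = (i + 1) * num_frames // num_tokens
        segment = frame_features[start:stop]
        tokens.append([fmean(column) for column in zip(*segment)])
    return tokens


# Toy clips: each frame is a 4-dimensional feature vector.
short_clip = [[float(t)] * 4 for t in range(10)]
long_clip = [[float(t)] * 4 for t in range(100)]

short_tokens = compress_frames(short_clip)
long_tokens = compress_frames(long_clip)
print(len(short_tokens), len(long_tokens))  # 8 8
```

Both clips come out as 8 tokens, so whatever consumes them sees the same sequence length no matter how much video went in. A real encoder would learn what to keep rather than averaging.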
Think of short-term memory as peripheral vision through time. You're not consciously cataloging every detail, but you have a rich sense of what just happened -- enough to maintain continuity and react to ongoing dynamics.
Short-term video memory captures the last few seconds. But a 15-minute task needs context from minutes ago. You can't process 15 minutes of video -- that's thousands of frames and would overwhelm any transformer.
MEM's solution: language summaries. Periodically (every 10-30 seconds), a separate VLM summarizes what just happened into a natural language description: "Opened the left cabinet. Found coffee pods on the second shelf. Took one pod. Closed cabinet." These summaries are accumulated over the task lifetime.
Why language? Because language is an extraordinarily efficient compression of visual experience. The sentence "Opened the left cabinet and found coffee pods" compresses thousands of video frames into a handful of tokens. And the VLA already understands language -- it was pre-trained on billions of text tokens. So language summaries slot naturally into the model's existing input format.
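A rough back-of-the-envelope calculation shows the scale of this compression. The frame rate, tokens per frame, and tokens per summary below are assumptions chosen for illustration, not figures from the paper. Only the 15-minute task length and the 10-30 second summary interval come from the text.

```python
TASK_SECONDS = 15 * 60
FRAMES_PER_SECOND = 25      # assumption: one frame every 40 ms
TOKENS_PER_FRAME = 256      # assumption: depends on the image encoder
SUMMARY_INTERVAL_S = 20     # midpoint of the 10-30 second range
TOKENS_PER_SUMMARY = 30     # assumption: a few short sentences

frames = TASK_SECONDS * FRAMES_PER_SECOND
raw_video_tokens = frames * TOKENS_PER_FRAME
summaries = TASK_SECONDS // SUMMARY_INTERVAL_S
summary_tokens = summaries * TOKENS_PER_SUMMARY

print(frames)            # 22500
print(raw_video_tokens)  # 5760000
print(summary_tokens)    # 1350
```

Under these assumptions the full language history is over three orders of magnitude smaller than the raw frame tokens.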
The summaries are generated by a frozen VLM (not the robot's own policy). This VLM watches the video stream and periodically produces summaries. The summaries are prepended to the robot's instruction prompt, creating a growing narrative of the task history.
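The accumulate-and-prepend loop can be sketched as follows. The names `LanguageMemory`, `summarize_clip`, and the `History:` / `Task:` prompt layout are hypothetical. `summarize_clip` is only a placeholder for the frozen VLM, which in practice would produce the text from video.

```python
from dataclasses import dataclass, field


@dataclass
class LanguageMemory:
    """Accumulates text summaries and prepends them to the instruction."""

    summaries: list[str] = field(default_factory=list)

    def add(self, summary: str) -> None:
        self.summaries.append(summary.strip())

    def build_prompt(self, instruction: str) -> str:
        if not self.summaries:
            return f"Task: {instruction}"
        history = " ".join(self.summaries)
        return f"History: {history}\nTask: {instruction}"


def summarize_clip(events: list[str]) -> str:
    """Placeholder for the frozen VLM: joins event strings into one summary."""
    return " ".join(events)


memory = LanguageMemory()
memory.add(summarize_clip(["Opened the left cabinet.",
                           "Found coffee pods on the second shelf."]))
memory.add(summarize_clip(["Took one pod.", "Closed cabinet."]))

prompt = memory.build_prompt("Make coffee and bring it to the desk.")
print(prompt)
```

The policy is unchanged by this scheme: it still receives a text prompt, only a longer one that carries the task history.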
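MEM combines both memory types into a single architecture. At each timestep, the input to the VLA becomes the current frame, the compressed short-term video tokens, the accumulated language summaries, and the task instruction together with proprioceptive state.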
The key insight is that these memories operate at different time scales:
| Memory type | Time scale | Resolution | Format |
|---|---|---|---|
| Current frame | ~40 ms | Pixel-level | Image tokens |
| Short-term (video) | 5-10 sec | Motion-level | Compressed video tokens |
| Long-term (language) | Minutes | Semantic-level | Language tokens |
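
One way to picture how the three time scales end up in a single model input is the sketch below. The container `MemoryInputs`, the function `build_context`, the placeholder token strings, and the ordering of the streams are all assumptions for illustration. The source does not specify how MEM orders or encodes them.

```python
from dataclasses import dataclass


@dataclass
class MemoryInputs:
    frame_tokens: list[str]        # current frame, pixel-level
    video_tokens: list[str]        # short-term memory, fixed size
    summary_tokens: list[str]      # long-term memory, grows slowly
    instruction_tokens: list[str]  # task instruction
    proprio: list[float]           # joint state, kept as raw numbers


def build_context(inputs: MemoryInputs) -> dict:
    """Concatenate the token streams in an assumed order: oldest context first."""
    tokens = (
        inputs.summary_tokens
        + inputs.instruction_tokens
        + inputs.video_tokens
        + inputs.frame_tokens
    )
    return {"tokens": tokens, "proprio": inputs.proprio}


step = MemoryInputs(
    frame_tokens=["<img>"] * 4,
    video_tokens=["<vid>"] * 2,
    summary_tokens="Opened the left cabinet .".split(),
    instruction_tokens="Make coffee".split(),
    proprio=[0.0, 0.1, -0.2],
)
context = build_context(step)
print(len(context["tokens"]))  # 13
```

Only the summary stream grows over the task. The frame and video streams stay the same size at every step.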
Interactive figure: what each memory system "sees" at different points in a 15-minute task.
Many real tasks are partially observable -- the robot can't see everything it needs from its current viewpoint. A cup might be hidden behind a box. The contents of a closed drawer are invisible. The human's instruction from 30 seconds ago is no longer on screen.
Without memory, partial observability is devastating. The robot has no way to know what's inside a drawer it opened and closed 10 seconds ago. It must either keep the drawer open (impractical) or re-open it every time it needs to know the contents (wasteful and slow).
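MEM's dual memory system handles both forms of partial observability. Short-term video memory retains what was visible seconds ago, such as the contents of a drawer that was just closed. Long-term language memory retains facts from earlier in the task, such as an instruction the human gave or a location recorded in a summary.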
The headline result: MEM enables robots to perform tasks lasting up to 15 minutes with dozens of subtasks. Previous VLAs maxed out at 1-2 minutes on structured tasks. MEM extends the horizon by an order of magnitude.
Consider the full task "Clean the kitchen counter": (1) pick up dirty dishes, (2) load the dishwasher, (3) wipe the counter with a cloth, (4) put away leftover food in the fridge, (5) organize remaining items. Each subtask is a manipulation challenge in itself. But the sequencing -- knowing what's done, what's next, and what to skip -- requires memory that spans the entire 15 minutes.
Without memory, a VLA might complete subtask 1, then start subtask 3, then redo subtask 1 because it forgot it already did it. With MEM's long-term memory, the language summary accumulates: "Loaded 3 dishes into dishwasher. Wiped left half of counter." The robot reads this narrative and knows exactly where it left off.
MEM is evaluated on tasks that specifically require memory -- tasks where the information needed to act correctly is not available in the current frame.
One category covers tasks where the robot must remember the location of objects it previously saw but that are now hidden. Example: "Put the red block in the box" where the red block was visible 30 seconds ago but is now behind an occluder. Without memory: random search. With MEM: direct retrieval from the remembered location.
Another category covers long-horizon tasks with 5-10 sequential subtasks spanning 10-15 minutes. MEM significantly outperforms memoryless baselines, with the gap widening as task length increases. On 2-minute tasks, the memoryless baseline is only slightly worse. On 15-minute tasks, it fails almost completely while MEM maintains reasonable success rates.
| Configuration | Short tasks (2 min) | Long tasks (15 min) |
|---|---|---|
| No memory (baseline) | Good | Fails |
| Short-term only | Better | Poor (forgets early steps) |
| Long-term only | Good (less needed) | Moderate (misses dynamics) |
| Both (MEM) | Best | Best by far |
Memory enables something unexpected: in-context adaptation. If the robot can remember what happened earlier in the episode, it can learn from its own mistakes within a single task execution.
Consider a robot trying to open a sticky drawer. On the first attempt, it pulls with normal force and fails -- the drawer doesn't open. Without memory, it tries the same thing again (and again, and again). With MEM, the long-term memory records "attempted to open drawer with normal force -- failed." On the next attempt, the VLA reads this context and can adjust its strategy -- pulling harder or trying a different grip.
This is not explicit learning or optimization. The model doesn't update its weights. Instead, it's in-context learning -- the same phenomenon that lets language models improve their answers when given examples in the prompt. The memory acts as additional prompt context that informs future decisions.
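The difference memory makes can be shown with a toy simulation. This is only a sketch: in MEM the policy is a neural network conditioned on the summaries, whereas `choose_strategy` here is a hypothetical hand-written rule, and the drawer, the strategy names, and the memory string format are invented for illustration.

```python
CANDIDATES = ["normal pull", "firm pull", "different grip"]


def drawer_opens(strategy: str) -> bool:
    """Simulated sticky drawer: only a firm pull works."""
    return strategy == "firm pull"


def choose_strategy(memory: list[str]) -> str:
    """Pick the first candidate that memory has not recorded as failed."""
    for strategy in CANDIDATES:
        if f"{strategy} -- failed" not in memory:
            return strategy
    return CANDIDATES[-1]


def run_episode(use_memory: bool, max_attempts: int = 3) -> list[str]:
    memory: list[str] = []
    tried: list[str] = []
    for _ in range(max_attempts):
        # A memoryless policy always sees an empty history.
        strategy = choose_strategy(memory if use_memory else [])
        tried.append(strategy)
        if drawer_opens(strategy):
            break
        memory.append(f"{strategy} -- failed")  # summary written after the attempt
    return tried


with_memory = run_episode(use_memory=True)
without_memory = run_episode(use_memory=False)
print(with_memory)     # ['normal pull', 'firm pull']
print(without_memory)  # ['normal pull', 'normal pull', 'normal pull']
```

No parameters change between attempts in either run. The only difference is what the decision function is allowed to read.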
The paper shows that MEM-equipped robots exhibit qualitatively different behavior on repeated attempts: they try alternative approaches, avoid previously failed strategies, and adapt to unexpected object properties. Memoryless robots show no such adaptation.
MEM sits at the intersection of robot learning and cognitive science. The dual memory system mirrors a long tradition of research on human memory architectures.
| MEM component | Cognitive parallel | ML parallel |
|---|---|---|
| Current frame | Iconic memory (~250ms) | Standard VLA input |
| Short-term video | Working memory (~seconds) | RNN hidden state / frame stacking |
| Long-term language | Episodic memory (~hours) | Retrieval-augmented generation (RAG) |
In ML, the closest analogue is retrieval-augmented generation (RAG). RAG stores documents in a database and retrieves relevant ones at query time. MEM stores experience summaries and retrieves all of them at action time. The difference: RAG uses embedding-based retrieval, while MEM simply concatenates all summaries (the total is short enough to fit in context).
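The contrast between the two strategies can be sketched in a few lines. The bag-of-words `embed` function is a crude stand-in for a learned embedding model, and the summaries and query are made up for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (stand-in for a learned embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rag_context(summaries: list[str], query: str, k: int = 1) -> list[str]:
    """RAG-style: keep only the k summaries most similar to the query."""
    q = embed(query)
    ranked = sorted(summaries, key=lambda s: cosine(embed(s), q), reverse=True)
    return ranked[:k]


def concat_context(summaries: list[str]) -> list[str]:
    """MEM-style, as described above: keep every summary, in order."""
    return list(summaries)


summaries = [
    "opened left cabinet found coffee pods",
    "wiped left half of counter",
    "loaded three dishes into dishwasher",
]
query = "where are the coffee pods"

print(rag_context(summaries, query, k=1))
print(concat_context(summaries))
```

Retrieval discards whatever the similarity score ranks low, which can lose relevant history. Concatenation avoids that risk but only works while the total stays short enough to fit in context.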
Interactive timeline of a 15-minute task, showing what each memory layer contains at any moment. Green events are remembered, red are forgotten.