After every failure, the agent writes itself a critique. That critique lives in memory and shapes the next attempt. No gradient descent. No weight updates. Just words, stored and reused — verbal reinforcement learning.
Picture an agent playing a text-based game. It tries to pick up a key but walks into a wall instead. It tries again, with the same result. The agent has no memory of its failure. Each attempt starts fresh, making the same mistake in the same room with the same wrong action. It's learning nothing.
This is the fundamental problem with single-episode LLM agents. Each time the task resets, so does the agent's knowledge. A human playing the same game would think: "Last time I walked north when I meant to pick up the key — I should try 'pick up key' directly." The agent cannot do this because it has no persistent memory of what went wrong.
The Reflexion paper makes this concrete with three task families: AlfWorld (text-based household tasks), HotpotQA (multi-hop question answering), and code generation (HumanEval and LeetCode).
Each task provides a natural evaluation: a binary success-or-failure signal. Reflexion uses that signal to trigger self-reflection. The agent fails, writes a critique, and tries again, up to a fixed number of trials. It is learning without learning: improving performance without touching a single weight.
An agent that retries a task without memory repeats the same mistakes. With Reflexion memory, each failure informs the next try.
After a failed attempt, Reflexion asks the agent to do something humans do naturally: think about what went wrong. Given the task, the trajectory (every action and observation from the failed attempt), and the final outcome, the agent produces a verbal reflection — a free-text diagnosis of the failure and a strategy for the next attempt.
What makes a good reflection? The paper doesn't specify a fixed format. It uses few-shot examples of (trajectory, outcome) → reflection pairs to teach the LLM what a useful critique looks like. Good reflections tend to name the specific step that went wrong and state a concrete change for the next attempt.
The reflection is written by the same LLM that performed the task; there is no separate critic model. This works to the extent that LLMs can reason about their own output: they've seen enormous amounts of human writing about mistakes, corrections, and strategies. Given a failure, they can often identify plausible explanations and better approaches.
```python
# Pseudocode
def generate_reflection(task, trajectory, outcome, llm):
    # Build the reflection prompt
    prompt = REFLECT_TEMPLATE.format(
        task=task,
        trajectory=format_traj(trajectory),  # all thought/action/obs
        outcome=outcome,                     # e.g., "FAILED: wrong answer"
        examples=FEW_SHOT_REFLECTIONS,       # 1-3 examples of good critiques
    )
    # LLM generates verbal critique
    reflection = llm.generate(prompt)
    return reflection  # e.g., "I searched too broadly. Next time focus on..."
```
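To make the pseudocode above concrete, here is a minimal runnable sketch. The template wording, the `(thought, action, observation)` step format, and `stub_llm` are all assumptions made for illustration; the paper's actual prompts differ per task, and a real setup would replace `stub_llm` with a model call.

```python
from typing import Callable, List, Tuple

# Hypothetical template; not the paper's actual prompt wording.
REFLECT_TEMPLATE = (
    "{examples}\n\n"
    "Task: {task}\n"
    "Trajectory:\n{trajectory}\n"
    "Outcome: {outcome}\n"
    "Reflection:"
)

# One trajectory step: (thought, action, observation). Assumed format.
Step = Tuple[str, str, str]


def format_traj(trajectory: List[Step]) -> str:
    """Render each step as numbered thought/action/observation lines."""
    lines = []
    for i, (thought, action, obs) in enumerate(trajectory, start=1):
        lines.append(f"{i}. Thought: {thought}")
        lines.append(f"   Action: {action}")
        lines.append(f"   Observation: {obs}")
    return "\n".join(lines)


def generate_reflection(task: str, trajectory: List[Step], outcome: str,
                        llm: Callable[[str], str], examples: str = "") -> str:
    """Fill the template and pass it to any prompt -> text callable."""
    prompt = REFLECT_TEMPLATE.format(
        examples=examples,
        task=task,
        trajectory=format_traj(trajectory),
        outcome=outcome,
    )
    return llm(prompt).strip()


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return " I walked north instead of taking the key. Next time: 'take key'. "


traj = [("The key is here.", "go north", "You walk into a wall.")]
reflection = generate_reflection("Pick up the key", traj, "FAILED", stub_llm)
print(reflection)
```

Passing the LLM as a plain callable keeps the reflection step independent of any particular client library.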
Reflexion wraps any base agent (like ReAct) in an outer loop. The inner loop is the agent solving the task — taking actions, observing outcomes, thinking through steps. The outer loop handles failure: generate a reflection, store it in memory, reset the environment, and try again.
```python
def reflexion_agent(task, env, llm, max_trials=4):
    memory = []  # persistent verbal memory across trials
    for trial in range(max_trials):
        # Build context: task + all past reflections
        context = build_context(task, memory, FEW_SHOT)
        # Run inner agent (e.g. ReAct) to completion
        trajectory, success = run_agent(llm, env, context)
        if success:
            return trajectory  # done
        # Generate verbal reflection from failed trajectory
        reflection = reflect(llm, task, trajectory)
        memory.append(reflection)  # accumulate in memory
        env.reset()  # reset environment, not memory
    return "Max trials reached"
```
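The loop can be exercised end to end without a model. The sketch below is a toy under stated assumptions: `KeyRoom`, `toy_agent`, and `toy_reflect` are hypothetical stand-ins for the environment, the inner agent, and the self-reflection call, chosen so the effect of memory is visible in a few lines.

```python
from typing import List, Optional, Tuple


class KeyRoom:
    """Toy environment: the only winning action is 'take key'."""

    def reset(self) -> None:
        self.done = False

    def step(self, action: str) -> Tuple[str, bool]:
        if action == "take key":
            self.done = True
            return "You pick up the key.", True
        return "You walk into a wall.", False


def toy_agent(context: str) -> str:
    """Stand-in inner agent: changes its action only if memory suggests it."""
    return "take key" if "take key" in context else "go north"


def toy_reflect(action: str, observation: str) -> str:
    """Stand-in for the LLM self-reflection call."""
    return f"'{action}' led to: {observation} Next time use 'take key'."


def reflexion_loop(task: str, env: KeyRoom,
                   max_trials: int = 4) -> Tuple[Optional[int], List[str]]:
    """Return (trial number that succeeded or None, accumulated reflections)."""
    memory: List[str] = []
    for trial in range(1, max_trials + 1):
        env.reset()                                  # environment resets each trial
        context = task + "\n" + "\n".join(memory)    # memory does not reset
        action = toy_agent(context)
        observation, success = env.step(action)
        if success:
            return trial, memory
        memory.append(toy_reflect(action, observation))
    return None, memory


solved_on, memory = reflexion_loop("Pick up the key", KeyRoom())
print(solved_on, memory)
```

With the memory line removed from `context`, the toy agent would choose `go north` on every trial, which is the memoryless failure mode described at the start.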
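This explainer distinguishes three kinds of memory: short-term, episodic, and long-term. (The paper itself uses two labels: the current trajectory is short-term memory, and the stored reflections are long-term memory.) Understanding the distinction helps clarify both how Reflexion works and where it can fail.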
The context window is finite. With long reflections and many trials, the episodic memory can overflow. Reflexion handles this with a simple priority: keep the most recent reflections, drop older ones if needed. Alternatively, a summarization step can compress multiple reflections into one.
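The keep-the-most-recent policy can be sketched with a bounded queue. The class name `ReflectionMemory` and the default cap of three items are assumptions for illustration, not values taken from the text above.

```python
from collections import deque
from typing import Deque


class ReflectionMemory:
    """Keeps only the most recent `max_items` reflections."""

    def __init__(self, max_items: int = 3) -> None:
        # A deque with maxlen drops the oldest entry when a new one is appended.
        self._items: Deque[str] = deque(maxlen=max_items)

    def add(self, reflection: str) -> None:
        self._items.append(reflection)

    def as_prompt(self) -> str:
        """Render stored reflections as a bullet list, oldest first."""
        return "\n".join(f"- {r}" for r in self._items)


mem = ReflectionMemory(max_items=2)
for note in ["checked wrong room", "picked wrong pan", "skipped heating"]:
    mem.add(note)
print(mem.as_prompt())
```

Here the first note is evicted once the third arrives. A summarization variant would replace the eviction with a call that merges the oldest entries into one.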
| Memory Type | What It Stores | Lifetime | Editable? |
|---|---|---|---|
| Short-term | Current trial trajectory | One trial | Via generation |
| Episodic | Verbal reflections from past trials | All trials | Via appending |
| Long-term | Pre-trained LLM weights | Permanent | No (Reflexion doesn't fine-tune) |
The episodic memory architecture also explains why Reflexion scales gracefully: storing a reflection adds only prompt tokens, not a training step. Adding a third trial's reflection is as cheap as adding a sentence to a document. Compare this to gradient-based RL, where learning from each additional episode requires forward and backward passes through the model.
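A rough way to see this cost is to track how the context grows as reflections accumulate. The sketch below uses whitespace word counts as a crude proxy for tokens (an assumption; real tokenizers count differently), and the task and reflection strings are made up.

```python
from typing import List


def context_words(task: str, reflections: List[str]) -> int:
    """Rough size proxy: whitespace-separated words, not real tokens."""
    return len(task.split()) + sum(len(r.split()) for r in reflections)


task = "Heat a pan and put it on the counter"
reflections = [
    "Find the pan before using the microwave",
    "Open the microwave before placing the pan",
    "Put the pan on the counter after heating",
]

# Context size seen by trial 1, 2, 3, 4 (each trial sees all prior reflections).
sizes = [context_words(task, reflections[:k]) for k in range(len(reflections) + 1)]
print(sizes)
```

Each trial's context grows by the length of one reflection, so the added cost per trial is linear in reflection length and involves no gradient computation.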
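The natural competitor for Reflexion is fine-tuning: update the model's weights using the failed trajectory as a negative training signal (or the successful trajectory as a positive one). This is the standard RL approach. Reflexion deliberately avoids it: it needs no access to model weights, each trial costs only text generation, there is no risk of catastrophic forgetting, and what the agent has learned is human-readable.
But Reflexion has its own weaknesses compared to fine-tuning: what it learns is limited by the context window and does not transfer across tasks. The trade-offs side by side: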
| Property | Fine-Tuning | Reflexion |
|---|---|---|
| Requires model weights | Yes | No (API-only works) |
| Compute cost per trial | Very high (full backward pass) | Low (text generation) |
| Knowledge persistence | Permanent (in weights) | Context-window-limited |
| Cross-task transfer | Yes (if trained on many tasks) | No (per-task memory) |
| Risk of forgetting | Yes (catastrophic forgetting) | None |
| Interpretability | Black box weights | Human-readable reflections |
Reflexion was evaluated on three very different task families. The diversity is the point: verbal self-reflection is a general mechanism that works wherever (1) there is a clear success/failure signal and (2) the failure modes can be diagnosed in words.
AlfWorld is a text-based household environment. 134 test tasks from 6 categories: pick up, put in, cool, heat, clean, examine. The agent must navigate rooms, interact with objects, and achieve multi-step goals in natural language. Failure modes include wrong room, wrong object, wrong sequence.
Reflexion's reflections on AlfWorld failures look like: "I picked up the pan before heating it, but the task requires heating a pan already in the microwave. I should first identify which pan is already placed, then operate the appliance." This kind of sequential reasoning correction transfers directly to the next attempt.
HotpotQA requires multi-hop reasoning over Wikipedia. Failure mode: searching too broadly or in the wrong direction. Reflexion reflects: "My search for 'French revolution leaders' was too broad. I need the specific person who was executed on Thermidor 9 — searching 'Robespierre execution date' directly would have worked." The next trial uses a targeted search strategy.
Code generation is perhaps the most natural fit. Unit tests provide a precise, binary signal. When a test fails, the failure message contains the exact input, expected output, and actual output. A reflection might say: "My solution fails for empty lists because I don't check for the edge case before calling max(). I need to add an early return for len(nums) == 0." The next attempt includes this check.
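A sketch of how unit-test feedback of this kind might be produced. The `run_tests` helper and the two candidate functions are hypothetical, chosen to mirror the empty-list example; the paper's actual test harness is not shown here.

```python
from typing import Any, Callable, List, Optional, Tuple

Case = Tuple[tuple, Any]  # (positional args, expected return value)


def run_tests(fn: Callable[..., Any],
              cases: List[Case]) -> Tuple[bool, Optional[str]]:
    """Return (passed, message); the message describes the first failure."""
    for args, expected in cases:
        try:
            actual = fn(*args)
        except Exception as exc:
            return False, (f"input={args!r} expected={expected!r} "
                           f"raised={type(exc).__name__}: {exc}")
        if actual != expected:
            return False, f"input={args!r} expected={expected!r} actual={actual!r}"
    return True, None


def largest_v1(nums):
    return max(nums)  # first attempt: raises ValueError on an empty list


def largest_v2(nums):
    if len(nums) == 0:  # the check suggested by the reflection
        return None
    return max(nums)


cases: List[Case] = [(([3, 1, 2],), 3), (([],), None)]
ok1, msg1 = run_tests(largest_v1, cases)
ok2, msg2 = run_tests(largest_v2, cases)
print(ok1, msg1)
print(ok2, msg2)
```

The failure message from the first attempt is the raw material for the reflection: it names the input that broke the function, which is what lets the critique be specific.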
The specificity and actionability of a reflection determine whether it helps the next attempt.
Across all three task families, Reflexion shows consistent improvement over the base agent (ReAct or simple prompting) with only 1-4 additional trials.
| Task | Metric | Base Agent | Reflexion (trials=4) | Gain (percentage points) |
|---|---|---|---|---|
| AlfWorld (avg) | Success Rate | 71% | 97% | +26 |
| HotpotQA | EM Score | 30% | 37% | +7 |
| HumanEval | pass@1 (GPT-4) | 80% | 91% | +11 |
| LeetCode (easy) | pass@1 | 40% | 57% | +17 |
The AlfWorld improvement is the most striking: an absolute gain of 26 percentage points from verbal self-reflection alone, across 134 household tasks. The paper attributes this to AlfWorld's structured failure modes: the same types of mistakes (going to the wrong room, not recognizing synonyms for objects) recur across tasks. Reflexion identifies these patterns and eliminates them.
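Since the gains are absolute differences between percentages, it can help to compute the relative improvement as well. The snippet below is plain arithmetic on the base and Reflexion scores from the results table; the `gains` helper is introduced here for illustration.

```python
from typing import Dict, Tuple

# (base %, Reflexion %) as listed in the results table.
results: Dict[str, Tuple[int, int]] = {
    "AlfWorld": (71, 97),
    "HotpotQA": (30, 37),
    "HumanEval": (80, 91),
    "LeetCode (easy)": (40, 57),
}


def gains(base: float, improved: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute = improved - base
    relative = 100.0 * absolute / base
    return absolute, round(relative, 1)


for task_name, (base, improved) in results.items():
    absolute, relative = gains(base, improved)
    print(f"{task_name}: +{absolute} points absolute, +{relative}% relative")
```

For AlfWorld, 71 to 97 is 26 points absolute, which is roughly a 36.6% relative improvement over the base agent.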
Failure analysis from the paper: tasks where Reflexion fails tend to involve hallucinated reflections ("I think I failed because X" where X is wrong), multi-step compounding errors where no single reflection captures the root cause, and tasks requiring genuine knowledge the model simply doesn't have. Verbal reflection can correct strategy; it cannot supply missing knowledge.
Success rate across trials for each task domain. Most of the gain comes from the first reflection.
Reflexion sits at the intersection of RL theory, agentic AI, and cognitive science. Its descendants and relatives reveal how the field has been building toward language-native learning.
| Method | Key Idea | vs Reflexion |
|---|---|---|
| ReAct (Yao 2022) | Interleave thought + action | Reflexion's inner loop; Reflexion adds outer loop + memory |
| Reflexion (this paper) | Verbal critique → episodic memory → next trial | — |
| Self-Refine (Madaan 2023) | Critique and refine a single output iteratively | Refines within one episode; Reflexion uses environment feedback across episodes |
| LATS (Zhou 2023) | Monte Carlo tree search over agent trajectories | Reflexion: linear retry. LATS: tree search with backtracking. |
| Voyager (Wang 2023) | Skill library grows with reflection | Reflexion adds memories; Voyager adds reusable code skills |
| Constitutional AI (Bai 2022) | Self-critique for alignment | Different goal (safety) same mechanism (self-evaluation + revision) |
Shinn, Cassano, Labash, Gopalan, Narasimhan, Yao. "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS 2023. arXiv:2303.11366
"The ability to learn from failure in language — without updating a single weight — is perhaps the most human thing a language model has ever done."