Shinn, Cassano, Labash, Gopalan, Narasimhan, Yao — NeurIPS 2023

Reflexion: Verbal Reinforcement Learning

After every failure, the agent writes itself a critique. That critique lives in memory and shapes the next attempt. No gradient descent. No weight updates. Just words, stored and reused — verbal reinforcement learning.

Prerequisites: what an LLM agent is, plus basic intuition about reinforcement learning. Reading ReAct first helps but isn't required.
8 chapters · 4+ simulations · arXiv:2303.11366

Chapter 0: The Problem

You're playing a text-based game. You try to pick up a key, but instead you walk into a wall. You try again — same result. The agent has no memory of its failure. Each attempt starts fresh, making the same mistake in the same room with the same wrong action. It's learning nothing.

This is the fundamental problem with single-episode LLM agents. Each time the task resets, so does the agent's knowledge. A human playing the same game would think: "Last time I walked north when I meant to pick up the key — I should try 'pick up key' directly." The agent cannot do this because it has no persistent memory of what went wrong.

RL's answer and its cost: Standard reinforcement learning addresses this with policy gradient updates. After each failure, you compute a reward signal, back-propagate a gradient estimate through billions of parameters, and nudge the weights. That works, but it typically needs many sampled episodes and substantial GPU time per task, requires access to the model's weights, and produces a model that can only be interrogated as a black box: you can't ask it why it failed. Reflexion's answer: let the agent write down what went wrong in plain English, store it, and read it next time.

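Three task families make this concrete in the Reflexion paper: sequential decision making (AlfWorld), multi-hop reasoning (HotpotQA), and programming (HumanEval and LeetCode).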

Each task provides a natural evaluation: success or failure with a binary signal. Reflexion uses that signal to trigger self-reflection. The agent fails, writes a critique, and tries again — up to a fixed number of trials. It is learning without learning — improving performance without touching a single weight.

[Interactive simulation: The No-Memory Problem. A naive agent without memory repeats the same mistake on every attempt; with Reflexion memory enabled, each failure improves the next try.]

Why does a standard LLM agent repeat the same mistakes across task attempts?

Chapter 1: Self-Reflection

After a failed attempt, Reflexion asks the agent to do something humans do naturally: think about what went wrong. Given the task, the trajectory (every action and observation from the failed attempt), and the final outcome, the agent produces a verbal reflection — a free-text diagnosis of the failure and a strategy for the next attempt.

The analogy: A chess player who just lost replays the game in their head: "I should have traded bishops earlier. I got too focused on the kingside attack and missed the pawn structure collapsing." The reflection is not a new move — it's an updated mental model that shapes the next game. Reflexion is this, for LLM agents. The "mental model update" is a paragraph of English, stored in memory.

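What makes a good reflection? The paper doesn't specify a fixed format. Instead, it uses few-shot examples of (trajectory, outcome) → reflection pairs to teach the LLM what a useful critique looks like. Good reflections tend to be specific and actionable: they name the concrete mistake in the failed trajectory and state what to do differently on the next attempt.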

The reflection is written by the same LLM that performed the task; there is no separately trained critic model. This works because LLMs have seen enormous amounts of human writing about mistakes, corrections, and strategies, so given a failure they can usually identify plausible explanations and better approaches. The diagnosis is not guaranteed to be correct, a point the failure analysis in Chapter 6 returns to.

reflection = LLM(task ‖ trajectory_failed ‖ outcome ‖ few-shot examples)
python (pseudocode)
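# Placeholder template and few-shot text; the paper's actual prompts differ per task.
REFLECT_TEMPLATE = "{examples}\n\nTask: {task}\n{trajectory}\nOutcome: {outcome}\nReflection:"
FEW_SHOT_REFLECTIONS = "..."  # 1-3 examples of good critiques

def format_traj(trajectory):
    """Join the trial's thought/action/observation strings, one per line."""
    return "\n".join(trajectory)

def generate_reflection(task, trajectory, outcome, llm):
    """Ask the LLM that attempted the task to critique its own failed trajectory.

    `llm` is assumed to be any object with a generate(prompt) -> str method.
    """
    prompt = REFLECT_TEMPLATE.format(
        task=task,
        trajectory=format_traj(trajectory),  # all thoughts, actions, observations
        outcome=outcome,                     # e.g. "FAILED: wrong answer"
        examples=FEW_SHOT_REFLECTIONS,
    )
    # The LLM writes the verbal critique,
    # e.g. "I searched too broadly. Next time focus on..."
    return llm.generate(prompt)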
What is a verbal reflection in Reflexion, and who generates it?

Chapter 2: The Verbal RL Loop — Interactive

Reflexion wraps any base agent (like ReAct) in an outer loop. The inner loop is the agent solving the task — taking actions, observing outcomes, thinking through steps. The outer loop handles failure: generate a reflection, store it in memory, reset the environment, and try again.

[Interactive simulation: Reflexion Loop. A coding task (write a function to find the second largest element) in which the agent fails, reflects, and improves. Each trial's memory panel shows the reflections carried forward.]

Full data flow: Trial 1 context = [task description + few-shot examples]. Agent generates solution → fails tests → reflection is generated from (task, full trajectory, test output). Trial 2 context = [task + few-shot + reflection from trial 1]. Agent generates improved solution. Trial N context includes all previous reflections, summarized if too long. The LLM's weights never change — only the context grows richer with each trial.
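The data flow above can be sketched as a small context builder. This is a minimal illustration, not the paper's prompt format: the section header text, the bullet layout, and the placeholder few-shot string are assumptions made for the example.

```python
def build_context(task: str, reflections: list[str], few_shot: str) -> str:
    """Assemble one trial's prompt: few-shot examples, past reflections, then the task."""
    parts = [few_shot]
    if reflections:
        # Hypothetical layout: one bullet per stored reflection.
        lessons = "\n".join(f"- {r}" for r in reflections)
        parts.append("Reflections from previous failed attempts:\n" + lessons)
    parts.append(f"Task: {task}")
    return "\n\n".join(parts)


task = "Write a function that returns the second largest element of a list."
few_shot = "Example: (worked example omitted)"

# Trial 1 sees only the task and the few-shot examples.
trial_1 = build_context(task, [], few_shot)
# Trial 2 additionally sees the reflection written after trial 1 failed.
trial_2 = build_context(task, ["I ignored duplicate values; deduplicate first."], few_shot)

print(trial_1)
print("---")
print(trial_2)
```

The only thing that differs between the two prompts is the reflection section, which is the whole learning signal: the model and the task description stay the same.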
python
def reflexion_agent(task, env, llm, max_trials=4):
    """Retry a task up to max_trials times, carrying verbal reflections forward.

    build_context, run_agent, and FEW_SHOT are assumed to be defined elsewhere.
    """
    memory = []  # persistent verbal memory across trials

    for trial in range(max_trials):
        # Build context: task + few-shot examples + all past reflections
        context = build_context(task, memory, FEW_SHOT)

        # Run the inner agent (e.g. ReAct) to completion
        trajectory, success = run_agent(llm, env, context)

        if success:
            return trajectory  # done

        # Generate a verbal reflection from the failed trajectory
        reflection = generate_reflection(task, trajectory, "FAILED", llm)
        memory.append(reflection)  # accumulate in memory
        env.reset()                # reset the environment, not the memory

    return None  # no trial succeeded within max_trials
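To see the outer loop run end to end without an API call, here is a toy version in which the LLM is replaced by stubs. Everything in it is an assumption made for illustration: `stub_actor` and `stub_reflect` are hypothetical stand-ins for the model, and the two test cases are invented for the second-largest-element task.

```python
def run_tests(fn):
    """Return a list of failure messages; an empty list means every test passed."""
    cases = [([3, 1, 2], 2), ([5, 5, 4], 4)]
    failures = []
    for nums, expected in cases:
        got = fn(nums)
        if got != expected:
            failures.append(f"input={nums} expected={expected} got={got}")
    return failures


def naive(nums):
    return sorted(nums)[-2]       # wrong when the maximum appears twice


def deduped(nums):
    return sorted(set(nums))[-2]  # removes duplicates before ranking


def stub_actor(memory):
    """Stand-in for the LLM: changes strategy only if a reflection mentions duplicates."""
    return deduped if any("duplicate" in m for m in memory) else naive


def stub_reflect(failures):
    """Stand-in for the self-reflection call: turns test output into a lesson."""
    return f"Failed on {failures[0]}. The maximum can be duplicated; remove duplicates first."


def toy_reflexion(max_trials=4):
    memory, log = [], []
    for trial in range(1, max_trials + 1):
        failures = run_tests(stub_actor(memory))
        log.append((trial, not failures))
        if not failures:
            break
        memory.append(stub_reflect(failures))  # memory persists; nothing else does
    return log, memory


log, memory = toy_reflexion()
print(log)
print(memory)
```

The stub succeeds on the second trial only because the stored reflection changed what the actor did. A real LLM is not guaranteed to act on its reflection this reliably.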
What is the key difference between what resets between trials and what persists?

Chapter 3: Memory

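It helps to distinguish three kinds of memory in Reflexion: short-term, episodic, and long-term. Note that the paper itself uses "long-term memory" for the stored reflections; in this guide that store is called episodic memory, and "long-term" refers to the frozen model weights. Understanding the distinction helps clarify both how Reflexion works and where it can fail.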

Short-term memory is the agent's context window during a single trial — the current task description, the few-shot examples, and the accumulated thought/action/observation trace. This is what any LLM agent has. It's erased when the trial ends.
Episodic memory is the log of verbal reflections from past trials. This is Reflexion's key addition. After each failed trial, the reflection is appended to an episodic memory store. The next trial's context begins with this memory: "In a previous attempt, I failed because X. My strategy this time is Y." Episodic memory is what makes Reflexion more than a single-shot agent.
Long-term memory is the LLM's pre-trained weights — everything it knows from training. Reflexion never modifies this. Long-term memory is fixed; episodic memory is the learning signal.
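The lifetimes of the two mutable memories can be made explicit with a small data structure. This is a sketch under assumed names (`AgentMemory`, `record_step`, `end_trial` are hypothetical); long-term memory does not appear because the weights are never touched.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentMemory:
    """Hypothetical container for the two memories Reflexion can write to."""
    short_term: list[str] = field(default_factory=list)  # current trial's trace
    episodic: list[str] = field(default_factory=list)    # reflections across trials

    def record_step(self, step: str) -> None:
        """Append one thought, action, or observation to the current trial."""
        self.short_term.append(step)

    def end_trial(self, reflection: Optional[str]) -> None:
        """Keep the reflection (if the trial failed) and erase the trial trace."""
        if reflection is not None:
            self.episodic.append(reflection)
        self.short_term.clear()


mem = AgentMemory()
mem.record_step("Action: go north")
mem.record_step("Observation: you walk into a wall")
mem.end_trial("I walked north instead of picking up the key.")

print(mem.short_term)
print(mem.episodic)
```

After `end_trial`, the trajectory is gone and only the one-sentence lesson remains, which is what the next trial's context is built from.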

Memory Limits and What to Do About Them

The context window is finite. With long reflections and many trials, the episodic memory can overflow. Reflexion handles this with a simple priority: keep the most recent reflections, drop older ones if needed. Alternatively, a summarization step can compress multiple reflections into one.
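The "keep the most recent" policy can be sketched in a few lines. Both helpers below are illustrative: the default of three reflections and the character budget are assumptions, and a real system would more likely count tokens than characters.

```python
def truncate_memory(reflections: list[str], max_reflections: int = 3) -> list[str]:
    """Keep only the most recent reflections, dropping the oldest first."""
    if max_reflections <= 0:
        return []  # guard: reflections[-0:] would return the whole list
    return reflections[-max_reflections:]


def fit_to_budget(reflections: list[str], max_chars: int) -> list[str]:
    """Keep the newest reflections whose combined length fits within max_chars."""
    kept, used = [], 0
    for reflection in reversed(reflections):  # walk from newest to oldest
        if used + len(reflection) > max_chars:
            break
        kept.append(reflection)
        used += len(reflection)
    return list(reversed(kept))               # restore chronological order


history = ["r1: wrong room", "r2: wrong object", "r3: wrong order", "r4: forgot to heat"]

print(truncate_memory(history))
print(fit_to_budget(history, max_chars=40))
```

Either policy loses the oldest lessons first, which is the trade-off a summarization step is meant to soften.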

| Memory Type | What It Stores | Lifetime | Editable? |
|---|---|---|---|
| Short-term | Current trial trajectory | One trial | Via generation |
| Episodic | Verbal reflections from past trials | All trials | Via appending |
| Long-term | Pre-trained LLM weights | Permanent | No (Reflexion doesn't fine-tune) |

The episodic memory architecture also explains why Reflexion scales gracefully: storing a reflection costs only extra prompt tokens, not training compute. Adding a third trial's reflection is as cheap as adding a sentence to a document. Compare this to gradient-based RL, where each additional update requires a forward and backward pass through the model.

[Interactive simulation: Memory Accumulation. Shows how episodic memory grows across trials; each row is one trial's context, and each trial sees all prior critiques.]
What is episodic memory in Reflexion, and how does it differ from the model's long-term memory?

Chapter 4: vs Fine-Tuning

The natural competitor for Reflexion is fine-tuning: update the model's weights using the failed trajectory as a negative training signal (or the successful trajectory as a positive one). This is the standard RL approach. Reflexion deliberately avoids this — why?

Three reasons Reflexion doesn't fine-tune: First, access: you often don't have access to the model's weights. Models such as GPT-4, Claude, and Gemini are served through APIs, so direct weight updates are usually not an option. Second, cost: one fine-tuning run on a modern LLM costs orders of magnitude more than generating a paragraph of reflection text. Third, catastrophic forgetting: fine-tuning on specific task failures can degrade the model's performance on other tasks. Because Reflexion changes only the context, it cannot overwrite anything the model already knows.

But Reflexion has its own failure modes compared to fine-tuning:

| Property | Fine-Tuning | Reflexion |
|---|---|---|
| Requires model weights | Yes | No (API-only works) |
| Compute cost per trial | Very high (full backward pass) | Low (text generation) |
| Knowledge persistence | Permanent (in weights) | Context-window-limited |
| Cross-task transfer | Yes (if trained on many tasks) | No (per-task memory) |
| Risk of forgetting | Yes (catastrophic forgetting) | None |
| Interpretability | Black box weights | Human-readable reflections |
The sweet spot: Reflexion is best for tasks where you need to improve quickly with a small number of trials, using an API-only model, on a specific task instance. Fine-tuning is better when you need general improvement across a class of tasks and have the compute budget and model access. In practice, the two can be combined: fine-tune with successful Reflexion trajectories as positive examples.
What is the most practical advantage of Reflexion over fine-tuning for most users?

Chapter 5: Applications

Reflexion was evaluated on three very different task families. The diversity is the point: verbal self-reflection is a general mechanism that works wherever (1) there is a clear success/failure signal and (2) the failure modes can be diagnosed in words.

Sequential Decision Making: AlfWorld

AlfWorld is a text-based household environment. 134 test tasks from 6 categories: pick up, put in, cool, heat, clean, examine. The agent must navigate rooms, interact with objects, and achieve multi-step goals in natural language. Failure modes include wrong room, wrong object, wrong sequence.

Reflexion's reflections on AlfWorld failures look like: "I picked up the pan before heating it, but the task requires heating a pan already in the microwave. I should first identify which pan is already placed, then operate the appliance." This kind of sequential reasoning correction transfers directly to the next attempt.

Reasoning: HotpotQA

HotpotQA requires multi-hop reasoning over Wikipedia. Failure mode: searching too broadly or in the wrong direction. Reflexion reflects: "My search for 'French revolution leaders' was too broad. I need the specific person who was arrested on 9 Thermidor; searching 'Robespierre 9 Thermidor' directly would have worked." The next trial uses a targeted search strategy.

Programming: HumanEval + LeetCode

Code generation is perhaps the most natural fit. Unit tests provide a precise, binary signal. When a test fails, the failure message contains the exact input, expected output, and actual output. A reflection might say: "My solution fails for empty lists because I don't check for the edge case before calling max(). I need to add an early return for len(nums) == 0." The next attempt includes this check.
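The way a failing unit test becomes the outcome text for reflection can be sketched as follows. The `evaluate` helper and its message format are assumptions for illustration, and `largest` is a deliberately buggy candidate that mirrors the empty-list example above.

```python
def largest(nums):
    """Deliberately buggy candidate: max() raises ValueError on an empty list."""
    return max(nums)


def evaluate(fn, cases):
    """Run fn on (args, expected) cases and return (passed, feedback).

    feedback has one line per failing case with input, expected, and actual
    values, which is the text a reflection prompt would receive as the outcome.
    """
    lines = []
    for args, expected in cases:
        try:
            actual = fn(*args)
        except Exception as exc:  # report crashes as feedback, don't abort the run
            actual = f"{type(exc).__name__}: {exc}"
        if actual != expected:
            lines.append(f"input={args!r} expected={expected!r} actual={actual!r}")
    return (not lines), "\n".join(lines)


# Hypothetical test suite: the second case expects None for an empty list.
cases = [(([3, 1, 2],), 3), (([],), None)]
passed, feedback = evaluate(largest, cases)

print(passed)
print(feedback)
```

Because the feedback names the exact failing input, the reflection has something concrete to diagnose, which is why unit tests make such a convenient signal.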

Why programming is the hardest test: Coding failures are diverse: off-by-one errors, missing edge cases, wrong algorithm choice, type errors. If verbal self-reflection works here, it's doing something genuinely useful, not just pattern-matching to obvious corrections. The paper reports that Reflexion achieves 91% pass@1 on HumanEval with GPT-4, compared to 80% for GPT-4 alone. That gap of 11 percentage points is what the reflection loop adds on top of the base model.

[Interactive panel: Reflection Quality by Task Type. Shows example reflection text for each domain. The specificity and actionability of the reflection determine whether it helps the next attempt.]

Why is programming one of the strongest applications of Reflexion?

Chapter 6: Results

Across all three task families, Reflexion shows consistent improvement over the base agent (ReAct or simple prompting) with only 1-4 additional trials.

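| Task | Metric | Base Agent | Reflexion (trials=4) | Gain (percentage points) |
|---|---|---|---|---|
| AlfWorld (avg) | Success rate | 71% | 97% | +26 |
| HotpotQA | EM score | 30% | 37% | +7 |
| HumanEval | pass@1 (GPT-4) | 80% | 91% | +11 |
| LeetCode (easy) | pass@1 | 40% | 57% | +17 |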

The AlfWorld improvement is the most striking: an absolute gain of 26 percentage points from verbal self-reflection alone, across 134 household tasks. The paper attributes this to AlfWorld's structured failure modes: the same types of mistakes (going to the wrong room, not recognizing synonyms for objects) recur across tasks. Reflexion identifies these patterns and eliminates them.
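The gains quoted here are absolute differences in percentage points, which is easy to confuse with relative improvement. A short calculation, using the AlfWorld and HumanEval figures exactly as they appear in the table above, shows the difference; the `gains` helper is a hypothetical name introduced for the example.

```python
def gains(base: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - base
    relative = 100.0 * absolute / base
    return absolute, relative


# (task, base agent %, Reflexion %) as listed in the results table.
rows = [("AlfWorld", 71, 97), ("HumanEval", 80, 91)]

for name, base, improved in rows:
    absolute, relative = gains(base, improved)
    print(f"{name}: +{absolute:.0f} points absolute, +{relative:.1f}% relative")
```

An 11-point gain from a base of 80% is a relative improvement of 13.75%, so stating which of the two is meant avoids overstating or understating the result.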

Sample efficiency: Most of Reflexion's gain comes from the first reflection (trial 1 → trial 2). By trial 4, returns diminish. This suggests verbal self-reflection is most valuable for identifying clear, correctable mistakes — not for systematically improving on hard problems. For the hard tail of tasks that Reflexion still fails on after 4 trials, a different approach (more compute, better tools, fine-tuning) is needed.

Failure analysis from the paper: tasks where Reflexion fails tend to involve hallucinated reflections ("I think I failed because X" where X is wrong), multi-step compounding errors where no single reflection captures the root cause, and tasks requiring genuine knowledge the model simply doesn't have. Verbal reflection can correct strategy; it cannot supply missing knowledge.

[Interactive chart: Trial-by-Trial Success Rate. Success rate across trials for each task domain; most of the gain comes from the first reflection.]
When does Reflexion fail to improve performance, even after multiple trials?

Chapter 7: Connections

Reflexion sits at the intersection of RL theory, agentic AI, and cognitive science. Its descendants and relatives reveal how the field has been building toward language-native learning.

| Method | Key Idea | vs Reflexion |
|---|---|---|
| ReAct (Yao et al. 2022) | Interleave thought + action | Reflexion's inner loop; Reflexion adds outer loop + memory |
| Reflexion (this paper) | Verbal critique → episodic memory → next trial | n/a |
| Self-Refine (Madaan et al. 2023) | Critique and refine a single output iteratively | Refines one output using self-generated feedback; Reflexion uses environment feedback across episodes |
| LATS (Zhou et al. 2023) | Monte Carlo tree search over agent trajectories | Reflexion: linear retry. LATS: tree search with backtracking. |
| Voyager (Wang et al. 2023) | Skill library grows with reflection | Reflexion adds memories; Voyager adds reusable code skills |
| Constitutional AI (Bai et al. 2022) | Self-critique for alignment | Different goal (safety), same mechanism (self-evaluation + revision) |

The broader principle: Reflexion is an instance of in-context learning from feedback: the observation that LLMs can improve at a task given structured feedback in the prompt, without any parameter updates. This connects to earlier work on few-shot learning (Brown et al. 2020) and tool use (Schick et al. 2023): the model's in-context flexibility is an underexplored axis of capability.

Key limitations:

  • Context window ceiling: once the reflection history fills the window, older reflections must be dropped or summarized, so earlier lessons can be lost.
  • Hallucinated diagnoses: the model may misidentify why it failed, storing a wrong correction that hurts future trials.
  • No cross-task transfer: starting a new task clears all memories.
  • Evaluation dependence: Reflexion needs a clear success signal, so it cannot improve on tasks with ambiguous outcomes.
  • Deterministic failure: if the task is genuinely beyond the model's capabilities, repeated reflection won't help.

Go Deeper

  • ReAct (Yao et al. 2022): the inner loop Reflexion wraps
  • Self-Refine (Madaan et al. 2023): iterative critique of a single output
  • LATS (Zhou et al. 2023): tree search over agent trajectories

Key Paper

Shinn, Cassano, Labash, Gopalan, Narasimhan, Yao. "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS 2023. arXiv:2303.11366

"The ability to learn from failure in language — without updating a single weight — is perhaps the most human thing a language model has ever done."