Reflexion: Verbal Reinforcement Learning

Chapter 0: The Problem

Picture an LLM agent playing a text-based game. It tries to pick up a key, but instead it walks into a wall. It tries again, with the same result. The agent has no memory of its failure: each attempt starts fresh, and it makes the same mistake in the same room with the same wrong action. It learns nothing.

This is the fundamental problem with single-episode LLM agents. Each time the task resets, so does the agent's knowledge. A human playing the same game would think: "Last time I walked north when I meant to pick up the key — I should try 'pick up key' directly." The agent cannot do this because it has no persistent memory of what went wrong.

RL's answer and its cost: Standard reinforcement learning addresses this with policy gradient updates. After each failure, you compute a reward signal, back-propagate gradients through billions of parameters, and nudge the weights. That works, but it can cost thousands of GPU-hours per task, requires access to the model's weights and many sampled episodes, and produces a model that can only be interrogated as a black box: you can't ask it why it failed. Reflexion's answer: let the agent write down what went wrong in plain English, store it, and read it next time.

Three tasks make this concrete in the Reflexion paper:

Decision making (AlfWorld): A text-based household environment. "Go to the desk. Pick up the pencil. Put it in the drawer." Many objects, many rooms, many ways to fail.
Reasoning (HotpotQA): Multi-hop factual questions. Wrong search strategy leads to wrong answer. Post-hoc critique identifies which search was too broad or too narrow.
Programming (HumanEval, LeetCode): Write code that passes unit tests. Test results are the feedback signal. Reflection identifies which test failed and why.

Each task provides a natural evaluation: success or failure with a binary signal. Reflexion uses that signal to trigger self-reflection. The agent fails, writes a critique, and tries again — up to a fixed number of trials. It is learning without learning — improving performance without touching a single weight.

[Interactive demo: The No-Memory Problem. An agent that retries a task without memory repeats the same mistakes; with Reflexion memory enabled, each failure improves the next try.]

Why does a standard LLM agent repeat the same mistakes across task attempts?

Each episode starts with a fresh context. Without persistent memory of past failures, the agent has no information about what went wrong and cannot adjust its strategy.
LLMs cannot process feedback signals of any kind.
The task environment resets the agent's weights between episodes.

Chapter 1: Self-Reflection

After a failed attempt, Reflexion asks the agent to do something humans do naturally: think about what went wrong. Given the task, the trajectory (every action and observation from the failed attempt), and the final outcome, the agent produces a verbal reflection — a free-text diagnosis of the failure and a strategy for the next attempt.

The analogy: A chess player who just lost replays the game in their head: "I should have traded bishops earlier. I got too focused on the kingside attack and missed the pawn structure collapsing." The reflection is not a new move — it's an updated mental model that shapes the next game. Reflexion is this, for LLM agents. The "mental model update" is a paragraph of English, stored in memory.

What makes a good reflection? The paper doesn't specify a fixed format — it uses few-shot examples of (trajectory, outcome) → reflection pairs to teach the LLM what a useful critique looks like. Good reflections tend to:

Identify the specific action that caused failure ("I searched for 'climate change causes' too broadly instead of 'CO2 warming mechanism evidence'").
Propose a concrete alternative ("Next time, search for the specific mechanism, not the general topic").
Avoid generic statements ("I should try harder") that carry no actionable information.
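
The few-shot setup can be pictured as a short list of worked examples rendered into a prompt prefix. The sketch below is illustrative only: the ReflectionExample type, its field names, and the rendering format are assumptions made for this example, not the paper's actual prompts.

```python
from dataclasses import dataclass


@dataclass
class ReflectionExample:
    """One worked example of a failed trial and a useful critique (hypothetical format)."""
    trajectory: str
    outcome: str
    reflection: str


def render_few_shot(examples: list[ReflectionExample]) -> str:
    """Render the examples as a text block to prepend to a reflection prompt."""
    blocks = []
    for i, ex in enumerate(examples, start=1):
        blocks.append(
            f"Example {i}\n"
            f"Trajectory: {ex.trajectory}\n"
            f"Outcome: {ex.outcome}\n"
            f"Reflection: {ex.reflection}"
        )
    return "\n\n".join(blocks)


examples = [
    ReflectionExample(
        trajectory="Search['climate change causes'] -> results too general",
        outcome="FAILED: wrong answer",
        reflection=(
            "I searched too broadly. Next time, search for the specific "
            "mechanism, not the general topic."
        ),
    )
]
few_shot_block = render_few_shot(examples)
print(few_shot_block)
```

Keeping the examples as structured data makes it easy to swap in domain-specific critiques (search, code, navigation) without touching the prompt-building logic.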

The reflection is written by the same LLM that performed the task; there is no separate critic model. This tends to work because LLMs have seen enormous amounts of human writing about mistakes, corrections, and strategies, so given a failure they can often identify plausible explanations and better approaches. The diagnosis is not guaranteed to be correct.

reflection = LLM(task ‖ trajectory_failed ‖ outcome ‖ few-shot-examples)

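python (sketch; llm is any object with a generate(prompt) method)
# Illustrative template; the paper's actual prompts differ per task.
REFLECT_TEMPLATE = (
    "{examples}\n\nTask: {task}\nTrajectory:\n{trajectory}\n"
    "Outcome: {outcome}\nReflection:"
)

def format_traj(trajectory):
    # One line per (thought, action, observation) step
    return "\n".join(f"Thought: {t} | Action: {a} | Obs: {o}" for t, a, o in trajectory)

def generate_reflection(task, trajectory, outcome, llm, examples=""):
    # Build the reflection prompt
    prompt = REFLECT_TEMPLATE.format(
        task=task,
        trajectory=format_traj(trajectory),  # all thought/action/obs steps
        outcome=outcome,                     # e.g., "FAILED: wrong answer"
        examples=examples,                   # 1-3 examples of good critiques
    )
    # LLM generates the verbal critique
    return llm.generate(prompt)  # e.g., "I searched too broadly. Next time focus on..."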

What is a verbal reflection in Reflexion, and who generates it?

A free-text diagnosis of why the attempt failed and a strategy for improvement, generated by the same LLM that performed the task, given the full trajectory and outcome as context.
A numerical reward signal computed by a separate evaluator model.
A gradient update generated by back-propagating through the trajectory.

Chapter 2: The Verbal RL Loop

Reflexion wraps any base agent (like ReAct) in an outer loop. The inner loop is the agent solving the task — taking actions, observing outcomes, thinking through steps. The outer loop handles failure: generate a reflection, store it in memory, reset the environment, and try again.

[Interactive demo: Reflexion Loop. A coding task: write a function to find the second largest element. The agent fails, reflects, and improves; each trial's memory panel shows which reflections are carried forward.]

Full data flow:
Trial 1: context = [task description + few-shot examples]. The agent generates a solution and fails the tests; a reflection is generated from (task, full trajectory, test output).
Trial 2: context = [task + few-shot + reflection from trial 1]. The agent generates an improved solution.
Trial N: context includes all previous reflections, summarized if too long.
The LLM's weights never change; only the context grows richer with each trial.

python
# build_context and run_agent are helper functions left undefined in this sketch.
def reflexion_agent(task, env, llm, max_trials=4, few_shot=""):
    memory = []  # persistent verbal memory across trials

    for trial in range(max_trials):
        env.reset()  # reset the environment, not the memory

        # Build context: task + few-shot examples + all past reflections
        context = build_context(task, memory, few_shot)

        # Run inner agent (e.g. ReAct) to completion
        trajectory, success = run_agent(llm, env, context)

        if success:
            return trajectory  # done

        # Generate a verbal reflection from the failed trajectory
        reflection = generate_reflection(task, trajectory, "FAILED", llm)
        memory.append(reflection)  # accumulate in memory

    return None  # max trials reached without success
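
To see the loop run end to end without a real model, here is a self-contained toy version. Everything in it is a stand-in invented for illustration: ToyEnv succeeds only on the action "pick up key", and ToyLLM is a scripted stub that changes its action only when a reflection mentioning that action appears in its prompt. The inner loop is a single step, whereas a real ReAct agent would take many.

```python
class ToyEnv:
    """Hypothetical environment: succeeds only on the action 'pick up key'."""

    def reset(self):
        self.holding_key = False

    def step(self, action):
        self.holding_key = action == "pick up key"
        observation = "You hold the key." if self.holding_key else "You walk into a wall."
        return observation, self.holding_key


class ToyLLM:
    """Scripted stand-in for a real model (not an actual LLM API)."""

    def generate(self, prompt):
        if prompt.startswith("REFLECT"):
            return "Going north hit a wall. Next time use 'pick up key' directly."
        if "pick up key" in prompt:  # a past reflection is in the context
            return "pick up key"
        return "go north"


def build_context(task, memory):
    reflections = "\n".join(f"- {r}" for r in memory)
    return f"ACT\nTask: {task}\nReflections from earlier trials:\n{reflections}"


def run_agent(llm, env, context):
    """Single-step inner loop returning (trajectory, success)."""
    action = llm.generate(context)
    observation, success = env.step(action)
    return [(action, observation)], success


def reflexion_loop(task, env, llm, max_trials=4):
    memory = []  # verbal reflections persist across trials
    for trial in range(1, max_trials + 1):
        env.reset()  # the environment starts fresh each trial
        trajectory, success = run_agent(llm, env, build_context(task, memory))
        if success:
            return trial, trajectory, memory
        prompt = f"REFLECT\nTask: {task}\nTrajectory: {trajectory}\nOutcome: FAILED"
        memory.append(llm.generate(prompt))
    return None, None, memory


trials_used, trajectory, memory = reflexion_loop(
    "Get the key in this room.", ToyEnv(), ToyLLM()
)
print(trials_used, trajectory, memory)
```

The first trial fails because the context holds no reflections; the second succeeds only because the stored critique is now part of the prompt. No parameter of ToyLLM changes between trials.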

What is the key difference between what resets between trials and what persists?

The environment resets (task starts fresh) but the memory of verbal reflections persists: each trial sees all previous failure critiques in its context.
Both the environment and the memory reset: each trial is completely independent.
The memory resets but the model's weights are updated to reflect the failure.

Chapter 3: Memory

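This chapter distinguishes three types of memory: short-term, episodic, and long-term. A note on terminology: the paper itself speaks of short-term memory (the current trajectory) and long-term memory (the stored reflections, kept in an episodic memory buffer); here, "long-term memory" refers instead to the frozen model weights. Understanding the distinction helps clarify both how Reflexion works and where it can fail.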

Short-term memory is the agent's context window during a single trial — the current task description, the few-shot examples, and the accumulated thought/action/observation trace. This is what any LLM agent has. It's erased when the trial ends.

Episodic memory is the log of verbal reflections from past trials. This is Reflexion's key addition. After each failed trial, the reflection is appended to an episodic memory store. The next trial's context begins with this memory: "In a previous attempt, I failed because X. My strategy this time is Y." Episodic memory is what makes Reflexion more than a single-shot agent.

Long-term memory is the LLM's pre-trained weights — everything it knows from training. Reflexion never modifies this. Long-term memory is fixed; episodic memory is the learning signal.
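
The lifetimes described above can be made concrete with a small container. This is a sketch under assumptions: the AgentMemory class and its method names are invented for illustration. Long-term memory (the weights) is deliberately absent, since nothing in the loop ever touches it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentMemory:
    """Hypothetical container for the two text-based memories."""
    short_term: list[str] = field(default_factory=list)  # current trial's trace
    episodic: list[str] = field(default_factory=list)    # reflections across trials

    def record_step(self, step: str) -> None:
        self.short_term.append(step)

    def end_trial(self, reflection: Optional[str] = None) -> None:
        """Store the reflection (if the trial failed), then erase the trial trace."""
        if reflection is not None:
            self.episodic.append(reflection)
        self.short_term.clear()


mem = AgentMemory()
mem.record_step("Action: go north | Obs: wall")
mem.end_trial("Going north hit a wall; use 'pick up key' instead.")
mem.record_step("Action: pick up key | Obs: holding key")
print(len(mem.short_term), len(mem.episodic))
```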

Memory Limits and What to Do About Them

The context window is finite. With long reflections and many trials, the episodic memory can overflow. Reflexion handles this with a simple policy: keep the most recent reflections and drop older ones (the paper caps the store at a small number of reflections, usually 1 to 3). Alternatively, a summarization step can compress multiple reflections into one.
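
The keep-the-most-recent policy can be sketched with a bounded queue. Python's collections.deque drops the oldest item automatically once maxlen is reached; the helper name and the cap of 3 are assumptions chosen for this example.

```python
from collections import deque


def make_reflection_memory(max_reflections: int = 3) -> deque:
    """Bounded store: appending beyond the cap silently evicts the oldest entry."""
    return deque(maxlen=max_reflections)


memory = make_reflection_memory(max_reflections=3)
for i in range(1, 6):
    memory.append(f"reflection from trial {i}")

print(list(memory))
```

After five failed trials only the reflections from trials 3, 4, and 5 remain, so the context cost of memory stays constant no matter how many trials run.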

Memory Type	What It Stores	Lifetime	Editable?
Short-term	Current trial trajectory	One trial	Via generation
Episodic	Verbal reflections from past trials	All trials	Via appending
Long-term	Pre-trained LLM weights	Permanent	No (Reflexion doesn't fine-tune)

The episodic memory architecture also explains why Reflexion scales gracefully: storing a reflection costs only context tokens (plus the one LLM call that wrote it), with no training compute. Adding a third trial's reflection is as cheap as adding a sentence to a document. Compare this to gradient-based RL, where learning from each additional episode requires forward and backward passes through the model.

[Interactive demo: Memory Accumulation. Shows how episodic memory grows across trials. Each row is one trial's context; the memory section shows the reflections carried forward, so each trial sees all prior critiques.]

What is episodic memory in Reflexion, and how does it differ from the model's long-term memory?

Episodic memory is the accumulation of verbal reflections from past failed trials, stored as text in the context. Long-term memory is the pre-trained LLM weights, which Reflexion never modifies.
Episodic memory stores the entire action history as vectors; long-term memory stores the reflections as weights.
Both episodic and long-term memory are stored as model weights; Reflexion fine-tunes the model after each trial.

Chapter 4: vs Fine-Tuning

The natural competitor for Reflexion is fine-tuning: update the model's weights using the failed trajectory as a negative training signal (or the successful trajectory as a positive one). This is the standard RL approach. Reflexion deliberately avoids this — why?

Three reasons Reflexion doesn't fine-tune: First, access: you often don't have access to the model's weights. Models such as GPT-4, Claude, and Gemini are served through APIs, so running your own gradient updates on them is not an option. Second, cost: one fine-tuning run on a modern LLM costs orders of magnitude more than generating a paragraph of reflection text. Third, catastrophic forgetting: fine-tuning on specific task failures can degrade the model's performance on other tasks. Reflexion's context-only approach cannot cause forgetting, because the weights never change.

But Reflexion has its own failure modes compared to fine-tuning:

Context window limit: After enough trials, the reflections fill the context. Fine-tuning encodes knowledge in weights, which don't have this limit.
Cross-task transfer: Reflexion's reflections are task-specific. Fine-tuning on 1000 similar tasks builds general knowledge that transfers. Reflexion restarts from scratch on a new task.
Hallucinated reflections: The LLM might generate an incorrect diagnosis of why it failed. Fine-tuning uses actual gradient signals — the loss function is an honest measure of error.

Property	Fine-Tuning	Reflexion
Requires model weights	Yes	No (API-only works)
Compute cost per trial	Very high (full backward pass)	Low (text generation)
Knowledge persistence	Permanent (in weights)	Context-window-limited
Cross-task transfer	Yes (if trained on many tasks)	No (per-task memory)
Risk of forgetting	Yes (catastrophic forgetting)	None
Interpretability	Black box weights	Human-readable reflections

The sweet spot: Reflexion is best for tasks where you need to improve quickly with a small number of trials, using an API-only model, on a specific task instance. Fine-tuning is better when you need general improvement across a class of tasks and have the compute budget and model access. In practice, the two can be combined: fine-tune with successful Reflexion trajectories as positive examples.
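
Combining the two approaches starts with harvesting the successful trials. The sketch below assumes a hypothetical log format (dicts with context, trajectory, and success keys) and a generic prompt/completion JSONL layout; neither comes from the paper, and real fine-tuning services each define their own schema.

```python
import json


def to_sft_records(trial_logs):
    """Keep only successful trials and convert them to prompt/completion pairs."""
    records = []
    for log in trial_logs:
        if log["success"]:
            records.append({"prompt": log["context"], "completion": log["trajectory"]})
    return records


def to_jsonl(records) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)


trial_logs = [
    {"context": "Task: get the key", "trajectory": "go north", "success": False},
    {
        "context": "Task: get the key\nReflection: use 'pick up key'",
        "trajectory": "pick up key",
        "success": True,
    },
]
records = to_sft_records(trial_logs)
print(to_jsonl(records))
```

Whether to keep the reflection inside the training prompt is a design choice: keeping it teaches the model to use reflections, while stripping it would instead try to bake the corrected behavior into the weights.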

What is the most practical advantage of Reflexion over fine-tuning for most users?

Reflexion works with API-only models where the weights are inaccessible, costs only the price of text generation, and risks no catastrophic forgetting.
Reflexion always outperforms fine-tuning in accuracy across all tasks.
Reflexion's cross-task transfer is better than fine-tuning because reflections are general-purpose.

Chapter 5: Applications

Reflexion was evaluated on three very different task families. The diversity is the point: verbal self-reflection is a general mechanism that works wherever (1) there is a clear success/failure signal and (2) the failure modes can be diagnosed in words.

Sequential Decision Making: AlfWorld

AlfWorld is a text-based household environment. The evaluation uses 134 test tasks drawn from 6 task types: pick and place, pick two and place, examine in light, clean, heat, and cool. The agent must navigate rooms, interact with objects, and achieve multi-step goals in natural language. Failure modes include going to the wrong room, using the wrong object, and acting in the wrong sequence.

Reflexion's reflections on AlfWorld failures look like: "I picked up the pan before heating it, but the task requires heating a pan already in the microwave. I should first identify which pan is already placed, then operate the appliance." This kind of sequential reasoning correction transfers directly to the next attempt.

Reasoning: HotpotQA

HotpotQA requires multi-hop reasoning over Wikipedia. Failure mode: searching too broadly or in the wrong direction. Reflexion reflects: "My search for 'French revolution leaders' was too broad. I need the specific person who fell from power on 9 Thermidor. Searching 'Robespierre 9 Thermidor' directly would have worked." The next trial uses a targeted search strategy.
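
The success signal for a question-answering trial is commonly an exact-match (EM) check after light normalization. The sketch below follows the widely used SQuAD-style recipe (lowercase, strip punctuation and articles, collapse whitespace); treat it as an assumption about the scoring, since the paper's own evaluation script may differ in detail.

```python
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """Binary signal: True only if the normalized strings are identical."""
    return normalize_answer(prediction) == normalize_answer(gold)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(exact_match("Maximilien Robespierre", "Robespierre"))
```

The second call returns False even though the answer is arguably right, which illustrates how strict a binary EM signal is: a trial can be marked as failed, and trigger a reflection, over a formatting difference.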

Programming: HumanEval + LeetCode

Code generation is perhaps the most natural fit. Unit tests provide a precise, binary signal. When a test fails, the failure message contains the exact input, expected output, and actual output. A reflection might say: "My solution fails for empty lists because I don't check for the edge case before calling max(). I need to add an early return for len(nums) == 0." The next attempt includes this check.
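
A minimal sketch of how test results can become text feedback for the reflection step. The run_tests harness and the two candidate functions are hypothetical, written to mirror the empty-list example above; they are not the paper's evaluation code.

```python
def run_tests(func, test_cases):
    """Run func on each (args, expected) pair and collect readable failure messages."""
    failures = []
    for args, expected in test_cases:
        try:
            actual = func(*args)
        except Exception as exc:  # a crash is feedback too
            failures.append(
                f"input={args!r} expected={expected!r} raised {type(exc).__name__}: {exc}"
            )
            continue
        if actual != expected:
            failures.append(f"input={args!r} expected={expected!r} actual={actual!r}")
    return failures


def largest_v1(nums):
    return max(nums)  # raises ValueError on an empty list


def largest_v2(nums):
    if len(nums) == 0:  # the early return the reflection asks for
        return None
    return max(nums)


tests = [(([3, 1, 2],), 3), (([],), None)]
print(run_tests(largest_v1, tests))  # one failure message for the empty-list case
print(run_tests(largest_v2, tests))  # no failures
```

The failure strings carry the input, the expected value, and what actually happened, which is exactly the material a reflection needs to name the bug rather than offer generic advice.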

Why programming is the hardest test: Coding failures are diverse: off-by-one errors, missing edge cases, wrong algorithm choice, type errors. If verbal self-reflection works here, it's doing something genuinely useful, not just pattern-matching to obvious corrections. The paper reports that Reflexion achieves 91% pass@1 on HumanEval with GPT-4, compared to 80% for GPT-4 alone. That gap of 11 percentage points is what the reflection loop adds.

[Interactive demo: Reflection Quality by Task Type. Example reflection text for each domain. The specificity and actionability of a reflection determine whether it helps the next attempt.]

Why is programming one of the strongest applications of Reflexion?

Unit tests provide a precise failure signal with exact input/output mismatches: the LLM can diagnose specific bugs from the test output and generate actionable corrections, not generic advice.
LLMs are better at code generation than any other task.
Programming tasks always succeed on the second attempt after reflection.

Chapter 6: Results

Across all three task families, Reflexion shows consistent improvement over the base agent (ReAct or simple prompting) with only 1-4 additional trials.

Task	Metric	Base Agent	Reflexion (trials=4)	Gain (percentage points)
AlfWorld (avg)	Success Rate	71%	97%	+26
HotpotQA	EM Score	30%	37%	+7
HumanEval	pass@1 (GPT-4)	80%	91%	+11
LeetCode (easy)	pass@1	40%	57%	+17
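
The gains in the table are absolute differences in percentage points, which is not the same as relative improvement. The short calculation below recomputes both from the table's own numbers; the gains helper is just an illustration.

```python
# (base agent %, Reflexion %) copied from the table above
results = {
    "AlfWorld": (71, 97),
    "HotpotQA": (30, 37),
    "HumanEval": (80, 91),
    "LeetCode (easy)": (40, 57),
}


def gains(base, improved):
    """Return (absolute gain in points, relative gain in percent of the base)."""
    absolute = improved - base
    relative = round(100 * absolute / base, 1)
    return absolute, relative


for task, (base, improved) in results.items():
    absolute, relative = gains(base, improved)
    print(f"{task}: +{absolute} points absolute, +{relative}% relative")
```

A low baseline inflates the relative figure: the LeetCode row gains 17 points but 42.5% relative to its base of 40%.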

The AlfWorld improvement is the most striking: a 26-point absolute gain from verbal self-reflection alone, across 134 household tasks. The paper attributes this to AlfWorld's structured failure modes: the same types of mistakes (going to the wrong room, not recognizing synonyms for objects) recur across tasks. Reflexion identifies these patterns and eliminates them.

Sample efficiency: Most of Reflexion's gain comes from the first reflection (trial 1 → trial 2). By trial 4, returns diminish. This suggests verbal self-reflection is most valuable for identifying clear, correctable mistakes — not for systematically improving on hard problems. For the hard tail of tasks that Reflexion still fails on after 4 trials, a different approach (more compute, better tools, fine-tuning) is needed.

Failure analysis from the paper: tasks where Reflexion fails tend to involve hallucinated reflections ("I think I failed because X" where X is wrong), multi-step compounding errors where no single reflection captures the root cause, and tasks requiring genuine knowledge the model simply doesn't have. Verbal reflection can correct strategy; it cannot supply missing knowledge.

[Interactive chart: Trial-by-Trial Success Rate. Success rate across trials for each task domain. Most of the gain comes from the first reflection.]

When does Reflexion fail to improve performance, even after multiple trials?

When the LLM generates an incorrect diagnosis of the failure (hallucinated reflection), or when the task requires factual knowledge the model doesn't have. Reflection can fix strategy but not supply missing knowledge.
When the task uses more than 3 hops of reasoning.
Reflexion always improves performance given enough trials.

Chapter 7: Connections

Reflexion sits at the intersection of RL theory, agentic AI, and cognitive science. Its descendants and relatives reveal how the field has been building toward language-native learning.

Method	Key Idea	vs Reflexion
ReAct (Yao 2022)	Interleave thought + action	Reflexion's inner loop; Reflexion adds outer loop + memory
Reflexion (this paper)	Verbal critique → episodic memory → next trial	—
Self-Refine (Madaan 2023)	Critique and refine a single output iteratively	Single-shot refinement; Reflexion uses environment feedback across episodes
LATS (Zhou 2023)	Monte Carlo tree search over agent trajectories	Reflexion: linear retry. LATS: tree search with backtracking.
Voyager (Wang 2023)	Skill library grows with reflection	Reflexion adds memories; Voyager adds reusable code skills
Constitutional AI (Bai 2022)	Self-critique for alignment	Different goal (safety), same mechanism (self-evaluation + revision)

The broader principle: Reflexion is an instance of in-context learning from feedback — the observation that LLMs can improve at a task within a single context window given structured feedback, without any parameter updates. This connects to earlier work on few-shot learning (Brown et al. 2020) and tool use (Schick 2023): the model's in-context flexibility is an underexplored axis of capability.

Key limitations: (1) Context window ceiling: once the reflection history fills the window, older reflections must be dropped or summarized, so the amount the agent can retain is capped. (2) Hallucinated diagnoses: the model may misidentify why it failed, storing a wrong correction that hurts future trials. (3) No cross-task transfer: starting a new task clears all memories. (4) Evaluation dependence: Reflexion needs a clear success signal; it cannot improve on tasks with ambiguous outcomes. (5) Deterministic failure: if the task is genuinely beyond the model's capabilities, repeated reflection won't help.

Go Deeper

ReAct (Yao et al. 2022): the inner loop Reflexion wraps
Self-Refine (Madaan et al. 2023) — single-shot iterative critique
LATS (Zhou et al. 2023) — tree search over agent trajectories

Key Paper

Shinn, Cassano, Labash, Gopalan, Narasimhan, Yao. "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS 2023. arXiv:2303.11366

"The ability to learn from failure in language — without updating a single weight — is perhaps the most human thing a language model has ever done."