Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao — 2022

ReAct: Synergizing Reasoning and Acting

Language models that think before they act — and act to think better. ReAct interleaves internal reasoning traces with external tool use, creating a feedback loop that beats both pure reasoning and pure tool use alone.

Prerequisites: What an LLM is + Basic prompt engineering intuition. No implementation experience required.
8 chapters · 4+ simulations · arXiv:2210.03629

Chapter 0: The Problem

Ask an LLM: "Was the 2004 Summer Olympics held in a city that also hosted a previous Olympics?" The model needs to know where the 2004 Olympics were held (Athens), whether Athens hosted before (1896), and combine those facts correctly. If its training data is slightly off, it will confidently give a wrong answer with no way to self-correct.

Now try the other extreme: a pure tool-using agent. It gets a Wikipedia search tool. It searches "2004 Summer Olympics," reads the page, searches "Athens Olympics history," reads another page, maybe searches a third thing. No reasoning between steps. It either finds the answer or gets lost in irrelevant documents. There's no internal deliberation to guide which tool to call next.

Two failure modes that cancel out: Pure reasoning (Chain-of-Thought) can be confidently wrong — it hallucinates facts it doesn't have. Pure acting (tool-calling without reasoning) is reactive and unguided — it doesn't know when it has enough information or when it's going in circles. Neither alone is robust. ReAct combines them: reason before acting (to plan the right action) and act before reasoning (to ground thoughts in real evidence).

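The specific tasks where this matters most are the ones the paper evaluates: multi-hop question answering (HotpotQA), fact verification (FEVER), text-based decision making (ALFWorld), and web shopping navigation (WebShop).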

These tasks are hard because they require maintaining a coherent goal across many steps, knowing when to search for more information, and being able to revise a plan when new evidence contradicts earlier assumptions. Pure reasoning fails at the first; pure acting fails at the second and third.

Reasoning vs Acting Failure Modes

[Interactive simulation: two agents, one reasoning-only and one acting-only, step through the question "In what year did the director of 'Bambi' also direct 'Pinocchio'?" to show where each goes wrong.]

Why does Chain-of-Thought (reasoning only) fail on multi-hop factual questions?

Chapter 1: The ReAct Loop

ReAct produces a trajectory of interleaved thoughts and actions. A thought is a free-text reasoning trace — the model thinking out loud. An action is a call to an external tool (search, lookup, finish). An observation is the tool's response, fed back into the context. Then another thought, then another action, until the task is complete.

Thought: "I need to find who directed Bambi. Let me search for Bambi film."
Action: Search[Bambi (film)]
Observation: "Bambi is a 1942 animated film directed by David Hand..."

Thought: "David Hand directed Bambi. Now I need to know if he also directed Pinocchio."
Action: Search[Pinocchio (1940 film)]
Observation: "Pinocchio was directed by Ben Sharpsteen and Hamilton Luske..."

Thought: "David Hand did NOT direct Pinocchio. So there's no year. The premise is wrong."
Action: Finish[The premise is incorrect: David Hand directed Bambi but not Pinocchio]

The key insight: Thoughts create a working memory. Between tool calls, the model can record what it has learned, what it needs next, and what to do if the next search fails. This is like a detective's notebook — each thought updates the case theory; each action gathers evidence. Pure acting has no notebook. Pure reasoning has no new evidence to write in it.

The action space in the original ReAct paper (Wikipedia-based tasks) has three actions: Search[entity] returns the first five sentences of the entity's Wikipedia page, or suggests the top five similar entities if no such page exists; Lookup[string] returns the next sentence on the current page that contains the string, much like Ctrl+F in a browser; and Finish[answer] terminates with an answer. Thoughts are unconstrained free text. The model decides everything in natural language: when to think, when to act, and what to say in each.
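The three actions can be sketched as a small environment class. This is a minimal illustration, not the paper's code: the class name `WikiEnv`, the toy `PAGES` dictionary (standing in for the real Wikipedia API), and the naive sentence splitting are all assumptions made for the example. Search returns up to the first five sentences of a page, and Lookup steps through the sentences that contain a term.

```python
import re

# Toy stand-in for Wikipedia: page title -> page text (illustrative data only)
PAGES = {
    "Bambi": "Bambi is a 1942 animated film. It was directed by David Hand. "
             "The film was produced by Walt Disney.",
}


class WikiEnv:
    """Hypothetical three-action environment: Search, Lookup, Finish."""

    def __init__(self, pages):
        self.pages = pages
        self.sentences = []   # sentences of the current page
        self.hits = []        # remaining Lookup matches for the current term
        self.term = None      # term the current Lookup scan is using
        self.answer = None

    def search(self, entity):
        text = self.pages.get(entity)
        if text is None:
            return f"Could not find {entity}."
        # Naive split on whitespace that follows a period
        self.sentences = re.split(r"(?<=\.)\s+", text)
        self.term = None      # a new page resets any Lookup scan
        return " ".join(self.sentences[:5])

    def lookup(self, term):
        if term != self.term:  # new term: restart the scan
            self.term = term
            self.hits = [s for s in self.sentences if term.lower() in s.lower()]
        return self.hits.pop(0) if self.hits else "No more results."

    def finish(self, answer):
        self.answer = answer
        return "Episode finished."
```

Calling `lookup` repeatedly with the same term walks through successive matches, which is what lets the agent scan a long page one sentence at a time.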

What is a "thought" in the ReAct framework, and why does it improve over pure acting?

Chapter 2: Thought → Action → Observation — Interactive

Let's watch ReAct solve a real multi-hop question step by step. The agent must answer: "Were Scott Derrickson and Ed Wood involved in the same genre of films?"

[Interactive walkthrough: ReAct Trace. Each step reveals the next thought (reasoning), action (tool call), or observation (tool response), showing how the thoughts guide the actions and how each observation updates the plan.]

Full data flow: Prompt + few-shot examples + current trajectory → LLM → generates the next thought and action, stopping at a stop sequence so it cannot write its own observation → action is parsed → tool is called → observation is appended to context → repeat. The model is doing in-context learning: it sees examples of (thought, action, observation) chains in the few-shot prompt and generates similar chains for new questions. There is no fine-tuning, only prompting.
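python (sketch)
import re

# Matches Name[argument], e.g. Search[Paris] or Finish[yes]
ACTION_RE = re.compile(r"(\w+)\[(.*?)\]")

def react_agent(question, llm, tools, few_shot_examples="", max_steps=10):
    # `llm` is any object with generate(prompt, stop=[...]) -> str (assumed interface)
    # `tools` maps action names to text-in/text-out callables
    context = few_shot_examples + f"Question: {question}\n"

    for _ in range(max_steps):
        # Model generates the next thought and action, stopping before
        # it can invent its own observation
        output = llm.generate(context, stop=["\nObservation"])

        # Use the last Name[argument] pattern in the output as the action
        matches = ACTION_RE.findall(output)
        if not matches:
            context += output + "\nObservation: Could not parse an action.\n"
            continue
        action_type, arg = matches[-1]

        if action_type == "Finish":
            return arg

        # Execute the action (call the Wikipedia API, etc.)
        tool = tools.get(action_type)
        obs = tool(arg) if tool else f"Unknown action: {action_type}"

        # Append thought + action + observation to the context
        context += output + f"\nObservation: {obs}\n"

    return None  # step budget exhausted without a Finish action
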
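
To watch a loop like the one above run end to end without a model or network access, the sketch below replaces the LLM with a scripted stand-in that replays fixed thought/action text. Everything here is hypothetical scaffolding for illustration: `run_react`, `scripted_llm`, the `SCRIPT` contents, and the one-entry search tool are not part of the paper.

```python
import re


def run_react(question, llm, tools, max_steps=5):
    """Compact ReAct loop; llm(context) returns 'Thought: ...\\nAction: Name[arg]'."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        output = llm(context)
        # Take the last Name[argument] pattern as the action
        name, arg = re.findall(r"(\w+)\[(.*?)\]", output)[-1]
        if name == "Finish":
            return arg, context + output
        context += output + f"\nObservation: {tools[name](arg)}\n"
    return None, context


# Scripted stand-in for the model: replays fixed steps in order
SCRIPT = iter([
    "Thought: I need the director of Bambi.\nAction: Search[Bambi]",
    "Thought: David Hand directed Bambi.\nAction: Finish[David Hand]",
])


def scripted_llm(context):
    return next(SCRIPT)


# Toy search tool backed by a one-entry dictionary
FACTS = {"Bambi": "Bambi is a 1942 film directed by David Hand."}
TOOLS = {"Search": lambda query: FACTS.get(query, "No results.")}

answer, trace = run_react("Who directed Bambi?", scripted_llm, TOOLS)
print(answer)
print(trace)
```

The printed trace shows the observation sitting between the first action and the second thought, which is the context the real model would condition on.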
How does ReAct handle the case where a search returns irrelevant information?

Chapter 3: vs Chain-of-Thought

Chain-of-Thought (CoT) prompting is the most direct baseline: ask the model to "think step by step" before answering. It works well for math problems and logical deductions where the model's training data contains the relevant knowledge. For factual multi-hop questions, CoT's Achilles' heel is hallucination.

CoT + Hallucination = Confident Wrong Answer. CoT generates a coherent chain of steps. But if step 1 states a wrong fact ("The 2004 Olympics were in Sydney"), steps 2-5 build logically on that wrong premise and produce a confident, wrong answer. The reasoning is internally consistent but factually grounded in nothing. The model cannot tell the difference between a recalled fact and a confabulated one.

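ReAct addresses this with two mechanisms: factual grounding, because each fact comes from a live tool call rather than from recalled training data, and self-correction, because each observation is fed back into the context and can contradict the thought that preceded it.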

| Property | Chain-of-Thought | ReAct |
|---|---|---|
| Factual grounding | Training data only | Live tool calls |
| Hallucination risk | High on obscure facts | Low (verified at each step) |
| Self-correction | None (one-shot) | Yes (observation feedback) |
| Transparency | Yes (reasoning visible) | Yes (full trace) |
| Reasoning capability | Full LLM reasoning | Full LLM reasoning |
| Works offline | Yes | No (needs tools) |

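ReAct can also be combined with CoT. The paper pairs ReAct with CoT self-consistency (CoT-SC, a majority vote over several sampled reasoning chains) using two back-off heuristics. CoT-SC → ReAct answers with CoT-SC when the model is confident, meaning a majority of the samples agree, and switches to ReAct when they do not. ReAct → CoT-SC runs ReAct first and falls back to CoT-SC when ReAct fails to return an answer within its step budget. These combinations give the best results on HotpotQA and FEVER.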
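
One way to make "confident" concrete is majority agreement among several sampled CoT answers, falling back to ReAct when no answer wins at least half the votes. The sketch below illustrates that switching rule only. The callables `sample_cot` and `run_react` are hypothetical placeholders for LLM-backed components, and the default `n=5` is an arbitrary choice for the example.

```python
from collections import Counter


def cot_sc_then_react(question, sample_cot, run_react, n=5):
    """Answer with CoT self-consistency; back off to ReAct when votes are split.

    sample_cot(question) -> one sampled CoT answer (hypothetical callable)
    run_react(question)  -> answer from a ReAct episode (hypothetical callable)
    """
    answers = [sample_cot(question) for _ in range(n)]
    top_answer, votes = Counter(answers).most_common(1)[0]
    # A top answer with fewer than n/2 votes is treated as low confidence
    # in the model's internal knowledge, so we go and look things up instead
    if votes < n / 2:
        return run_react(question), "react"
    return top_answer, "cot-sc"
```

The second element of the return value records which path produced the answer, which is handy when measuring how often the fallback fires.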

What is Chain-of-Thought's key failure mode on factual multi-hop questions?

Chapter 4: vs Act-Only

The other baseline: a pure tool-using agent with no reasoning traces. For each step, the model just outputs an action directly. Call it Act-only. It can call the same tools as ReAct but produces no thoughts between observations.

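Act-only's failure modes are the mirror image of CoT's: it has access to real evidence but no explicit plan. It repeats actions without noticing, loses track of sub-goals, and cannot tell when it has gathered enough information to answer.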

Thoughts as working memory. The thought steps in ReAct serve the same function as a scratch pad in a math problem: they externalize the agent's current state of understanding so that future reasoning steps can build on it. Act-only has no scratch pad. Each action is chosen with only the raw observations in context — no digest, no synthesis, no plan.

| Property | Act-Only | ReAct |
|---|---|---|
| External knowledge | Yes (tools) | Yes (tools) |
| Planning between steps | Implicit only | Explicit thought trace |
| Loop detection | No | Yes (thought records what was tried) |
| Sub-goal tracking | Fragile | Explicit in thought steps |
| Interpretability | Low (opaque action sequence) | High (full reasoning visible) |
| Token cost | Lower | Higher (thoughts add tokens) |

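
The difference between the two formats is easy to see by rendering the same steps with and without thoughts. The sketch below is illustrative: `format_trajectory` is a hypothetical helper, and the whitespace word count is only a rough proxy for tokens, not a real tokenizer.

```python
def format_trajectory(steps, with_thoughts=True):
    """Render (thought, action, observation) steps as prompt text."""
    lines = []
    for i, (thought, action, observation) in enumerate(steps, start=1):
        if with_thoughts:
            lines.append(f"Thought {i}: {thought}")
        lines.append(f"Action {i}: {action}")
        lines.append(f"Observation {i}: {observation}")
    return "\n".join(lines)


steps = [
    ("I need to find who directed Bambi.", "Search[Bambi (film)]",
     "Bambi is a 1942 animated film directed by David Hand..."),
    ("David Hand directed Bambi. Now check Pinocchio.",
     "Search[Pinocchio (1940 film)]",
     "Pinocchio was directed by Ben Sharpsteen and Hamilton Luske..."),
]

react_text = format_trajectory(steps, with_thoughts=True)
act_text = format_trajectory(steps, with_thoughts=False)

# Rough cost of the scratch pad, measured in whitespace-separated words
extra_words = len(react_text.split()) - len(act_text.split())
print(react_text)
print(f"Extra words from thoughts: {extra_words}")
```

The Act-only rendering contains the same actions and observations, but nothing records why the second search was issued.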
What is Act-only's most common failure mode on multi-hop tasks?

Chapter 5: Tool Integration

ReAct is tool-agnostic. The paper demonstrates it on Wikipedia search, but the pattern applies to any tool with a text input and text output: calculators, code interpreters, database queries, web browsers, calendar APIs, sensor readings. The model generates an action in natural language; a parser routes it to the right tool; the tool returns an observation in natural language; the observation goes back into context.
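
A minimal sketch of this text-in, text-out contract follows. The tool names (`Calculate`, `WordCount`), the `TOOLS` registry, and the `dispatch` helper are assumptions made for the example, not tools from the paper. Errors come back as observation strings rather than exceptions, so the model can read them and adjust.

```python
import ast
import operator

# Arithmetic operators the toy calculator accepts
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expression: str) -> str:
    """Evaluate +, -, *, / arithmetic without eval() and return text."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    try:
        return str(ev(ast.parse(expression, mode="eval").body))
    except (ValueError, SyntaxError, ZeroDivisionError) as exc:
        return f"Error: {exc}"


def word_count(text: str) -> str:
    return str(len(text.split()))


# Registry: action name -> text-in/text-out callable
TOOLS = {"Calculate": calculator, "WordCount": word_count}


def dispatch(action_name: str, argument: str) -> str:
    """Route a parsed action to its tool; report unknown tools as text."""
    tool = TOOLS.get(action_name)
    if tool is None:
        return f"Error: unknown tool '{action_name}'. Available: {', '.join(TOOLS)}"
    return tool(argument)
```

Adding a tool under this design means adding one entry to the registry and one example of its use to the few-shot prompt.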

The Original Tool Suite (Wikipedia Tasks)

Why so minimal a tool suite? The paper deliberately uses only 3 simple tools to prove that the ReAct pattern's power comes from the reasoning-acting interleaving, not from tool richness. Modern agentic systems (GPT-4 with tools, Claude with computer use, Gemini with extensions) extend this with dozens of tools, but the core loop is identical: Thought → Action[tool, arg] → Observation → Thought...

Tool Design for ReAct

Good tools for ReAct agents have three properties:

Action Parser

An action string from the LLM is parsed into a (tool, argument) pair. For example, Search[Bambi (film)] becomes the tool Search and the argument Bambi (film).
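
A parser for the Name[argument] format can be sketched with one regular expression. The function name `parse_action` and the choice to return `None` on malformed input are assumptions for illustration. A real agent would decide separately how to report a parse failure back to the model.

```python
import re
from typing import Optional, Tuple

# Name[argument]: a word, then everything up to the final closing bracket
ACTION_PATTERN = re.compile(r"^\s*(\w+)\[(.*)\]\s*$", re.DOTALL)


def parse_action(action: str) -> Optional[Tuple[str, str]]:
    """Split 'Search[Bambi (film)]' into ('Search', 'Bambi (film)')."""
    match = ACTION_PATTERN.match(action)
    if match is None:
        return None  # malformed action, e.g. missing brackets
    return match.group(1), match.group(2).strip()


print(parse_action("Search[Bambi (film)]"))
print(parse_action("search Bambi"))
```

The greedy `(.*)` means the argument may itself contain brackets, since the match extends to the last `]` in the string.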

What is the minimum requirement for a tool to be usable in the ReAct framework?

Chapter 6: Results

ReAct was evaluated on four benchmarks using PaLM-540B and GPT-3 with few-shot prompting. No fine-tuning — just in-context examples of the thought-action-observation pattern.

| Task | Dataset | Metric | CoT | Act | ReAct | Best |
|---|---|---|---|---|---|---|
| Multi-hop QA | HotpotQA | EM | 29.4% | 25.7% | 27.4% | ReAct → CoT-SC: 35.1% |
| Fact verif. | FEVER | Acc. | 56.3% | 58.9% | 60.9% | CoT-SC → ReAct: 64.6% |
| Decision (text) | ALFWorld | Success rate | n/a | 45% | 71% | ReAct: 71% |
| Decision (web) | WebShop | Success rate | n/a | 30.1% | 40.0% | ReAct: 40.0% |

The pattern is clear: ReAct outperforms Act-only on all four benchmarks, with the largest gap on ALFWorld (71% vs 45% success). On HotpotQA, ReAct alone trails CoT (27.4% vs 29.4% EM), but combining ReAct with CoT self-consistency beats both pure methods on HotpotQA and FEVER. On WebShop, ReAct raises the success rate from 30.1% (Act-only) to 40.0%, which the paper reports also exceeds the imitation-learning and reinforcement-learning baselines trained on the task.
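
For reference, the EM (exact match) metric in the HotpotQA row counts a prediction as correct only if it equals the gold answer after light normalization. The sketch below follows the common SQuAD-style convention (lowercase, strip punctuation and articles, collapse whitespace). That the benchmark's official scorer matches this exactly is an assumption, and the function names are illustrative.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize_answer(prediction) == normalize_answer(gold)


def em_score(predictions, golds) -> float:
    """Percentage of predictions that exactly match their gold answer."""
    matches = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * matches / len(golds)
```

Because EM gives no partial credit, an answer that is right but phrased differently from the label still scores zero.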

Qualitative finding (error analysis): The paper manually labeled 50 correct and 50 incorrect HotpotQA trajectories from each method. Among CoT's failures, 56% were hallucinations (a made-up fact or reasoning step), 16% were reasoning errors, and 28% were label ambiguity (a reasonable answer that did not exactly match the label). Among ReAct's failures, 47% were reasoning errors (including getting stuck repeating the same steps), 23% were search result errors (the search returned empty or unhelpful content), 29% were label ambiguity, and 0% were hallucinations. ReAct virtually eliminates hallucination errors; its characteristic failures are reasoning loops and uninformative search results.

What failure mode does ReAct nearly eliminate compared to Chain-of-Thought?

Chapter 7: Connections

ReAct is one of the foundational papers in what is now called agentic AI — LLMs that act in the world, not just generate text. Understanding its lineage and descendants maps the field.

| Method | Key Idea | vs ReAct |
|---|---|---|
| CoT (Wei et al. 2022) | Reason step by step | ReAct adds actions to CoT's reasoning |
| ReAct (this paper) | Interleave thought + action | n/a |
| Reflexion (Shinn et al. 2023) | Reflect on failure, store in memory | ReAct per episode; Reflexion across episodes |
| Toolformer (Schick et al. 2023) | Fine-tune to call APIs | ReAct: prompting only; Toolformer: weights updated |
| Tree of Thoughts (Yao et al. 2023) | Branch and prune reasoning paths | ReAct: linear trace; ToT: tree search over thoughts |
| SWE-agent (Yang et al. 2024) | ReAct loop for code editing | Direct descendant with specialized tools |

ReAct → Reflexion: ReAct operates within a single episode. Reflexion (next paper) adds an outer loop: after a ReAct episode fails, a separate reflection step generates a verbal critique ("I looped on the same search too many times") and stores it in persistent memory. The next episode's context includes this critique, allowing the agent to avoid the same mistake. Reflexion is ReAct + episodic memory + verbal RL.
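
The outer loop described above can be sketched in a few lines. This is an illustration of the idea, not Reflexion's actual implementation: `run_episode` and `reflect` are hypothetical placeholders for an LLM-backed ReAct episode and an LLM-backed critique step, and `max_trials=3` is an arbitrary choice.

```python
def reflexion_loop(task, run_episode, reflect, max_trials=3):
    """Retry a task, carrying verbal critiques of failed attempts forward.

    run_episode(task, reflections) -> (success: bool, trace: str)
    reflect(task, trace)           -> critique: str
    """
    reflections = []  # persistent memory shared across episodes
    for trial in range(1, max_trials + 1):
        # Each episode sees every critique written so far
        success, trace = run_episode(task, reflections)
        if success:
            return True, trial, reflections
        # On failure, store a verbal critique for the next attempt
        reflections.append(reflect(task, trace))
    return False, max_trials, reflections
```

No weights change between trials. The only thing that carries over is the list of critiques placed in the next episode's context.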

Limitations of ReAct:

  1. Token cost: thoughts add roughly 30-50 tokens per step, so a 10-step ReAct trace costs roughly 300-500 extra tokens.
  2. Context length: long traces fill the context window, and earlier observations eventually have to be dropped.
  3. Parsing fragility: if the model generates a malformed action ("Srch[X]" instead of "Search[X]"), a strict parser rejects it and the trace can break.
  4. No learning: each episode starts fresh; mistakes in one episode are not remembered in the next (Reflexion fixes this).
  5. Tool dependency: degraded or unavailable tools cause cascade failures the reasoning can't recover from.
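
One possible mitigation for the parsing fragility in (3) is to map a misspelled action name to the closest valid one before dispatching. The sketch below uses the standard library's `difflib.get_close_matches`. The helper name `resolve_action_name` and the 0.7 cutoff are assumptions for the example, and fuzzy matching carries its own risk of silently choosing the wrong tool.

```python
import difflib

VALID_ACTIONS = ["Search", "Lookup", "Finish"]


def resolve_action_name(name, valid=VALID_ACTIONS, cutoff=0.7):
    """Map a possibly misspelled action name to a valid one, or None."""
    # Compare case-insensitively, then return the canonical spelling
    lowered = {v.lower(): v for v in valid}
    close = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[close[0]] if close else None


print(resolve_action_name("Srch"))
print(resolve_action_name("Delete"))
```

When no candidate clears the cutoff, returning `None` lets the agent send an error observation back to the model instead of guessing.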

Go Deeper

  • Reflexion (2023) — verbal reinforcement learning across episodes
  • Chain-of-Thought Prompting (Wei et al. 2022) — the reasoning-only baseline
  • Tree of Thoughts (2023) — branching reasoning with search

Key Paper

Yao et al. "ReAct: Synergizing Reasoning and Acting in Language Models." 2022. arXiv:2210.03629

"Act not just to execute, but to explore. Think not just to plan, but to verify." — the ReAct philosophy