ReAct — Veanors

Chapter 0: The Problem

Ask an LLM: "Was the 2004 Summer Olympics held in a city that also hosted a previous Olympics?" The model needs to know where the 2004 Olympics were held (Athens), whether Athens hosted before (1896), and combine those facts correctly. If its training data is slightly off, it will confidently give a wrong answer with no way to self-correct.

Now try the other extreme: a pure tool-using agent. It gets a Wikipedia search tool. It searches "2004 Summer Olympics," reads the page, searches "Athens Olympics history," reads another page, maybe searches a third thing. No reasoning between steps. It either finds the answer or gets lost in irrelevant documents. There's no internal deliberation to guide which tool to call next.

Two failure modes that cancel out: Pure reasoning (Chain-of-Thought) can be confidently wrong — it hallucinates facts it doesn't have. Pure acting (tool-calling without reasoning) is reactive and unguided — it doesn't know when it has enough information or when it's going in circles. Neither alone is robust. ReAct combines them: reason before acting (to plan the right action) and act before reasoning (to ground thoughts in real evidence).

The specific tasks where this matters most:

Multi-hop QA (HotpotQA): Answer questions that require combining 2+ facts from different documents. "Who was the director of the film that starred the wife of the president who served during the Cuban Missile Crisis?"
Fact verification (FEVER): Classify claims as SUPPORTS/REFUTES/NOT ENOUGH INFO using Wikipedia. Claims are often subtly wrong in ways that require precise evidence.
Interactive decision tasks (ALFWorld, WebShop): Navigate environments where decisions depend on observations from previous steps.

These tasks are hard because they require maintaining a coherent goal across many steps, knowing when to search for more information, and being able to revise a plan when new evidence contradicts earlier assumptions. Pure reasoning cannot gather new evidence, so it fails at the second and third; pure acting keeps no explicit record of its goal or of what it has learned, so it fails at the first and often the second.

Reasoning vs Acting Failure Modes

[Interactive demo: two agents, one reasoning-only and one acting-only, try to answer "In what year did the director of 'Bambi' also direct 'Pinocchio'?" and go wrong in different ways.]

Why does Chain-of-Thought (reasoning only) fail on multi-hop factual questions?

It can hallucinate factual premises: the model confidently reasons from incorrect starting assumptions with no way to verify against external knowledge
It cannot perform multi-step logical deductions
It is too slow to run on multi-hop questions

Chapter 1: The ReAct Loop

ReAct produces a trajectory of interleaved thoughts and actions. A thought is a free-text reasoning trace — the model thinking out loud. An action is a call to an external tool (search, lookup, finish). An observation is the tool's response, fed back into the context. Then another thought, then another action, until the task is complete.

Thought

"I need to find who directed Bambi. Let me search for Bambi film."

↓

Action

Search[Bambi (film)]

↓

Observation

"Bambi is a 1942 animated film directed by David Hand..."

↓

Thought

"David Hand directed Bambi. Now I need to know if he also directed Pinocchio."

↓

Action

Search[Pinocchio (1940 film)]

↓

Observation

"Pinocchio was directed by Ben Sharpsteen and Hamilton Luske..."

↓

Thought

"David Hand did NOT direct Pinocchio. So there's no year. The premise is wrong."

↓

Action

Finish[The premise is incorrect — David Hand directed Bambi but not Pinocchio]

The key insight: Thoughts create a working memory. Between tool calls, the model can record what it has learned, what it needs next, and what to do if the next search fails. This is like a detective's notebook — each thought updates the case theory; each action gathers evidence. Pure acting has no notebook. Pure reasoning has no new evidence to write in it.

The action space in the original ReAct paper (Wikipedia-based tasks) is three actions: Search[entity] retrieves the Wikipedia intro for entity; Lookup[term] finds the next occurrence of a term in the current page; and Finish[answer] terminates with an answer. Thoughts are unconstrained free text. The model decides everything in natural language — when to think, when to act, and what to say in each.
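
As a rough sketch of this action space, the three actions can be modeled as methods on a small environment object. The `WikiEnv` class, its in-memory `pages` dictionary, the two-sentence intro length, and the sentence splitting are all illustrative assumptions, not the paper's implementation, which queries the real Wikipedia.

```python
import re


class WikiEnv:
    """Toy stand-in for the Wikipedia environment (hypothetical, in-memory)."""

    def __init__(self, pages):
        self.pages = pages          # title -> full article text
        self.sentences = []         # sentences of the current page
        self.lookup_term = None     # term used by the last Lookup call
        self.lookup_pos = 0         # index where the next Lookup resumes

    def search(self, entity):
        """Search[entity]: open a page and return its opening sentences."""
        text = self.pages.get(entity)
        if text is None:
            return f"Could not find {entity}."
        self.sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        self.lookup_term, self.lookup_pos = None, 0
        return " ".join(self.sentences[:2])   # intro length is an assumption

    def lookup(self, term):
        """Lookup[term]: return the next sentence containing term (like Ctrl+F)."""
        if term != self.lookup_term:          # new term: restart from the top
            self.lookup_term, self.lookup_pos = term, 0
        for i in range(self.lookup_pos, len(self.sentences)):
            if term.lower() in self.sentences[i].lower():
                self.lookup_pos = i + 1
                return self.sentences[i]
        return f"No more results for {term}."

    def finish(self, answer):
        """Finish[answer]: end the episode with the final answer."""
        return answer


env = WikiEnv({
    "Bambi": "Bambi is a 1942 animated film. It was produced by Walt Disney. "
             "The supervising director was David Hand.",
})
print(env.search("Bambi"))
print(env.lookup("director"))
```

Keeping the lookup position on the environment is what lets repeated `Lookup` calls walk through a page one matching sentence at a time.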

What is a "thought" in the ReAct framework, and why does it improve over pure acting?

A thought is a forced pause step that slows down tool calling
A thought is a free-text reasoning trace between actions: it acts as a working memory that records what was learned, decides what to search next, and updates the plan based on new observations
A thought is a special API call that queries a separate reasoning model

Chapter 2: Thought → Action → Observation — Interactive

Let's watch ReAct solve a real multi-hop question step by step. The agent must answer: "Were Scott Derrickson and Ed Wood involved in the same genre of films?"

ReAct Trace: Step-by-Step Walkthrough

[Interactive walkthrough: each step reveals the next thought (reasoning), action (tool call), or observation (tool response), showing how the thoughts guide the actions and how each observation updates the plan. Boxes are color-coded: orange = Thought, teal = Action, blue = Observation.]

Full data flow: prompt + few-shot examples + current trajectory → LLM → the model generates a thought and an action, stopping before it would write an observation itself → the action is parsed → the tool is called → the observation is appended to the context → repeat. The model is doing in-context learning: it sees examples of (thought, action, observation) chains in the few-shot prompt and generates similar chains for new questions. No fine-tuning is involved, just prompting.

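python
import re

def react_agent(question, tools, llm, few_shot_examples, max_steps=10):
    """Run a ReAct loop.

    Assumes `llm` is any object with a generate(prompt, stop=...) method
    and `tools` maps action names to callables that return text.
    """
    context = few_shot_examples + f"Question: {question}\n"

    for _ in range(max_steps):
        # Generate the next thought and action; the stop sequence keeps the
        # model from inventing its own observation.
        output = llm.generate(context, stop=["\nObservation"]).strip()

        # Treat the last Name[argument] in the output as the action.
        actions = re.findall(r"(\w+)\[(.*?)\]", output)
        if not actions:
            obs = "Invalid action. Use Search[...], Lookup[...], or Finish[...]."
        else:
            action_type, arg = actions[-1]  # e.g., ("Search", "Paris")
            if action_type == "Finish":
                return arg
            tool = tools.get(action_type)
            # Unknown tools become an observation the model can react to.
            obs = tool(arg) if tool else f"Unknown action: {action_type}"

        # Append thought + action + observation to the context.
        context += output + f"\nObservation: {obs}\n"

    return None  # step budget exhausted without a Finish action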
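
To make the growing context concrete, the sketch below renders a trajectory as numbered Thought / Action / Observation lines, similar in spirit to a few-shot prompt. The `Step` dataclass and `render_trajectory` helper are hypothetical names introduced for illustration, and the exact prompt layout is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    thought: str
    action: str
    observation: str = ""   # empty until the tool has responded


def render_trajectory(question: str, steps: List[Step]) -> str:
    """Serialize the question and all steps so far into one prompt string."""
    lines = [f"Question: {question}"]
    for i, step in enumerate(steps, start=1):
        lines.append(f"Thought {i}: {step.thought}")
        lines.append(f"Action {i}: {step.action}")
        if step.observation:
            lines.append(f"Observation {i}: {step.observation}")
    return "\n".join(lines) + "\n"


steps = [
    Step("I need to find who directed Bambi.", "Search[Bambi (film)]",
         "Bambi is a 1942 animated film directed by David Hand..."),
    Step("Now I need to check who directed Pinocchio.",
         "Search[Pinocchio (1940 film)]"),
]
prompt = render_trajectory("Did the director of Bambi direct Pinocchio?", steps)
print(prompt)
```

The last step has no observation yet, which is the state of the context right after the model emits an action and before the tool responds.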

How does ReAct handle the case where a search returns irrelevant information?

The model writes a thought acknowledging the observation was not useful, then decides to search for a different term or use the Lookup action to find a relevant part of the page
The observation is discarded and the same search is retried automatically
A separate fact-checker model validates observations before adding them to context

Chapter 3: vs Chain-of-Thought

Chain-of-Thought (CoT) prompting is the most direct baseline: ask the model to "think step by step" before answering. It works well for math problems and logical deductions where the model's training contains the relevant knowledge. For factual multi-hop questions, CoT's Achilles heel is hallucination.

CoT + Hallucination = Confident Wrong Answer. CoT generates a coherent chain of steps. But if step 1 states a wrong fact ("The 2004 Olympics were in Sydney"), steps 2-5 build logically on that wrong premise and produce a confident, wrong answer. The reasoning is internally consistent but factually grounded in nothing. The model cannot tell the difference between a recalled fact and a confabulated one.

ReAct addresses this with two mechanisms:

Grounding: Every factual claim can be verified by a tool call. Instead of "I recall that X was directed by Y," the model calls Search[X] and reads the actual Wikipedia answer.
Revision: When a search contradicts the working hypothesis, the thought step explicitly acknowledges this: "Observation says Y, which contradicts my earlier assumption that Z. Let me reconsider." Pure CoT has no mechanism for this — it committed to its first answer in one shot.

Property	Chain-of-Thought	ReAct
Factual grounding	Training data only	Live tool calls
Hallucination risk	High on obscure facts	Low (verified at each step)
Self-correction	None (one-shot)	Yes (observation feedback)
Transparency	Yes (reasoning visible)	Yes (full trace)
Reasoning capability	Full LLM reasoning	Full LLM reasoning
Works offline	Yes	No (needs tools)

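Interestingly, ReAct can be combined with CoT, specifically with CoT plus self-consistency (CoT-SC), which samples several CoT answers and takes a majority vote. The paper describes two back-off heuristics. CoT-SC → ReAct uses the CoT-SC answer when the model is confident and switches to ReAct when the majority answer wins fewer than half of the sampled votes, a sign that internal knowledge alone is not enough. ReAct → CoT-SC runs ReAct first and falls back to CoT-SC when ReAct fails to return an answer within its step budget. These combinations give the best prompting results the paper reports on HotpotQA and FEVER.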
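
One way to make "confident" concrete, assuming self-consistency sampling: draw several CoT answers, take the majority vote, and fall back to ReAct when the winning answer holds fewer than half of the votes. The sketch below is illustrative only; `sample_cot` and `run_react` are hypothetical callables standing in for real model calls, and the default of five samples is an assumption.

```python
from collections import Counter
from typing import Callable


def cot_then_react(question: str,
                   sample_cot: Callable[[str], str],
                   run_react: Callable[[str], str],
                   n: int = 5) -> str:
    """Answer with majority-voted CoT, backing off to ReAct on weak agreement."""
    answers = [sample_cot(question) for _ in range(n)]
    best, votes = Counter(answers).most_common(1)[0]
    if votes < n / 2:            # no clear majority: internal knowledge is shaky
        return run_react(question)
    return best


# Scripted stand-ins so the sketch runs without a model.
def make_sampler(outputs):
    it = iter(outputs)
    return lambda _question: next(it)


agree = make_sampler(["Athens", "Athens", "Athens", "Sydney", "Athens"])
split = make_sampler(["Athens", "Sydney", "Atlanta", "Beijing", "London"])
react = lambda _question: "Athens (verified by search)"

print(cot_then_react("Where were the 2004 Olympics?", agree, react))
print(cot_then_react("Where were the 2004 Olympics?", split, react))
```

The first call keeps the CoT answer because four of five samples agree; the second hands the question to the tool-using agent because no answer reaches half the votes.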

What is Chain-of-Thought's key failure mode on factual multi-hop questions?

The model hallucinates factual premises and then reasons logically from them, producing confident but wrong answers with no mechanism for self-correction
Chain-of-Thought cannot handle questions with more than two hops
The reasoning chain gets too long for the model's context window

Chapter 4: vs Act-Only

The other baseline: a pure tool-using agent with no reasoning traces. For each step, the model just outputs an action directly. Call it Act-only. It can call the same tools as ReAct but produces no thoughts between observations.

Act-only's failure modes are the mirror image of CoT's:

No plan: Without articulating what it's looking for, the agent tends to call tools in a pattern-matched way ("I got partial info, so I'll search again") rather than deciding based on the task's logical structure.
No memory of intent: After 5 tool calls, the model may have forgotten what sub-question it was trying to answer. The observation from step 5 gets processed in context but without a thought connecting it back to the original question.
Loops: Act-only agents often get stuck in search loops — they search a term, don't find what they want, search a related term, don't find it, search the first term again. Without articulating "I already searched X and it didn't help," they repeat.

Thoughts as working memory. The thought steps in ReAct serve the same function as a scratch pad in a math problem: they externalize the agent's current state of understanding so that future reasoning steps can build on it. Act-only has no scratch pad. Each action is chosen with only the raw observations in context — no digest, no synthesis, no plan.

Property	Act-Only	ReAct
External knowledge	Yes (tools)	Yes (tools)
Planning between steps	Implicit only	Explicit thought trace
Loop detection	No	Yes (thought records what was tried)
Sub-goal tracking	Fragile	Explicit in thought steps
Interpretability	Low (opaque action sequence)	High (full reasoning visible)
Token cost	Lower	Higher (thoughts add tokens)

What is Act-only's most common failure mode on multi-hop tasks?

The agent hallucinates observations that weren't returned by the tool
Without thought traces to track intent and record what was tried, agents loop through redundant searches and lose track of which sub-goal they're pursuing
Act-only agents refuse to call the same tool twice

Chapter 5: Tool Integration

ReAct is tool-agnostic. The paper demonstrates it on Wikipedia search, but the pattern applies to any tool with a text input and text output: calculators, code interpreters, database queries, web browsers, calendar APIs, sensor readings. The model generates an action in natural language; a parser routes it to the right tool; the tool returns an observation in natural language; the observation goes back into context.

The Original Tool Suite (Wikipedia Tasks)

Search[entity] → Returns the first 5 sentences of the entity's Wikipedia page if it exists; otherwise suggests the top 5 similar entities from the Wikipedia search engine
Lookup[keyword] → Returns next sentence containing keyword in the current article (like browser Ctrl+F)
Finish[answer] → Terminates the episode with the final answer string

Why so minimal a tool suite? The paper deliberately uses only 3 simple tools to prove that the ReAct pattern's power comes from the reasoning-acting interleaving, not from tool richness. Modern agentic systems (GPT-4 with tools, Claude with computer use, Gemini with extensions) extend this with dozens of tools, but the core loop is identical: Thought → Action[tool, arg] → Observation → Thought...

Tool Design for ReAct

Good tools for ReAct agents have three properties:

Text in, text out: The model generates and reads natural language. Tools that return structured JSON or binary data need a text serializer.
Deterministic + fast: The agent may call a tool multiple times. Stochastic tools (random responses) confuse the reasoning trace. Slow tools (10+ seconds) stall every step of the loop, and the delay compounds across a multi-step trace, which breaks the user experience at scale.
Graceful failure messages: If Search[X] finds nothing, return "No Wikipedia article found for X" — not an empty string. The thought step needs something to reason about even when the tool fails.
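
These properties can be encouraged with a thin wrapper around each tool. The sketch below is one possible approach, assuming tools are plain Python callables; `as_text_tool` is a hypothetical helper, not part of any ReAct library, and the message wording is an assumption.

```python
import json
from typing import Any, Callable


def as_text_tool(name: str, fn: Callable[[str], Any]) -> Callable[[str], str]:
    """Wrap fn so it always returns a non-empty string observation."""
    def wrapped(arg: str) -> str:
        try:
            result = fn(arg)
        except Exception as exc:                     # the tool itself crashed
            return f"{name}[{arg}] failed: {exc}"
        if result is None or result == "" or result == [] or result == {}:
            return f"No result found for {name}[{arg}]."
        if isinstance(result, str):
            return result
        # Structured output: serialize to text the model can read.
        return json.dumps(result, ensure_ascii=False, sort_keys=True)
    return wrapped


articles = {"Athens": {"country": "Greece", "olympics": [1896, 2004]}}
search = as_text_tool("Search", lambda title: articles.get(title))

print(search("Athens"))
print(search("Atlantis"))
```

Turning exceptions and empty results into sentences means the next thought always has something to reason about, even when the tool fails.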

Action Parser

An action string from the LLM is parsed into a (tool, argument) pair before it is routed to a tool.
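
A minimal parser sketch, assuming actions follow the `Name[argument]` form used throughout this chapter. The regular expression and the `parse_action` name are illustrative assumptions; a production parser would likely need to handle more edge cases.

```python
import re
from typing import Optional, Tuple

# Name[argument] spanning the whole string; the argument may contain brackets.
ACTION_PATTERN = re.compile(r"^\s*(\w+)\[(.*)\]\s*$", re.DOTALL)


def parse_action(text: str) -> Optional[Tuple[str, str]]:
    """Split 'Search[Bambi (film)]' into ('Search', 'Bambi (film)')."""
    match = ACTION_PATTERN.match(text)
    if match is None:
        return None                      # malformed: let the caller decide
    tool, argument = match.groups()
    return tool, argument.strip()


for raw in ["Search[Bambi (film)]", "Finish[yes]", "search Bambi"]:
    print(repr(raw), "->", parse_action(raw))
```

Returning `None` for malformed input, rather than raising, lets the agent loop turn the problem into an observation the model can respond to.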

What is the minimum requirement for a tool to be usable in the ReAct framework?

The tool must accept text input (the action argument) and return text output (the observation), so the model can reason about its response in natural language
The tool must have a formal JSON schema for its arguments
The tool must respond within 100 milliseconds

Chapter 6: Results

ReAct was evaluated on four benchmarks using PaLM-540B and GPT-3 with few-shot prompting. No fine-tuning — just in-context examples of the thought-action-observation pattern.

Task	Dataset	Metric	CoT	Act	ReAct	Best
Multi-hop QA	HotpotQA	EM	29.4%	25.7%	27.4%	ReAct+CoT: 35.1%
Fact verif.	FEVER	Acc.	56.3%	58.9%	60.9%	ReAct+CoT: 64.6%
Decision (text)	ALFWorld	Success rate	n/a	45%	71%	ReAct: 71%
Decision (web)	WebShop	Success rate	n/a	30.1%	40.0%	ReAct: 40.0%

The pattern is clear: ReAct outperforms Act-only on all four benchmarks, with the largest gap on ALFWorld, and the ReAct + CoT combinations beat both pure methods on HotpotQA and FEVER. The one exception to ReAct's lead is HotpotQA, where CoT alone (29.4%) edges out ReAct alone (27.4%). The paper attributes this to the interleaved format constraining how flexibly the model can reason, and to uninformative search results derailing the trace.

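Qualitative finding (error analysis): The paper's authors manually labeled samples of correct and incorrect HotpotQA trajectories from ReAct and from CoT. Among CoT failures, 56% were hallucinations (a made-up fact), 16% were reasoning errors, and 28% were label ambiguity (a reasonable answer that did not exactly match the gold label). Among ReAct failures, 47% were reasoning errors (including getting stuck repeating the same thought and action), 23% were search result errors (the search returned nothing useful), 29% were label ambiguity, and 0% were hallucinations. ReAct virtually eliminates hallucination as a cause of failure; its new failure modes are getting stuck in the trace and depending on the quality of search results.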

What failure mode does ReAct nearly eliminate compared to Chain-of-Thought?

Hallucination: ReAct verifies facts via tool calls rather than relying on potentially incorrect training-time memorization
Reasoning errors: ReAct's thought traces prevent any logical mistakes
Context length overflow: ReAct compresses observations before adding them to context

Chapter 7: Connections

ReAct is one of the foundational papers in what is now called agentic AI — LLMs that act in the world, not just generate text. Understanding its lineage and descendants maps the field.

Method	Key Idea	vs ReAct
CoT (Wei et al. 2022)	Reason step by step	ReAct adds actions to CoT's reasoning
ReAct (this paper)	Interleave thought + action	—
Reflexion (Shinn et al. 2023)	Reflect on failure, store in memory	ReAct per episode; Reflexion across episodes
Toolformer (Schick 2023)	Fine-tune to call APIs	ReAct: prompting only; Toolformer: weights updated
Tree of Thoughts (Yao 2023)	Branch and prune reasoning paths	ReAct: linear trace; ToT: tree search over thoughts
SWE-agent (Yang 2024)	ReAct loop for code editing	Direct descendant with specialized tools

ReAct → Reflexion: ReAct operates within a single episode. Reflexion (next paper) adds an outer loop: after a ReAct episode fails, a separate reflection step generates a verbal critique ("I looped on the same search too many times") and stores it in persistent memory. The next episode's context includes this critique, allowing the agent to avoid the same mistake. Reflexion is ReAct + episodic memory + verbal RL.

Limitations of ReAct: (1) Token cost: thoughts add roughly 30-50 tokens per step, so a 10-step ReAct trace costs about 300-500 extra tokens. (2) Context length: long traces fill the context window, so earlier observations eventually have to be truncated or dropped. (3) Parsing fragility: if the model generates a malformed action ("Srch[X]" instead of "Search[X]"), a strict parser cannot route it and the trace stalls. (4) No learning: each episode starts fresh; mistakes in one episode are not remembered in the next (Reflexion fixes this). (5) Tool dependency: degraded or unavailable tools cause cascading failures that the reasoning can't recover from.
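
The parsing fragility in (3) can be softened in the harness around the model. The sketch below is one possible mitigation, not something the paper describes: it uses Python's standard `difflib` module to map a misspelled action name to the closest known tool. The `repair_action_name` helper and the 0.6 similarity cutoff are assumptions.

```python
import difflib
from typing import Optional

KNOWN_ACTIONS = ["Search", "Lookup", "Finish"]


def repair_action_name(name: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the closest known action name, or None if nothing is close."""
    if name in KNOWN_ACTIONS:
        return name
    close = difflib.get_close_matches(name, KNOWN_ACTIONS, n=1, cutoff=cutoff)
    return close[0] if close else None


print(repair_action_name("Srch"))
print(repair_action_name("Translate"))
```

When no name is close enough, returning `None` leaves the caller free to report the problem back to the model as an observation instead of guessing.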

Go Deeper

Reflexion (2023) — verbal reinforcement learning across episodes
Chain-of-Thought Prompting (Wei et al. 2022) — the reasoning-only baseline
Tree of Thoughts (2023) — branching reasoning with search

Key Paper

Yao et al. "ReAct: Synergizing Reasoning and Acting in Language Models." 2022. arXiv:2210.03629

"Act not just to execute, but to explore. Think not just to plan, but to verify." — the ReAct philosophy