CS224N Lecture 10 — Agents, Tools, and RAG

Chapter 0: Why Agents?

Ask GPT the weather right now. It can't. Ask it to multiply 7-digit numbers. It gets it wrong. Ask it who won last night's game. It doesn't know. LLMs are brains in jars — they can reason about the world but they cannot perceive it, search it, or act in it.

This is a fundamental architectural limitation, not a training bug. An LLM's knowledge is frozen at the time of its last training data cutoff. It cannot access the internet. It cannot run code. It cannot look up a fact in a database. Everything it "knows" must already be baked into its parameters — and those parameters are static after training.

Even on problems that resemble its training data, the model often pattern-matches rather than computes. Ask GPT-4 to multiply 4,738,291 by 8,234,107 and it will often confidently produce a wrong answer. It doesn't have a calculator; it has a pattern-matching system that has seen multiplication examples during training and is doing its best to pattern-match a new one.

Three Failure Modes

Three core failure modes motivate everything in this lesson:

1. Mathematical reasoning. LLMs generate text token by token. They don't have an ALU (arithmetic logic unit). Multi-digit arithmetic requires carrying, borrowing, and intermediate state — operations that don't naturally map to next-token prediction.

2. Temporal knowledge. The model's training data has a cutoff date. Any event after that date is unknowable. Even events before the cutoff may be misremembered or confused with similar events from different times.

3. Factual lookup. The model stores facts in its weights implicitly, as statistical patterns across billions of parameters. It can't look up a specific fact with certainty — it can only generate what's statistically likely given the prompt. This leads to hallucination: confident, fluent, wrong answers.
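To make the "intermediate state" in point 1 concrete, here is a minimal sketch of schoolbook multiplication over digit lists. The function name `long_multiply` is illustrative and not part of the lecture; the point is the explicit carry that has to be tracked at every step, which a calculator tool handles exactly.

```python
def long_multiply(a: int, b: int) -> int:
    """Schoolbook multiplication of two non-negative ints, digit by digit."""
    xs = [int(ch) for ch in reversed(str(a))]   # least-significant digit first
    ys = [int(ch) for ch in reversed(str(b))]
    result = [0] * (len(xs) + len(ys))          # enough room for every digit

    for i, x in enumerate(xs):
        carry = 0                               # intermediate state for this row
        for j, y in enumerate(ys):
            total = result[i + j] + x * y + carry
            result[i + j] = total % 10          # digit that stays in place
            carry = total // 10                 # digit that moves one column left
        result[i + len(ys)] += carry            # flush the row's final carry

    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return int(digits or "0")


product = long_multiply(4_738_291, 8_234_107)
print(product == 4_738_291 * 8_234_107)
```

Every inner-loop step depends on the carry from the previous one, so a single slip corrupts all higher digits.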

In each case, the fix is the same conceptually: let the model delegate to an external system that can actually do the job. A calculator for math. A search engine for current events. A database for factual lookup. The model's role shifts from "know everything" to "know when and how to ask for help."

An LLM without tools is a brain in a jar — it can think but can't see, search, or calculate. The entire field of LLM agents is about breaking this jar: giving models the ability to take actions in the world and observe the results.

The Agent Paradigm

An agent is an LLM that can take actions. Instead of just generating text, it can call tools — search engines, calculators, code interpreters, APIs — and incorporate the results into its reasoning. The key loop is: Think → Act → Observe → Repeat.

This lesson builds the agent stack from the bottom up. We start with retrieval-augmented generation (RAG), the simplest form of "letting the model look things up." Then comes ReAct, which lets the model reason about what to look up and when. Then tool use and Toolformer: letting the model call functions, and learn when to call them. Finally, the full agent loop puts the pieces together. By the end, you'll understand both how agents work and where they break.

Why can't an LLM correctly multiply 4,738,291 × 8,234,107?

A. The model wasn't trained on enough math data
B. LLMs predict tokens sequentially and lack an arithmetic unit: they pattern-match rather than compute, and multi-digit arithmetic needs intermediate state they can't maintain
C. The numbers are too large for the model's context window

Chapter 1: Retrieval-Augmented Generation

What if the model could look things up before answering, instead of relying on memorization? This is the core idea of Retrieval-Augmented Generation (RAG): at inference time, retrieve relevant documents from an external knowledge base and condition the model's generation on them.

Think of it as the difference between a closed-book exam and an open-book exam. In a closed-book exam (standard LLM), you must have memorized every fact you need. In an open-book exam (RAG), you can flip to the relevant page before answering. The student still needs to understand the material — the book helps with specifics.

Closed-Book vs. Open-Book

Closed-book (parametric knowledge only): The model stores all knowledge in its parameters. To answer "What is the population of Tokyo?", it must have encoded this fact during training. If the training data is old, the answer is old. If the fact is rare, the model may hallucinate a plausible but wrong number.

Open-book (parametric + retrieved knowledge): Before generating an answer, the model retrieves the top-k most relevant documents from a document store. These documents are concatenated with the query and fed to the generator. The model can now reference specific, up-to-date information rather than relying on stale memories.

The benefits are immediate and measurable. Lewis et al. (2020) showed that RAG-Sequence matched or exceeded the state of the art on Natural Questions, TriviaQA, and WebQuestions — all open-domain QA benchmarks — despite the generator being smaller than comparable closed-book models. The retriever offloads factual memory from the parameters to the document store.

The Architecture at a Glance

Every RAG system has three components:

Document Store

A corpus of documents (Wikipedia, manuals, knowledge base) indexed for fast retrieval. Can be updated at any time without retraining the model.

↓

Retriever

Given a query, finds the top-k most relevant documents. Can be sparse (BM25/TF-IDF) or dense (neural embeddings). Usually the bottleneck for accuracy: the generator cannot use a document the retriever never returned.

↓

Generator

A language model (e.g., BART, T5) that takes [query + retrieved docs] as input and produces the answer. The generator reads the docs and synthesizes an answer.

RAG separates knowledge (document store) from reasoning (LLM). Update knowledge by adding docs — no retraining. This is huge. An LLM that cost $10M to train can get new knowledge for the cost of indexing a new document. A company's internal knowledge base becomes the model's memory, instantly.

Why Not Just Increase Model Size?

You might think: if the model needs more knowledge, just train a bigger model. But this hits four walls:

Problem	Bigger Model	RAG
New information	Must retrain ($M+)	Add docs (indexing cost only)
Memory footprint	Grows linearly with knowledge	Knowledge is external
Provenance	Can't cite sources	Can point to retrieved docs
Domain adaptation	Needs domain-specific fine-tuning	Swap in domain-specific corpus

RAG also provides attributability: you can inspect which documents the model used, verify the sources, and debug wrong answers by checking whether the retriever found the right documents. With a closed-book model, wrong answers are opaque — you can't tell whether the model lacked the knowledge or had it but failed to use it.

Data Flow: What Actually Happens

Let's trace a concrete RAG query end to end. The user asks: "What is the tallest building in the world as of 2024?"

python
# Step 1: Encode the query
query = "What is the tallest building in the world as of 2024?"
q_emb = query_encoder(query)   # shape: [768]

# Step 2: Retrieve top-k docs from index
scores = q_emb @ doc_embeddings.T   # shape: [num_docs]
top_k_ids = scores.argsort()[::-1][:5]  # top 5 doc IDs, best first (NumPy array assumed)
docs = [corpus[i] for i in top_k_ids]

# Step 3: Concatenate query + docs as generator input
context = "\n\n".join(docs)
input_text = f"question: {query} context: {context}"

# Step 4: Generate answer conditioned on retrieved docs
answer = generator.generate(input_text)
# → "The Burj Khalifa in Dubai at 828m (2,717 ft)"

The key insight: the generator never needs to have memorized "Burj Khalifa." It reads the retrieved document and extracts the answer. The retriever does the knowledge work; the generator does the language work.

What is the main advantage of RAG over simply training a larger model?

A. Knowledge can be updated by adding documents without retraining the model, and answers can be attributed to specific sources
B. RAG models generate faster text
C. RAG eliminates all hallucinations

Chapter 2: Dense Retrieval

Not keyword matching — encode meaning into vectors. Traditional information retrieval systems like BM25 match documents to queries based on word overlap: if the query contains "tallest building," find documents containing those exact words. This works surprisingly well for many tasks, but it fails when the query and the answer use different words for the same concept.

Consider the query "What's the largest skyscraper?" and a document that says "The Burj Khalifa stands at 828 meters, making it the tallest structure ever built." BM25 might miss this because the query says "largest skyscraper" while the document says "tallest structure." A human instantly recognizes these mean the same thing. Dense retrieval bridges this gap by encoding both queries and documents as vectors in a shared embedding space, where semantic similarity is measured by vector distance.
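The lexical mismatch can be reproduced with a minimal BM25 scorer. This is a sketch under stated assumptions: the tokenizer is a naive whitespace split, the parameters k1 = 1.5 and b = 0.75 are common defaults rather than values from the lecture, and the second document is an invented distractor that shares keywords with the query.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    tokens = (tok.strip(".,?!\"'").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    doc_tokens = [tokenize(d) for d in docs]
    avg_len = sum(len(toks) for toks in doc_tokens) / len(doc_tokens)
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in doc_tokens for term in set(toks))

    scores = []
    for toks in doc_tokens:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue                      # no shared word, no credit
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "The Burj Khalifa stands at 828 meters, making it the tallest structure ever built.",
    "The largest skyscraper festival in the region takes place every spring.",  # invented distractor
]
scores = bm25_scores("What's the largest skyscraper?", docs)
print(scores)
```

The distractor outscores the Burj Khalifa passage because it repeats the query's words, even though only the first document answers the question.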

Two Encoders, One Space

Dense Passage Retrieval (DPR), introduced by Karpukhin et al. (2020), uses two separate BERT encoders:

Query encoder E_Q: Takes the question as input, outputs a 768-dimensional vector representing the question's meaning. This encoder learns to represent "what information is needed" in vector space.

Document encoder E_D: Takes a document passage as input, outputs a 768-dimensional vector representing the passage's content. This encoder learns to represent "what information is available" in the same vector space.

Training objective: for a question-passage pair (q, p⁺) known to be relevant, maximize the dot product E_Q(q) · E_D(p⁺) while minimizing it for irrelevant passages p⁻. This pushes relevant pairs close together and irrelevant pairs apart in the shared space.

sim(q, d) = E_Q(q)^T · E_D(d)

At inference time, the retriever encodes the query once, then performs a Maximum Inner Product Search (MIPS) over the pre-computed document embeddings. Libraries like FAISS can search billions of vectors in milliseconds using approximate nearest neighbor algorithms.
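As a rough illustration of the search step, the sketch below performs exact maximum inner product search with NumPy. The random vectors stand in for pre-computed document embeddings and `mips_top_k` is a hypothetical helper name; a library such as FAISS replaces this brute-force scan with indexed (and optionally approximate) search at scale.

```python
import numpy as np


def mips_top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int):
    """Exact maximum inner product search; returns (ids, scores), best first."""
    scores = doc_matrix @ query_vec                # [num_docs]
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]      # unordered top-k
    order = top[np.argsort(-scores[top])]          # sort only the k winners
    return order, scores[order]


rng = np.random.default_rng(0)
doc_matrix = rng.normal(size=(1000, 768)).astype(np.float32)   # stand-in for E_D(d)
# A query that lies close to document 42 in embedding space.
query_vec = doc_matrix[42] + 0.1 * rng.normal(size=768).astype(np.float32)

ids, top_scores = mips_top_k(query_vec, doc_matrix, k=5)
print(ids[0])
```

Because the document embeddings do not depend on the query, they are computed once offline; only the single query encoding and the search happen at request time.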

DPR: two BERT encoders — one for queries, one for docs. Trained so relevant pairs have high dot-product. At inference, encode the query once, then search pre-computed doc embeddings with FAISS. The document index can contain millions of passages and be searched in ~10ms.

Training DPR

DPR uses a contrastive loss similar to what you'd see in CLIP. For each question, there's one positive passage (the correct answer passage) and several negative passages. The loss for a single question q_i with positive passage p_i⁺ is:

L(q_i, p_i⁺, p_i,1⁻, ..., p_i,n⁻) = −log (e^{sim(q_i, p_i⁺)} / (e^{sim(q_i, p_i⁺)} + ∑_j e^{sim(q_i, p_i,j⁻)}))

Where do the negatives come from? DPR uses three types:

Negative Type	Source	Purpose
Random	Random passages from the corpus	Easy negatives to establish baseline separation
BM25	Top BM25 results that aren't the answer	Hard negatives that share keywords but wrong content
In-batch	Positive passages for other questions in the same batch	Efficient extra negatives that reuse embeddings already computed for the batch

In-batch negatives are particularly clever: in a batch of 128 questions, each question's positive passage serves as a negative for the other 127 questions. This gives 127 free negatives per question without any extra encoder forward passes.
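The in-batch trick reduces to a single matrix product. The sketch below is a NumPy illustration rather than DPR's actual training code: `in_batch_nll` is a hypothetical name, and random vectors replace real encoder outputs. Row i of the similarity matrix holds question i against every passage in the batch, with its own positive on the diagonal.

```python
import numpy as np


def in_batch_nll(q_emb: np.ndarray, p_emb: np.ndarray) -> float:
    """Mean negative log-likelihood of each question's positive passage.

    q_emb: [B, d] question embeddings.
    p_emb: [B, d] passage embeddings; row i is the positive for question i,
           and every other row acts as a negative for question i.
    """
    sim = q_emb @ p_emb.T                                # [B, B] dot products
    sim = sim - sim.max(axis=1, keepdims=True)           # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # positives on the diagonal


rng = np.random.default_rng(0)
passages = rng.normal(size=(8, 64))
aligned = in_batch_nll(passages, passages)          # each question matches its positive
mismatched = in_batch_nll(passages, passages[::-1])  # positives assigned to the wrong rows
print(aligned, mismatched)
```

The loss is low when each question is closest to its own passage and high when the pairing is scrambled, which is the signal that pulls relevant pairs together during training.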

DPR vs. BM25: When Dense Wins

Dense retrieval isn't always better than BM25. The tradeoffs are clear:

Dimension	BM25 (sparse)	DPR (dense)
Lexical overlap queries	Strong	Comparable
Semantic matching	Fails	Strong
Zero-shot (new domain)	Strong	Needs fine-tuning
Index size	Inverted index, compact	Dense vectors, ~3x larger
Search speed	~1ms	~10ms (with FAISS)

In practice, many production systems use a hybrid approach: BM25 for initial candidate retrieval (fast, keyword-based), then a dense reranker to score the top candidates by semantic similarity. This combines the speed of sparse retrieval with the accuracy of dense matching.
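A two-stage pipeline of this kind might look like the following sketch. It is deliberately simplified: token overlap is a crude stand-in for BM25, the 2-D vectors are hand-made rather than produced by an encoder, and `hybrid_search` is a hypothetical name.

```python
import numpy as np


def hybrid_search(query_tokens, query_vec, docs_tokens, doc_vecs, n_candidates=3, k=2):
    """Sparse candidate generation followed by dense reranking."""
    # Stage 1 (sparse): rank by number of shared tokens, a stand-in for BM25.
    overlap = np.array([len(set(query_tokens) & set(toks)) for toks in docs_tokens])
    candidates = np.argsort(-overlap, kind="stable")[:n_candidates]

    # Stage 2 (dense): rerank only the surviving candidates by inner product.
    dense = doc_vecs[candidates] @ query_vec
    reranked = candidates[np.argsort(-dense, kind="stable")]
    return reranked[:k].tolist()


docs_tokens = [
    ["tallest", "structure", "burj", "khalifa"],
    ["largest", "skyscraper", "festival"],
    ["largest", "city", "parks"],
    ["recipe", "for", "bread"],
]
doc_vecs = np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9], [0.0, 1.0]])
query_tokens = ["largest", "skyscraper"]
query_vec = np.array([1.0, 0.0])

wide = hybrid_search(query_tokens, query_vec, docs_tokens, doc_vecs, n_candidates=3)
narrow = hybrid_search(query_tokens, query_vec, docs_tokens, doc_vecs, n_candidates=2)
print(wide, narrow)
```

With three candidates the reranker promotes document 0 to the top; with only two, document 0 never reaches the dense stage. The reranker can only reorder what the sparse stage surfaces, so the candidate pool size matters.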

What makes dense retrieval (DPR) better than BM25 for the query "largest skyscraper" when the answer document says "tallest structure"?

A. DPR searches faster than BM25
B. DPR encodes semantic meaning into vectors, so "largest skyscraper" and "tallest structure" map to nearby points even without shared keywords
C. DPR uses a bigger database than BM25

Chapter 3: RAG Architecture

The retriever learns to find documents that help THIS generator produce correct answers. This is the insight that distinguishes RAG from simple "retrieve then read" pipelines: the retriever and generator can be trained jointly, end-to-end, so the retriever learns what the generator needs.

The RAG Pipeline

Lewis et al. (2020) proposed two variants of RAG, both sharing the same pipeline but differing in how they marginalize over retrieved documents:

Step 1: Query encoding. The input question x is encoded by the query encoder (BERT-based DPR) into a dense vector q = E_Q(x). Shape: [768].

Step 2: MIPS retrieval. The query vector is used to retrieve the top-k documents (typically k=5 or k=10) from a FAISS index containing millions of pre-computed document embeddings. This takes ~10ms.

Step 3: Concatenation. Each retrieved document d_i is concatenated with the original question: input_i = [x; d_i]. This creates k separate input sequences for the generator.

Step 4: Generation. The generator (BART-large) produces output tokens conditioned on each input. The final output marginalizes over the k documents.

Step 5: Marginalization. This is where the two variants differ.

RAG-Token vs. RAG-Sequence

The two variants differ in when they marginalize over retrieved documents:

RAG-Sequence: The same retrieved document conditions the entire output sequence. The probability of a sequence is then marginalized over the top-k documents. Formally:

p_RAG-Seq(y|x) ≈ ∑_{z ∈ top-k} p_η(z|x) ∏_i p_θ(y_i|x, z, y_1:i-1)

Each document scores a complete answer, and a candidate's final probability is its retrieval-weighted sum across all documents. Think of it as: "Get five opinions on the whole answer, then weigh each by how much you trust its source."

RAG-Token: At each token position, marginalize over all k documents independently. Different tokens can be "sourced" from different documents. Formally:

p_RAG-Tok(y|x) ≈ ∏_i ∑_{z ∈ top-k} p_η(z|x) p_θ(y_i|x, z, y_1:i-1)

The sum and product are swapped. Each output token can draw from a different document. Think of it as: "For each word, consult all five sources and weigh their votes."

RAG-Token: different docs contribute to different words. RAG-Sequence: one doc per answer. Token is more flexible. In practice, RAG-Token is slightly better for tasks where the answer synthesizes information from multiple sources (e.g., "List three reasons..."). RAG-Sequence is simpler and works well for factoid QA where one passage contains the whole answer.
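A small numeric example shows how the two formulas diverge. The probabilities below are made up for illustration: two retrieved documents, a two-token target, and each document supporting only one of the two tokens.

```python
import numpy as np


def rag_sequence_prob(doc_probs: np.ndarray, token_probs: np.ndarray) -> float:
    """sum_z p(z|x) * prod_i p(y_i | x, z, y_<i)"""
    return float(np.sum(doc_probs * np.prod(token_probs, axis=1)))


def rag_token_prob(doc_probs: np.ndarray, token_probs: np.ndarray) -> float:
    """prod_i sum_z p(z|x) * p(y_i | x, z, y_<i)"""
    return float(np.prod(doc_probs @ token_probs))


# token_probs[z, i] = p(y_i | x, z, y_<i) for document z and target position i.
doc_probs = np.array([0.5, 0.5])
token_probs = np.array([
    [0.9, 0.1],   # document 0 supports the first token only
    [0.1, 0.9],   # document 1 supports the second token only
])

seq = rag_sequence_prob(doc_probs, token_probs)   # 0.5 * 0.09 + 0.5 * 0.09
tok = rag_token_prob(doc_probs, token_probs)      # 0.5 * 0.5
print(seq, tok)
```

When the evidence is split across documents, RAG-Token assigns the target a higher probability (0.25 versus 0.09 here) because it can mix sources position by position, while RAG-Sequence needs a single document to support the whole answer.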

Training: End-to-End Learning

The generator (BART) is trained with standard cross-entropy loss on the target answer tokens. The retriever gradient flows through the document retrieval step via the retrieval probability p_η(z|x). Updating the document encoder would mean re-embedding and re-indexing millions of documents as training progresses, which is expensive, so Lewis et al. keep the document encoder and its index fixed and fine-tune only the query encoder and the generator.

The training data is simply (question, answer) pairs — no document annotations needed. The retriever learns to find useful documents by backpropagating through the generation loss. If a document helps the generator produce the correct answer, the retriever learns to rank that document higher.

python
# Simplified RAG-Token forward pass.
# Assumed to exist: query_encoder (returns a float32 NumPy vector), faiss_index,
# corpus, and generator (returns per-position logits as a torch tensor).
import torch

def rag_token_forward(query, top_k=5):
    # Encode query
    q_emb = query_encoder(query)          # [768]

    # Retrieve top-k docs; FAISS expects a 2-D batch of queries
    doc_scores, doc_ids = faiss_index.search(q_emb[None, :], top_k)  # each [1, k]
    docs = [corpus[i] for i in doc_ids[0]]
    retrieval_probs = torch.softmax(torch.from_numpy(doc_scores[0]), dim=0)  # [k]

    # Generate with each doc
    all_logits = []
    for doc in docs:
        input_text = query + " [SEP] " + doc
        logits = generator(input_text)     # [seq_len, vocab]
        all_logits.append(logits)

    # Marginalize: for each token position,
    # weighted sum of probs across all docs
    all_logits = torch.stack(all_logits)    # [k, seq_len, vocab]
    all_probs = torch.softmax(all_logits, dim=-1)
    # Reshape retrieval_probs to [k, 1, 1] so it broadcasts over positions and vocab
    marginal = (retrieval_probs[:, None, None] * all_probs).sum(dim=0)  # [seq_len, vocab]
    return marginal

In RAG-Token, how is the output for each token position determined?

A. Each token uses only the single most relevant document
B. The model ignores the retrieved documents and generates freely
C. Each token position marginalizes over all k documents: the probability of each token is a weighted sum across all retrieved documents

Chapter 4: ReAct

What if the model could think out loud, take an action, observe the result, then think again? This is ReAct (Yao et al., 2022) — a prompting framework that interleaves Reasoning (chain-of-thought) with Acting (tool calls) in a single LLM generation.

Before ReAct, there were two separate approaches to improving LLM performance. Chain-of-thought (CoT) prompting lets the model "think step by step" before answering, improving reasoning accuracy. Action-based approaches let the model call external tools (search, lookup, calculate). ReAct's insight: combine both. Let the model alternate between thinking and acting.

The Thought-Action-Observation Loop

A ReAct trace has three types of steps, repeated in a loop:

Thought: The model reasons about what it knows and what it needs to find out. This is internal — it's CoT reasoning. Example: "I need to find who directed Inception, then find that director's birth year."

Action: The model emits a structured action string that invokes an external tool. Example: "Search[Inception film director]". The available actions are defined in the prompt — typically Search[query], Lookup[keyword], and Finish[answer].

Observation: The tool returns its result, which is appended to the context. Example: "Inception is a 2010 film directed by Christopher Nolan." The model then reads this observation and decides what to do next.

ReAct traces are human-readable. You can see WHY the agent acted, unlike an act-only agent, whose tool calls come with no stated reasoning. When a ReAct agent gets the wrong answer, you can read the trace and pinpoint where it went wrong: did it search for the wrong thing? Did it misinterpret the observation? Did it reason incorrectly? This debuggability is a major practical advantage.

ReAct vs. CoT vs. Act-Only

Yao et al. compared prompting approaches on multi-hop reasoning tasks (HotpotQA) and fact verification (FEVER):

Method	HotpotQA (EM)	FEVER (Acc)	Traces Readable?
CoT (reason only, no tools)	29.4	56.3	Yes (but can hallucinate facts)
Act-only (tools, no reasoning)	25.7	58.9	No (black-box actions)
ReAct (reason + act)	27.4	60.9	Yes (grounded by observations)
ReAct + CoT-SC (best combination)	35.1	64.6	Yes (grounded by observations)

The key finding: reasoning without acting hallucinates facts (CoT makes up details it could have looked up). Acting without reasoning makes wrong action choices (the model searches for the wrong thing because it didn't plan). ReAct beats Act-only on both tasks and beats CoT on FEVER, but on HotpotQA plain ReAct trails CoT slightly. The best scores on both tasks come from combining ReAct with self-consistent CoT (CoT-SC), falling back from one method to the other when the first fails. Reasoning guides the search, and search grounds the reasoning.

The Prompt Format

ReAct is a pure prompting approach — no model fine-tuning required. You provide few-shot examples of Thought/Action/Observation traces in the prompt, and the model continues the pattern for new questions:

ReAct prompt structure
Question: Were Scott Derrickson and Ed Wood of the same nationality?

Thought 1: I need to find the nationality of Scott Derrickson
            and Ed Wood, then compare.
Action 1: Search[Scott Derrickson]
Observation 1: Scott Derrickson (born July 16, 1966) is an
               American filmmaker.
Thought 2: Scott Derrickson is American. Now I need to find
            Ed Wood's nationality.
Action 2: Search[Ed Wood]
Observation 2: Edward Davis Wood Jr. was an American filmmaker.
Thought 3: Both are American, so yes, they share the same
            nationality.
Action 3: Finish[yes]

The model learns to follow this pattern from the examples. It generates "Thought N:" to reason, "Action N:" to call a tool, and then the system injects "Observation N:" with the tool's output before letting the model continue.
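On the system side, the only parsing needed is to pull the tool name and argument out of the model's latest "Action N:" line. A minimal sketch follows, assuming the `Tool[argument]` syntax from the trace above; the `Action` tuple and `parse_action` name are illustrative choices.

```python
import re
from typing import NamedTuple, Optional


class Action(NamedTuple):
    tool: str
    argument: str


# Matches lines such as "Action 2: Search[Ed Wood]"; the step number is optional.
ACTION_RE = re.compile(r"Action\s*\d*\s*:\s*(\w+)\[(.*?)\]\s*$", re.MULTILINE)


def parse_action(text: str) -> Optional[Action]:
    """Return the last 'Action N: Tool[argument]' found in the model output."""
    matches = ACTION_RE.findall(text)
    if not matches:
        return None                      # the model produced no well-formed action
    tool, argument = matches[-1]
    return Action(tool, argument.strip())


output = (
    "Thought 1: I need to find the nationality of Scott Derrickson and Ed Wood.\n"
    "Action 1: Search[Scott Derrickson]"
)
action = parse_action(output)
print(action)
```

Returning `None` for malformed output lets the caller decide whether to re-prompt the model or stop, instead of crashing on unexpected text.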

Error Recovery

One of ReAct's strengths is error recovery. If a search returns no results or irrelevant information, the model can recognize this in its next Thought step and reformulate the query. For example:

error recovery
Thought 3: The search didn't return the birth year.
            Let me try a more specific query.
Action 3: Search[Christopher Nolan birth date]
Observation 3: Christopher Edward Nolan was born on
               30 July 1970 in London.
Thought 4: Found it. Christopher Nolan was born in 1970.

This self-correcting behavior emerges from the few-shot examples — if you include examples where the agent recovers from bad searches, the model learns to do the same.

Why does ReAct outperform both pure CoT and pure Act-only approaches?

A. Reasoning guides which tools to call and how to interpret results, while tool outputs ground the reasoning in real facts, preventing both hallucination (CoT alone) and wrong tool choices (Act alone)
B. ReAct uses a larger language model than CoT or Act-only
C. ReAct fine-tunes the model on reasoning examples

Chapter 5: Tool Use

The model writes "CALC(347 × 892)". A calculator returns 309,524. The model continues with the correct number. This is tool use in its simplest form: the LLM emits a special token sequence that triggers an external computation, and the result is injected back into the generation.

The concept is powerful because it plays to each component's strengths. The LLM excels at understanding natural language, decomposing problems, and knowing when a calculation or lookup is needed. The tool excels at the actual computation. Together they achieve what neither can alone.

How Tool Calls Work

At the implementation level, tool use follows a simple protocol:

1. Generation pause. The model generates tokens normally until it produces a special tool-call token or pattern (e.g., [CALC], <tool_call>, or a JSON block). Generation pauses.

2. Argument extraction. The system parses the tool name and arguments from the generated tokens. For example, CALC(347 * 892) → tool="calculator", args="347 * 892".

3. Tool execution. The external tool runs with the provided arguments. A calculator evaluates the expression. A search engine queries the web. A Python interpreter runs the code.

4. Result injection. The tool's output is inserted into the context, typically formatted as [RESULT: 309524]. The model then continues generating from this point, with access to the correct result.
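Steps 2 through 4 can be sketched in a few lines. This is an illustration under simplifying assumptions: it post-processes a finished string instead of pausing a live decoder, it only recognizes the `CALC(...)` pattern without nested parentheses, and the helper names are hypothetical. The calculator walks the expression's syntax tree so that arbitrary code is never passed to `eval()`.

```python
import ast
import operator
import re

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def safe_eval(expr: str):
    """Evaluate +, -, *, / over numeric literals without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval").body)


CALL_RE = re.compile(r"CALC\(([^)]*)\)")


def run_tool_calls(generated: str) -> str:
    """Extract each CALC(...) call, execute it, and inject the result."""
    def inject(match: re.Match) -> str:
        result = safe_eval(match.group(1))            # step 3: tool execution
        return f"{match.group(0)} [RESULT: {result}]"  # step 4: result injection
    return CALL_RE.sub(inject, generated)             # step 2: argument extraction


text = run_tool_calls("The total is CALC(347 * 892) units.")
print(text)
```

In a real system the injected result becomes part of the context before the next token is sampled, so the model conditions on the exact number rather than guessing it.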

The model doesn't need to BE a calculator. It needs to know WHEN to call one and HOW to format the call. This is a profound shift. Instead of trying to make LLMs do everything, we give them the metacognitive ability to recognize their own limitations and delegate to specialized tools.

Common Tool Types

Tool	Input	Output	Fixes
Calculator	Math expression	Numeric result	Arithmetic errors
Web search	Search query	Top results + snippets	Knowledge cutoff
Code interpreter	Python code	stdout + return value	Complex logic, data processing
Knowledge base	Structured query	Matching records	Domain-specific lookup
Calendar/API	API call parameters	API response	Real-world actions

Modern Tool Call Formats

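Different systems use different formats for tool calls. The trend is toward structured JSON (the examples here are simplified; real responses also attach an ID to each call so its result can be matched back):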

OpenAI function calling format
{
  "tool_calls": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "arguments": "{\"location\": \"Tokyo\", \"unit\": \"celsius\"}"
    }
  }]
}

Anthropic tool use format
{
  "type": "tool_use",
  "name": "calculator",
  "input": {"expression": "347 * 892"}
}

The model is trained (via SFT and RLHF) to produce these structured outputs when appropriate. The key engineering decisions:

Tool descriptions: Each tool comes with a schema (name, description, parameters, types). The model sees these descriptions in its system prompt and learns to match tasks to available tools.

Parallel vs. sequential calls: Some systems allow the model to request multiple tool calls in a single turn (e.g., search for two things simultaneously). Others require one call at a time with observation between calls.

Error handling: When a tool call fails (invalid arguments, timeout, API error), the error message is returned as the observation. Well-trained models learn to retry with corrected arguments or try a different approach.
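The error-handling point can be sketched as a small dispatcher for a call shaped like the first format above. The `get_weather` function here is a hypothetical stub, not a real API; the part worth noting is that failures are returned as strings so the model can read them as observations and retry.

```python
import json


def get_weather(location: str, unit: str = "celsius") -> str:
    # Hypothetical stub: a real tool would call a weather service here.
    return f"(stub) weather for {location} in {unit}"


TOOLS = {"get_weather": get_weather}


def dispatch(tool_call: dict) -> str:
    """Run one tool call; any failure is returned as an observation string."""
    fn = tool_call.get("function", {})
    name = fn.get("name")
    if name not in TOOLS:
        return f"Error: unknown tool {name!r}"
    try:
        kwargs = json.loads(fn.get("arguments", "{}"))  # arguments arrive as a JSON string
        return str(TOOLS[name](**kwargs))
    except (json.JSONDecodeError, TypeError) as exc:
        return f"Error: bad arguments ({exc})"


ok = dispatch({
    "type": "function",
    "function": {"name": "get_weather",
                 "arguments": '{"location": "Tokyo", "unit": "celsius"}'},
})
unknown = dispatch({"type": "function", "function": {"name": "nope", "arguments": "{}"}})
malformed = dispatch({"type": "function",
                      "function": {"name": "get_weather", "arguments": "{not json"}})
print(ok, unknown, malformed, sep="\n")
```

Validating the tool name before executing anything also covers the case where the model invents a tool that was never in its schema.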

What happens when an LLM generates a tool call token during text generation?

A. Generation pauses, the tool arguments are extracted and sent to the external tool, the tool's result is injected back into the context, and generation continues
B. The LLM internally simulates the tool and generates the answer itself
C. The tool call is ignored and the model keeps generating text

Chapter 6: Toolformer

What if the model figured out on its own when tools would help, with no human labels? Toolformer (Schick et al., 2023) does exactly this: it uses the language model's own loss to decide where tool calls should be inserted into training data, then fine-tunes the model on this self-annotated data.

The key insight: if inserting a tool call at a particular position in the text reduces the model's perplexity on subsequent tokens, that tool call is useful. If it doesn't help (or hurts), discard it. The LM loss itself becomes the label.

The Toolformer Pipeline

Toolformer works in five steps, all automated:

Step 1: Sample candidate positions. For each position in the training text, ask the model: "Would a tool call be helpful here?" Use a few-shot prompt to generate candidate tool calls at promising positions. Not every position gets a candidate — only positions where the model, prompted with examples of tool usage, generates a tool call.

Step 2: Generate candidate calls. At each sampled position, generate several candidate tool calls using the LLM itself. For a calculator tool, this might produce [CALC(15 * 7.3)]. For a search tool, [SEARCH("population of France")].

Step 3: Execute the calls. Run each candidate tool call to get the result. [CALC(15 * 7.3)] → 109.5.

Step 4: Compare perplexity. For each candidate, measure the model's perplexity on the tokens after the tool call position in two conditions: (a) with the tool call and result inserted, and (b) without. If condition (a) has significantly lower perplexity, the tool call is useful.

Step 5: Keep or discard. Only tool calls that reduce perplexity by at least a threshold τ are kept. These are inserted into the training data. The model is then fine-tuned on this augmented data, learning to generate tool calls at appropriate positions during normal text generation.

Zero human-labeled examples. The LM loss itself signals whether a tool call was useful. This is elegant: the model's own uncertainty (perplexity) becomes the supervision signal. If the model is uncertain about "15 × 7.3 = ?" and a calculator makes it certain, the tool call earns its place.

The Filtering Criterion

Formally, let x be the text, i be the candidate position, c be the tool call, r be the tool result, and L_i be the loss on tokens after position i. The filtering criterion is:

min(L_i(no call), L_i(c with empty result)) − L_i(c → r) ≥ τ

The baseline is the better of two alternatives: no call at all, and the call with its result left empty. Comparing against the empty-result call ensures the benefit comes from the tool's result, not from the tool call syntax acting as a formatting cue. Comparing against no call ensures the insertion actually improves on the plain text.
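The filter itself is a one-line comparison once the losses are available. In the sketch below the log-probabilities are made up, the per-token weights are an assumption added for illustration, and the baseline is passed as a list so the same function works with a single baseline or the minimum over several.

```python
def weighted_loss(token_logprobs: list[float], weights: list[float]) -> float:
    """L_i = -sum_j w_j * log p(x_j | prefix) over the tokens after position i."""
    return -sum(w * lp for w, lp in zip(weights, token_logprobs))


def keep_call(loss_with_result: float, baseline_losses: list[float], tau: float) -> bool:
    """Keep the call only if it beats the best baseline by at least tau."""
    return min(baseline_losses) - loss_with_result >= tau


weights = [1.0, 0.8]                       # assumed: nearer tokens count more
loss_plain = weighted_loss([-2.0, -1.5], weights)    # no tool call in the context
loss_empty = weighted_loss([-1.9, -1.5], weights)    # call inserted, result left empty
loss_full = weighted_loss([-0.1, -0.2], weights)     # call and result inserted

useful = keep_call(loss_full, [loss_plain, loss_empty], tau=1.0)
useless = keep_call(loss_empty, [loss_plain], tau=1.0)
print(useful, useless)
```

The second case shows why syntax alone does not pass the filter: the empty call barely changes the loss, so it falls short of the threshold.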

Results: Which Tools Help Most?

Schick et al. tested Toolformer with five tools on a 6.7B parameter GPT-J model:

Tool	Task	Without Tool	With Toolformer	Improvement (points)
Calculator	Math (ASDiv)	43.3%	76.6%	+33.3
Wikipedia search	QA (TempLAMA)	21.0%	40.5%	+19.5
Calendar	Date tasks	36.0%	86.4%	+50.4
Machine translation	MLQA	26.2%	32.1%	+5.9
QA system	Knowledge QA	34.1%	40.8%	+6.7

The calculator and calendar tools provide the largest improvements because they address tasks where the LLM is fundamentally limited: exact arithmetic and temporal reasoning. Search and QA tools help less because the model already has substantial factual knowledge from pretraining — tools mainly help for facts that are rare or post-cutoff.

Limitations

Toolformer has important constraints. It can only use tools that can be expressed as text-in, text-out APIs. It can only use one tool per position (no chaining). And the self-supervised filtering is noisy: some useful tool calls are discarded, and some useless ones slip through. Modern systems (GPT-4, Claude) take a different approach: tool use is generally understood to be taught during post-training (supervised fine-tuning and RLHF), with human feedback on when tools should be called.

How does Toolformer decide whether a candidate tool call is worth keeping?

A. Human annotators review each candidate tool call
B. It compares the model's perplexity on subsequent tokens with and without the tool call; if the tool call reduces perplexity by at least a threshold τ, it's kept
C. It randomly selects half the candidates to keep

Chapter 7: Agent Loop

Give the agent a question and it thinks, searches, calculates, and arrives at an answer. This chapter combines everything from the previous chapters: retrieval, tool use, and the ReAct reasoning loop.

A real agent orchestrator works like this: the LLM receives a system prompt listing its available tools (search, calculator, Python, knowledge base), a set of few-shot examples showing the Thought/Action/Observation format, and the user's question. It then enters a loop:

Thought

The model reasons about what it knows and what it still needs. "I need to find X to answer the question."

↓

Action

The model selects a tool and formulates a query. "Search[X topic]" or "CALC(expression)".

↓

Observation

The tool returns its result. The model reads the result and integrates it into its reasoning.

↻ repeat until Action = Finish[answer]

Multi-hop questions are the real test: the agent must chain multiple searches and calculations to build an answer. Along the way it sometimes retrieves irrelevant documents and must reformulate its query. This is the fundamental challenge: robustness. The agent must handle noisy, incomplete, and sometimes contradictory information from its tools.

Agent Architecture in Production

Real production agents (like those powering ChatGPT, Claude, and Gemini) use a more sophisticated version of this loop. Key engineering decisions:

Tool selection: The model sees a schema of all available tools in its system prompt. It must select the right tool, provide correctly formatted arguments, and handle errors gracefully. This behavior is learned in post-training: supervised fine-tuning teaches the tool-call format, and RLHF teaches the model, from human preferences, which tool calls are appropriate.

Context management: As the T/A/O trace grows, the context window fills up. Production systems manage this by summarizing earlier steps, truncating irrelevant observations, or using sliding-window approaches.

Safety guardrails: Before executing a tool call, the system checks: Is this action safe? Does it access sensitive data? Could it have irreversible side effects? High-risk actions (sending emails, modifying databases) may require user confirmation.

Max iterations: A hard limit on loop iterations prevents infinite loops. If the agent hasn't produced a Finish action after N steps (typically 5-15), it's forced to produce a best-effort answer from what it has.
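
The context-management idea from the list above (truncating older observations) can be sketched as follows. This is a minimal illustration, not a description of any production system: the `Step` dataclass and `render_trace` function are hypothetical, and character counts stand in for token counts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    thought: str
    action: str
    observation: str


def render_trace(steps: List[Step], keep_last: int = 3,
                 max_obs_chars: int = 80) -> str:
    """Render a T/A/O trace, shortening observations of older steps.

    The most recent `keep_last` steps are kept verbatim; earlier
    observations are cut to `max_obs_chars` characters.
    """
    cutoff = len(steps) - keep_last
    lines = []
    for i, s in enumerate(steps):
        obs = s.observation
        if i < cutoff and len(obs) > max_obs_chars:
            obs = obs[:max_obs_chars] + " [truncated]"
        lines.append(
            f"Thought: {s.thought}\nAction: {s.action}\nObservation: {obs}"
        )
    return "\n".join(lines)
```

Thoughts and actions are kept in full because they are short and record why each step was taken. Observations are usually the bulk of the trace, so they are the first thing to trim.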

python
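# Simplified agent loop.
# Assumes llm, parse_action, SYSTEM_PROMPT, and FEW_SHOT_EXAMPLES
# are defined elsewhere.
def agent_loop(question, tools, max_steps=10):
    context = SYSTEM_PROMPT + FEW_SHOT_EXAMPLES
    context += f"\nQuestion: {question}\n"

    for step in range(max_steps):
        # Generate next thought + action; stop before the model
        # can invent its own observation
        response = llm.generate(context, stop=["Observation:"])
        context += response

        # Parse action
        action = parse_action(response)
        if action.tool == "Finish":
            return action.args  # final answer

        # Execute tool; failures are fed back to the model as observations
        if action.tool not in tools:
            result = f"Error: unknown tool '{action.tool}'"
        else:
            try:
                result = tools[action.tool].execute(action.args)
            except Exception as e:
                result = f"Error: {e}"

        # Append observation
        context += f"\nObservation: {result}\n"

    # Step budget exhausted: one simple way to force a best-effort answer
    context += "\nThought: I am out of steps and must answer now.\nAction: Finish["
    return llm.generate(context, stop=["]"])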
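
To watch the loop above run end to end without a real model, the sketch below swaps in a scripted stand-in for the LLM. Everything in it is hypothetical scaffolding for illustration: `ScriptedLLM` replays canned responses, `parse_action` is a simple regex, and the two toy tools use made-up data. A real deployment would call an actual model API.

```python
import operator
import re


class ScriptedLLM:
    """Stand-in for a model: replays canned responses in order."""

    def __init__(self, responses):
        self._responses = iter(responses)

    def generate(self, context, stop=None):
        return next(self._responses)


def parse_action(response):
    """Return (tool, argument) from the last 'Action: Tool[arg]' line, or None."""
    found = re.findall(r"Action:\s*(\w+)\[(.*)\]", response)
    return found[-1] if found else None


OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def calculator(expr):
    """Toy calculator that only handles 'a op b'."""
    a, op, b = expr.split()
    return OPS[op](float(a), float(b))


DOCS = {"Q1 sales": "Toy data: 1200 units sold in Q1."}  # made-up corpus


def search(query):
    return DOCS.get(query, "No results.")


def agent_loop(question, llm, tools, max_steps=10):
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        response = llm.generate(context, stop=["Observation:"])
        context += response
        action = parse_action(response)
        if action is None:
            context += "\nObservation: Error: no action found.\n"
            continue
        tool, arg = action
        if tool == "Finish":
            return arg
        if tool not in tools:
            result = f"Error: unknown tool {tool!r}"
        else:
            try:
                result = tools[tool](arg)
            except Exception as e:  # report tool failures back to the model
                result = f"Error: {e}"
        context += f"\nObservation: {result}\n"
    return None  # step budget exhausted


script = [
    "Thought: I need the Q1 sales figure.\nAction: Search[Q1 sales]",
    "Thought: Now take 15% of it.\nAction: Calculator[1200 * 0.15]",
    "Thought: I have the answer.\nAction: Finish[180.0]",
]
answer = agent_loop("What is 15% of Q1 sales?", ScriptedLLM(script),
                    {"Search": search, "Calculator": calculator})
print(answer)  # prints 180.0 (the scripted Finish argument)
```

Because the responses are canned, this only demonstrates the control flow: parse, dispatch, append the observation, repeat. The reasoning itself would come from the model.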

Chapter 8: Limitations

The agent looks impressive. But what happens when it hallucinates a tool call that doesn't exist? What happens when it enters an infinite loop of searches that never converge? What happens when it calls the right tool with the wrong arguments? Agent failures are LLM failures amplified — because the model can now act, its mistakes have real consequences.

Failure Gallery

The simulation below shows five common failure modes. Click through each one to see the agent trace where things go wrong.

Agent Failure Modes

Click a failure mode to see an animated trace showing where and how the agent fails.

The Five Failure Modes, Explained

1. Hallucinated tool calls. The model generates a call to a tool that doesn't exist, like WeatherAPI("Tokyo") when only Search and Calculator are available. The model has seen API calls in its training data and hallucinates plausible-looking but nonexistent ones. This is especially common when the model is used with a different tool set than it was trained with.

2. Infinite loops. The model repeatedly searches for the same thing with slightly different wording, never making progress. "Search[population of France]" → observation doesn't contain the answer → "Search[France population 2024]" → same result → "Search[how many people in France]" → forever. Without a max-iteration limit, this runs until the context window fills up.

3. Wrong tool selection. The model uses a calculator when it should search, or searches when it should calculate. Example: "What is 15% of the US national debt?" — the model searches for "15% of national debt" instead of first searching for the debt amount, then calculating 15% of it.

4. Wrong result interpretation. The tool returns correct results, but the model misreads or misinterprets them. A search returns a Wikipedia article about Tokyo's population, but the model reads a number from the wrong row (metropolitan area vs. city proper) and produces an incorrect answer with high confidence.

5. Unsafe actions. The model calls tools with dangerous side effects. "Delete all files matching pattern *" or "Send email to all-company@..." or "Execute SQL: DROP TABLE users". In production, every tool call should be sandboxed and audited, with destructive actions requiring human confirmation.

Agent failures are LLM failures amplified. A hallucinated fact is annoying. A hallucinated API call can delete your database. This is why production agents have strict guardrails: tool allowlists (only pre-approved tools), argument validation (check types and ranges before execution), side-effect classification (read-only vs. write vs. destructive), and human-in-the-loop approval for high-stakes actions.
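
The guardrails named above can be combined into a single checkpoint that every tool call passes through. The sketch below is one possible arrangement, not a standard API: `ToolSpec`, `SideEffect`, `guarded_call`, and the `confirm` callback are all hypothetical names, and the validation covers only argument names and types (range checks would be added per tool).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict


class SideEffect(Enum):
    READ_ONLY = "read_only"
    WRITE = "write"
    DESTRUCTIVE = "destructive"


@dataclass
class ToolSpec:
    func: Callable[..., Any]
    arg_types: Dict[str, type]  # expected argument names and types
    side_effect: SideEffect = SideEffect.READ_ONLY


def guarded_call(registry: Dict[str, ToolSpec], name: str,
                 args: Dict[str, Any],
                 confirm: Callable[[str, Dict[str, Any]], bool]) -> Any:
    """Run a tool call only if it passes allowlist, argument, and approval checks."""
    # 1. Allowlist: unregistered (possibly hallucinated) tools are rejected.
    if name not in registry:
        raise PermissionError(f"tool {name!r} is not on the allowlist")
    spec = registry[name]
    # 2. Argument validation: exact argument names and expected types.
    if set(args) != set(spec.arg_types):
        raise ValueError(
            f"expected arguments {sorted(spec.arg_types)}, got {sorted(args)}"
        )
    for key, expected in spec.arg_types.items():
        if not isinstance(args[key], expected):
            raise TypeError(f"argument {key!r} must be {expected.__name__}")
    # 3. Human-in-the-loop: anything that is not read-only needs approval.
    if spec.side_effect is not SideEffect.READ_ONLY and not confirm(name, args):
        raise PermissionError(f"call to {name!r} was not approved")
    return spec.func(**args)


def fake_search(query):
    return f"results for {query}"


def fake_delete(path):
    return f"deleted {path}"  # toy stand-in; nothing is actually deleted


registry = {
    "search": ToolSpec(fake_search, {"query": str}),
    "delete_file": ToolSpec(fake_delete, {"path": str}, SideEffect.DESTRUCTIVE),
}
```

With this registry, a call to an unregistered name such as WeatherAPI fails at step 1, and a `delete_file` call runs only if `confirm` returns True.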

Mitigation Strategies

Failure Mode	Mitigation	Trade-off
Hallucinated APIs	Tool allowlist + schema validation	Limits flexibility; can't use new tools
Infinite loops	Max iteration limit + loop detection	May terminate before finding the answer
Wrong tool	Better tool descriptions + few-shot examples	Longer prompts consume context
Bad interpretation	Self-verification step ("Does this answer the question?")	Extra LLM call = latency + cost
Unsafe actions	Sandboxing + human-in-the-loop for writes	Slower; breaks autonomous operation
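
The loop-detection mitigation in the table above can be approximated by watching for observations that keep coming back unchanged. The sketch below is a deliberately simple heuristic under that assumption; `LoopDetector` is a hypothetical helper, and catching paraphrased queries that return different text would need something stronger (for example embedding similarity).

```python
from collections import Counter


class LoopDetector:
    """Flags a trace as stuck when the same tool result keeps recurring."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def record(self, tool: str, observation: str) -> bool:
        """Record one tool result.

        Returns True once the same (tool, observation) pair has been
        seen more than `max_repeats` times.
        """
        # Normalize case and whitespace so trivial differences do not hide repeats.
        key = (tool, " ".join(observation.lower().split()))
        self._seen[key] += 1
        return self._seen[key] > self.max_repeats
```

In an agent loop, `record` would be called after each tool execution. When it returns True, the orchestrator can stop early or inject an observation telling the model that its recent searches made no progress.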

The Open Problem: Robustness

Current agents work well on clean, well-structured tasks where the right tool and the right query are obvious. They struggle on tasks requiring:

Ambiguous queries: "Tell me about the bridge in that city" — which bridge? Which city? The model must ask clarifying questions rather than guessing.

Multi-step planning: Tasks requiring 10+ steps with dependencies between steps have rapidly compounding error rates. If each step succeeds with 90% probability and errors are independent, a 10-step task succeeds end to end only about 35% of the time.

Novel tool combinations: Using tools in creative ways the model hasn't seen in training. "Use the calculator output as a search query" or "Write code that calls the search API."
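
The compounding-error figure for multi-step planning follows from multiplying per-step success probabilities. As a worked equation, under the simplifying assumptions that steps fail independently and that any single failed step ruins the final answer:

```latex
P_{\text{success}}(n) = p^{\,n},
\qquad
P_{\text{success}}(10)\big|_{p=0.9} = 0.9^{10} \approx 0.349 \approx 35\%.
% Inverting: the per-step accuracy needed for 90% end-to-end success over 10 steps
p = 0.9^{1/10} \approx 0.9895.
```

Read in the other direction, a 10-step agent needs roughly 99% per-step accuracy just to match the 90% end-to-end reliability of a single step.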

These challenges are active research areas. The field is moving toward better planning algorithms (tree search over action sequences), self-reflection (the model critiques its own trace), and learned verifiers (a separate model checks the agent's work).

Why do agent failures have more serious consequences than standard LLM hallucinations?

Because agents can execute actions in the real world: a hallucinated API call can delete data, send emails, or make irreversible changes, unlike a hallucinated text response, which is merely incorrect

Because agent models are larger and therefore harder to fix

Because agents always produce wrong answers

Chapter 9: Connections

Agents sit at the top of the LLM capability stack. Everything we've covered in earlier lectures builds toward this: pretraining gives the model language understanding (L07), post-training aligns it with human preferences and teaches it to follow instructions (L08), efficient adaptation methods let us customize it for specific domains (L09), and now agents give it the ability to act in the world.

The LLM Capability Stack

How each lecture layer builds on the previous ones. Agents require all layers below.

Key Papers

Paper	Year	Contribution
RAG (Lewis et al.)	2020	Joint retriever-generator with end-to-end training. RAG-Token vs RAG-Sequence marginalization.
ReAct (Yao et al.)	2022	Interleaved reasoning and acting. Thought/Action/Observation traces. No fine-tuning needed.
Toolformer (Schick et al.)	2023	Self-supervised tool learning via perplexity filtering. Zero human labels.
DPR (Karpukhin et al.)	2020	Dense passage retrieval with dual BERT encoders. Foundation for neural retrieval.
WebGPT (Nakano et al.)	2021	LLM that browses the web. Trained with RLHF on browsing trajectories.

Connections to Other Lessons

Lesson	Connection
L08: Post-training	RLHF teaches tool selection preferences. SFT teaches tool call format.
L09: PEFT	LoRA lets you fine-tune agents for specific tool sets without full retraining.
L11: Evaluation	Agent evaluation is uniquely hard: must assess tool selection, reasoning quality, and final answer correctness.

The Frontier

The field is evolving rapidly. Current research directions include:

Multi-agent systems: Multiple specialized agents collaborating on a task. One agent plans, another searches, another writes code. Coordination is the hard problem.

Long-horizon planning: Agents that can plan and execute tasks spanning hours or days, maintaining state across sessions. This requires persistent memory beyond the context window.

Self-improvement: Agents that evaluate their own performance and improve their tool-use strategies over time. The agent writes better prompts for itself based on past failures.

Multimodal agents: Agents that can see (vision), hear (audio), and manipulate (robotics). Tool use extends from text APIs to physical actuators.

The progression: Pretraining (L07) teaches language. Post-training (L08) teaches instruction-following. PEFT (L09) teaches specialization. Agents (L10) teach action. Each layer depends on all layers below.