Giving language models the ability to search, calculate, and act in the world.
Ask GPT the weather right now. It can't. Ask it to multiply 7-digit numbers. It gets it wrong. Ask it who won last night's game. It doesn't know. LLMs are brains in jars — they can reason about the world but they cannot perceive it, search it, or act in it.
This is a fundamental architectural limitation, not a training bug. An LLM's knowledge is frozen at the time of its last training data cutoff. It cannot access the internet. It cannot run code. It cannot look up a fact in a database. Everything it "knows" must already be baked into its parameters — and those parameters are static after training.
Even within its training data, the model often memorizes facts rather than understanding them. Ask GPT-4 to multiply 4,738,291 by 8,234,107 and it will confidently produce a wrong answer. It doesn't have a calculator. It has a pattern-matching system that has seen multiplication examples during training and is doing its best to pattern-match a new one.
Three core failure modes motivate everything in this lesson:
1. Mathematical reasoning. LLMs generate text token by token. They don't have an ALU (arithmetic logic unit). Multi-digit arithmetic requires carrying, borrowing, and intermediate state — operations that don't naturally map to next-token prediction.
2. Temporal knowledge. The model's training data has a cutoff date. Any event after that date is unknowable. Even events before the cutoff may be misremembered or confused with similar events from different times.
3. Factual lookup. The model stores facts in its weights implicitly, as statistical patterns across billions of parameters. It can't look up a specific fact with certainty — it can only generate what's statistically likely given the prompt. This leads to hallucination: confident, fluent, wrong answers.
These are failures that tools can fix.
In each case, the fix is the same conceptually: let the model delegate to an external system that can actually do the job. A calculator for math. A search engine for current events. A database for factual lookup. The model's role shifts from "know everything" to "know when and how to ask for help."
An agent is an LLM that can take actions. Instead of just generating text, it can call tools — search engines, calculators, code interpreters, APIs — and incorporate the results into its reasoning. The key loop is: Think → Act → Observe → Repeat.
This lesson builds the agent stack from the bottom up. We start with retrieval-augmented generation (RAG) — the simplest form of "letting the model look things up." Then we add tool use — letting the model call functions. Then ReAct and the full agent loop — letting the model reason about what tools to call and when. By the end, you'll understand both how agents work and where they break.
What if the model could look things up before answering, instead of relying on memorization? This is the core idea of Retrieval-Augmented Generation (RAG): at inference time, retrieve relevant documents from an external knowledge base and condition the model's generation on them.
Think of it as the difference between a closed-book exam and an open-book exam. In a closed-book exam (standard LLM), you must have memorized every fact you need. In an open-book exam (RAG), you can flip to the relevant page before answering. The student still needs to understand the material — the book helps with specifics.
Closed-book (parametric knowledge only): The model stores all knowledge in its parameters. To answer "What is the population of Tokyo?", it must have encoded this fact during training. If the training data is old, the answer is old. If the fact is rare, the model may hallucinate a plausible but wrong number.
Open-book (parametric + retrieved knowledge): Before generating an answer, the model retrieves the top-k most relevant documents from a document store. These documents are concatenated with the query and fed to the generator. The model can now reference specific, up-to-date information rather than relying on stale memories.
The benefits are immediate and measurable. Lewis et al. (2020) showed that RAG-Sequence matched or exceeded the state of the art on Natural Questions, TriviaQA, and WebQuestions — all open-domain QA benchmarks — despite the generator being smaller than comparable closed-book models. The retriever offloads factual memory from the parameters to the document store.
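Every RAG system has three components: a document store (the external knowledge base, pre-encoded into a searchable index), a retriever (which scores documents against the query and returns the top-k), and a generator (which writes the answer conditioned on the query and the retrieved documents).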
You might think: if the model needs more knowledge, just train a bigger model. But this hits four walls:
| Problem | Bigger Model | RAG |
|---|---|---|
| New information | Must retrain (can cost millions of dollars) | Add docs and re-index (no retraining) |
| Memory footprint | Grows linearly with knowledge | Knowledge is external |
| Provenance | Can't cite sources | Can point to retrieved docs |
| Domain adaptation | Needs domain-specific fine-tuning | Swap in domain-specific corpus |
RAG also provides attributability: you can inspect which documents the model used, verify the sources, and debug wrong answers by checking whether the retriever found the right documents. With a closed-book model, wrong answers are opaque — you can't tell whether the model lacked the knowledge or had it but failed to use it.
Let's trace a concrete RAG query end to end. The user asks: "What is the tallest building in the world as of 2024?"
```python
# Assumes query_encoder, doc_embeddings (NumPy array, [num_docs, 768]),
# corpus (list of passages), and generator have already been built.

# Step 1: Encode the query
query = "What is the tallest building in the world as of 2024?"
q_emb = query_encoder(query)                 # shape: [768]

# Step 2: Retrieve top-k docs from index
scores = q_emb @ doc_embeddings.T            # shape: [num_docs]
top_k_ids = scores.argsort()[-5:][::-1]      # top 5 doc IDs, best first
docs = [corpus[i] for i in top_k_ids]

# Step 3: Concatenate query + docs as generator input
context = "\n\n".join(docs)
input_text = f"question: {query} context: {context}"

# Step 4: Generate answer conditioned on retrieved docs
answer = generator.generate(input_text)
# → "The Burj Khalifa in Dubai at 828m (2,717 ft)"
```
The key insight: the generator never needs to have memorized "Burj Khalifa." It reads the retrieved document and extracts the answer. The retriever does the knowledge work; the generator does the language work.
Not keyword matching — encode meaning into vectors. Traditional information retrieval systems like BM25 match documents to queries based on word overlap: if the query contains "tallest building," find documents containing those exact words. This works surprisingly well for many tasks, but it fails when the query and the answer use different words for the same concept.
Consider the query "What's the largest skyscraper?" and a document that says "The Burj Khalifa stands at 828 meters, making it the tallest structure ever built." BM25 might miss this because the query says "largest skyscraper" while the document says "tallest structure." A human instantly recognizes these mean the same thing. Dense retrieval bridges this gap by encoding both queries and documents as vectors in a shared embedding space, where semantic similarity is measured by vector distance.
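To make the vocabulary mismatch concrete, here is a minimal sketch that checks which content words the query and the document share. It is not BM25 itself (the tokenizer and the small stopword list are simplifying assumptions), but a BM25 score is a sum over query terms that appear in the document, so an empty overlap means nothing for BM25 to score.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "a", "an", "is", "it", "at", "s", "what"}

query = "What's the largest skyscraper?"
document = ("The Burj Khalifa stands at 828 meters, making it the "
            "tallest structure ever built.")

query_terms = tokens(query) - STOPWORDS
doc_terms = tokens(document) - STOPWORDS
shared = query_terms & doc_terms

print("query terms: ", sorted(query_terms))
print("shared terms:", sorted(shared))
```

Once stopwords are removed, the two texts have no word in common, even though the document answers the question.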
Dense Passage Retrieval (DPR), introduced by Karpukhin et al. (2020), uses two separate BERT encoders:
Query encoder $E_Q$: Takes the question as input, outputs a 768-dimensional vector representing the question's meaning. This encoder learns to represent "what information is needed" in vector space.
Document encoder $E_D$: Takes a document passage as input, outputs a 768-dimensional vector representing the passage's content. This encoder learns to represent "what information is available" in the same vector space.
Training objective: for a question-passage pair $(q, p^+)$ known to be relevant, maximize the dot product $E_Q(q) \cdot E_D(p^+)$ while minimizing it for irrelevant passages $p^-$. This pushes relevant pairs close together and irrelevant pairs apart in the shared space.
At inference time, the retriever encodes the query once, then performs a Maximum Inner Product Search (MIPS) over the pre-computed document embeddings. Libraries like FAISS can search billions of vectors in milliseconds using approximate nearest neighbor algorithms.
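As a reference point, exact MIPS is just "score every document, keep the best k". The sketch below does that in plain Python on toy 3-dimensional vectors (the vectors and the `mips` helper are made up for illustration). Libraries like FAISS replace this linear scan with approximate index structures, which is what makes very large collections practical.

```python
import heapq

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mips(query_vec, doc_vecs, k):
    """Exact maximum inner product search: returns the k best (score, doc_id) pairs."""
    scored = ((dot(query_vec, d), i) for i, d in enumerate(doc_vecs))
    return heapq.nlargest(k, scored)

# Toy 3-dimensional embeddings (real DPR vectors have 768 dimensions).
doc_vecs = [
    [0.9, 0.1, 0.0],  # doc 0
    [0.1, 0.8, 0.1],  # doc 1
    [0.7, 0.2, 0.1],  # doc 2
    [0.0, 0.1, 0.9],  # doc 3
]
query_vec = [1.0, 0.0, 0.0]

top2 = mips(query_vec, doc_vecs, k=2)
top_ids = [doc_id for _, doc_id in top2]
print(top_ids)
```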
DPR uses a contrastive loss similar to what you'd see in CLIP. For each question, there's one positive passage (the correct answer passage) and several negative passages. The loss for a single question $q_i$ with positive passage $p_i^+$ is:
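With $\mathrm{sim}(q, p) = E_Q(q) \cdot E_D(p)$ and $n$ negative passages $p_{i,1}^-, \dots, p_{i,n}^-$, the objective in Karpukhin et al. is the negative log-likelihood of the positive passage:

$$
L\left(q_i, p_i^+, p_{i,1}^-, \dots, p_{i,n}^-\right) = -\log \frac{e^{\mathrm{sim}(q_i,\, p_i^+)}}{e^{\mathrm{sim}(q_i,\, p_i^+)} + \sum_{j=1}^{n} e^{\mathrm{sim}(q_i,\, p_{i,j}^-)}}
$$

This is a softmax cross-entropy over one positive and $n$ negatives: the loss falls as the positive's score rises relative to the negatives.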
Where do the negatives come from? DPR uses three types:
| Negative Type | Source | Purpose |
|---|---|---|
| Random | Random passages from the corpus | Easy negatives to establish baseline separation |
| BM25 | Top BM25 results that don't contain the answer | Hard negatives that share keywords with the question but lack the answer |
| In-batch | Positive passages for other questions in the same batch | Efficient negatives: their embeddings are already computed, so they cost no extra encoding |
In-batch negatives are particularly clever: in a batch of 128 questions, each question's positive passage serves as a negative for the other 127 questions. This gives 127 free negatives per question without any extra encoder forward passes.
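A minimal sketch of the in-batch trick, assuming the similarity matrix has already been computed (in a real trainer it would be one matrix product between the batch of question embeddings and the batch of passage embeddings). The function name and the toy matrices are illustrative, not taken from the DPR codebase.

```python
import math

def in_batch_loss(sim):
    """Mean contrastive loss for a B x B similarity matrix.

    sim[i][j] is the dot product between question i and the positive passage
    of question j. The diagonal holds the positive pairs; every off-diagonal
    entry in row i acts as a negative for question i.
    """
    losses = []
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        losses.append(log_z - row[i])  # equals -log softmax(row)[i]
    return sum(losses) / len(losses)

# Each question scores its own passage highest: low loss.
good = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
# Each question scores someone else's passage highest: high loss.
bad = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0], [5.0, 0.0, 0.0]]

print(in_batch_loss(good), in_batch_loss(bad))
```

With a batch of B questions, a single B x B matrix supplies B - 1 negatives per question, which is where the "127 free negatives" figure comes from at B = 128.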
Dense retrieval isn't always better than BM25. The tradeoffs are clear:
| Dimension | BM25 (sparse) | DPR (dense) |
|---|---|---|
| Lexical overlap queries | Strong | Comparable |
| Semantic matching | Fails | Strong |
| Zero-shot (new domain) | Strong | Needs fine-tuning |
| Index size | Inverted index, compact | Dense vectors, ~3x larger |
| Search speed | ~1ms | ~10ms (with FAISS) |
In practice, many production systems use a hybrid approach: BM25 for initial candidate retrieval (fast, keyword-based), then a dense reranker to score the top candidates by semantic similarity. This combines the speed of sparse retrieval with the accuracy of dense matching.
The retriever learns to find documents that help THIS generator produce correct answers. This is the insight that distinguishes RAG from simple "retrieve then read" pipelines: the retriever and generator can be trained jointly, end-to-end, so the retriever learns what the generator needs.
Lewis et al. (2020) proposed two variants of RAG, both sharing the same pipeline but differing in how they marginalize over retrieved documents:
Step 1: Query encoding. The input question $x$ is encoded by the query encoder (BERT-based DPR) into a dense vector $q = E_Q(x)$. Shape: [768].
Step 2: MIPS retrieval. The query vector is used to retrieve the top-k documents (typically k=5 or k=10) from a FAISS index containing millions of pre-computed document embeddings. This takes ~10ms.
Step 3: Concatenation. Each retrieved document $d_i$ is concatenated with the original question: $\text{input}_i = [x; d_i]$. This creates k separate input sequences for the generator.
Step 4: Generation. The generator (BART-large) produces output tokens conditioned on each input. The final output marginalizes over the k documents.
Step 5: Marginalization. This is where the two variants differ.
The two variants differ in when they marginalize over retrieved documents:
RAG-Sequence: For each retrieved document, generate the entire output sequence. Then pick the sequence with the highest probability (marginalized over documents). Formally:
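In the notation of Lewis et al., where $p_\eta(z \mid x)$ is the retriever's probability for document $z$, $p_\theta$ is the generator, and $y_1, \dots, y_N$ are the output tokens, the RAG-Sequence likelihood is:

$$
p_{\text{RAG-Sequence}}(y \mid x) \approx \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})
$$

The product over tokens sits inside the sum over documents: one document is held fixed for the whole sequence.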
Each document produces one complete answer. The final answer is the one with the highest total probability across all documents. Think of it as: "Get five opinions, pick the best one."
RAG-Token: At each token position, marginalize over all k documents independently. Different tokens can be "sourced" from different documents. Formally:
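In the notation of Lewis et al., where $p_\eta(z \mid x)$ is the retriever's probability for document $z$, $p_\theta$ is the generator, and $y_1, \dots, y_N$ are the output tokens, the RAG-Token likelihood is:

$$
p_{\text{RAG-Token}}(y \mid x) \approx \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
$$

Here the sum over documents sits inside the product over tokens, so the document mixture is recomputed at every position.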
The sum and product are swapped. Each output token can draw from a different document. Think of it as: "For each word, consult all five sources and pick the best word."
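A toy calculation shows how much the order of sum and product can matter. The numbers below are hypothetical, and for simplicity each document's per-token probabilities are fixed for the target answer (in a real model they depend on the generated prefix).

```python
import math

# Hypothetical toy setup: k = 2 retrieved documents, a 2-token target answer.
retrieval_probs = [0.6, 0.4]  # p_eta(z | x) for each document
# token_probs[z][i] = p_theta(y_i | x, z, y_{1:i-1}) for the target tokens
token_probs = [
    [0.9, 0.1],  # document 0 supports token 1 but not token 2
    [0.1, 0.9],  # document 1 supports token 2 but not token 1
]

def rag_sequence(retrieval_probs, token_probs):
    """Sum over documents of p(z|x) times the product over tokens."""
    return sum(pz * math.prod(probs)
               for pz, probs in zip(retrieval_probs, token_probs))

def rag_token(retrieval_probs, token_probs):
    """Product over tokens of the sum over documents of p(z|x) * p(y_i|...)."""
    n_tokens = len(token_probs[0])
    return math.prod(
        sum(pz * probs[i] for pz, probs in zip(retrieval_probs, token_probs))
        for i in range(n_tokens)
    )

seq = rag_sequence(retrieval_probs, token_probs)
tok = rag_token(retrieval_probs, token_probs)
print(f"RAG-Sequence: {seq:.4f}  RAG-Token: {tok:.4f}")
```

When no single document supports the whole answer, RAG-Sequence assigns it low probability (0.09 here), while RAG-Token can combine evidence across documents (about 0.24).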
The generator (BART) is trained with standard cross-entropy loss on the target answer tokens. The retriever gradient flows through the document retrieval step via the retrieval probability $p_\eta(z \mid x)$. Updating the document encoder would mean re-encoding and re-indexing millions of documents as training proceeds, which is expensive, so Lewis et al. keep the document encoder and its index fixed and fine-tune only the query encoder and the generator. (REALM, an earlier retrieval-augmented model, instead refreshes its index periodically during pre-training.)
The training data is simply (question, answer) pairs — no document annotations needed. The retriever learns to find useful documents by backpropagating through the generation loss. If a document helps the generator produce the correct answer, the retriever learns to rank that document higher.
```python
# Simplified RAG-Token forward pass (PyTorch-style sketch).
# Assumes query_encoder, faiss_index, corpus, and generator are defined elsewhere,
# and that query_encoder returns a float32 NumPy vector.
import torch

def rag_token_forward(query, top_k=5):
    # Encode query
    q_emb = query_encoder(query)                                   # [768]

    # Retrieve top-k docs; FAISS expects a 2-D batch of queries
    doc_scores, doc_ids = faiss_index.search(q_emb.reshape(1, -1), top_k)
    docs = [corpus[i] for i in doc_ids[0]]
    retrieval_probs = torch.softmax(torch.as_tensor(doc_scores[0]), dim=0)  # [k]

    # Generate with each doc
    all_logits = []
    for doc in docs:
        input_text = query + " [SEP] " + doc
        logits = generator(input_text)                             # [seq_len, vocab]
        all_logits.append(logits)

    # Marginalize: for each token position,
    # weighted sum of probs across all docs
    all_logits = torch.stack(all_logits)                           # [k, seq_len, vocab]
    all_probs = torch.softmax(all_logits, dim=-1)
    # Reshape retrieval_probs to [k, 1, 1] so it broadcasts over positions and vocab
    marginal = (retrieval_probs[:, None, None] * all_probs).sum(dim=0)  # [seq_len, vocab]
    return marginal
```
What if the model could think out loud, take an action, observe the result, then think again? This is ReAct (Yao et al., 2022) — a prompting framework that interleaves Reasoning (chain-of-thought) with Acting (tool calls) in a single LLM generation.
Before ReAct, there were two separate approaches to improving LLM performance. Chain-of-thought (CoT) prompting lets the model "think step by step" before answering, improving reasoning accuracy. Action-based approaches let the model call external tools (search, lookup, calculate). ReAct's insight: combine both. Let the model alternate between thinking and acting.
A ReAct trace has three types of steps, repeated in a loop:
Thought: The model reasons about what it knows and what it needs to find out. This is internal — it's CoT reasoning. Example: "I need to find who directed Inception, then find that director's birth year."
Action: The model emits a structured action string that invokes an external tool. Example: "Search[Inception film director]". The available actions are defined in the prompt — typically Search[query], Lookup[keyword], and Finish[answer].
Observation: The tool returns its result, which is appended to the context. Example: "Inception is a 2010 film directed by Christopher Nolan." The model then reads this observation and decides what to do next.
Yao et al. compared these approaches on multi-hop question answering (HotpotQA) and fact verification (FEVER), prompting PaLM-540B:
| Method | HotpotQA (EM) | FEVER (Acc) | Traces Readable? |
|---|---|---|---|
| CoT (reason only, no tools) | 29.4 | 56.3 | Yes (but can hallucinate facts) |
| Act-only (tools, no reasoning) | 25.7 | 58.9 | No (black-box actions) |
| ReAct (reason + act) | 27.4 | 60.9 | Yes (grounded by observations) |
| ReAct combined with CoT self-consistency (best combination per task) | 35.1 | 64.6 | Yes (grounded by observations) |
The key finding: reasoning without acting hallucinates facts (CoT makes up details it could have looked up). Acting without reasoning makes wrong action choices (the model searches for the wrong thing because it didn't plan). ReAct alone beats Act-only on both tasks and beats CoT on FEVER, but trails CoT slightly on HotpotQA. The best results come from combining ReAct with CoT self-consistency and falling back from one to the other, so that reasoning guides the search and search grounds the reasoning.
ReAct is a pure prompting approach — no model fine-tuning required. You provide few-shot examples of Thought/Action/Observation traces in the prompt, and the model continues the pattern for new questions:
ReAct prompt structure
```text
Question: Were Scott Derrickson and Ed Wood of the same nationality?
Thought 1: I need to find the nationality of Scott Derrickson and Ed Wood, then compare.
Action 1: Search[Scott Derrickson]
Observation 1: Scott Derrickson (born July 16, 1966) is an American filmmaker.
Thought 2: Scott Derrickson is American. Now I need to find Ed Wood's nationality.
Action 2: Search[Ed Wood]
Observation 2: Edward Davis Wood Jr. was an American filmmaker.
Thought 3: Both are American, so yes, they share the same nationality.
Action 3: Finish[yes]
```
The model learns to follow this pattern from the examples. It generates "Thought N:" to reason, "Action N:" to call a tool, and then the system injects "Observation N:" with the tool's output before letting the model continue.
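On the system side, the first job is to pull the action out of the model's text. Here is a minimal sketch of such a parser, assuming the `Action N: Tool[argument]` format shown in the trace. The `Action` tuple and the regular expression are illustrative choices, not part of the ReAct paper.

```python
import re
from typing import NamedTuple, Optional

class Action(NamedTuple):
    tool: str
    argument: str

# Matches lines such as "Action 2: Search[Ed Wood]".
ACTION_RE = re.compile(r"^Action\s*\d*:\s*(\w+)\[(.*)\]\s*$", re.MULTILINE)

def parse_action(model_output: str) -> Optional[Action]:
    """Return the last Action in the model output, or None if there is none."""
    matches = ACTION_RE.findall(model_output)
    if not matches:
        return None
    tool, argument = matches[-1]
    return Action(tool, argument.strip())

output = (
    "Thought 2: Scott Derrickson is American. Now I need to find "
    "Ed Wood's nationality.\n"
    "Action 2: Search[Ed Wood]"
)
print(parse_action(output))
```

Returning `None` for unparseable output lets the caller feed a format reminder back to the model instead of crashing.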
One of ReAct's strengths is error recovery. If a search returns no results or irrelevant information, the model can recognize this in its next Thought step and reformulate the query. For example:
error recovery
```text
Thought 3: The search didn't return the birth year. Let me try a more specific query.
Action 3: Search[Christopher Nolan birth date]
Observation 3: Christopher Edward Nolan was born on 30 July 1970 in London.
Thought 4: Found it. Christopher Nolan was born in 1970.
```
This self-correcting behavior emerges from the few-shot examples — if you include examples where the agent recovers from bad searches, the model learns to do the same.
The model writes "CALC(347 × 892)". A calculator returns 309,524. The model continues with the correct number. This is tool use in its simplest form: the LLM emits a special token sequence that triggers an external computation, and the result is injected back into the generation.
The concept is powerful because it plays to each component's strengths. The LLM excels at understanding natural language, decomposing problems, and knowing when a calculation or lookup is needed. The tool excels at the actual computation. Together they achieve what neither can alone.
At the implementation level, tool use follows a simple protocol:
1. Generation pause. The model generates tokens normally until it produces a special tool-call token or pattern (e.g., [CALC], <tool_call>, or a JSON block). Generation pauses.
2. Argument extraction. The system parses the tool name and arguments from the generated tokens. For example, CALC(347 * 892) → tool="calculator", args="347 * 892".
3. Tool execution. The external tool runs with the provided arguments. A calculator evaluates the expression. A search engine queries the web. A Python interpreter runs the code.
4. Result injection. The tool's output is inserted into the context, typically formatted as [RESULT: 309524]. The model then continues generating from this point, with access to the correct result.
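The four steps can be sketched in a few lines. This version post-processes a finished string rather than pausing a live decoder, and the `CALC(...)` syntax, the `[RESULT: ...]` format, and all function names are assumptions carried over from the example above. The calculator walks the parsed expression instead of calling `eval()`, so model-written text is never executed as code.

```python
import ast
import operator
import re

# Hypothetical inline syntax, following the CALC(...) example above.
# Nested parentheses inside the call are not supported by this pattern.
CALL_RE = re.compile(r"CALC\(([^)]*)\)")

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
}

def safe_eval(expression: str):
    """Evaluate +, -, *, / over numeric literals without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval").body)

def run_tool_calls(generated: str) -> str:
    """Follow each CALC(...) in the text with its computed result."""
    def inject(match):
        result = safe_eval(match.group(1))              # steps 2 and 3
        return f"{match.group(0)} [RESULT: {result}]"   # step 4
    return CALL_RE.sub(inject, generated)

print(run_tool_calls("The total is CALC(347 * 892)"))
```

In a live system the decoder would stop as soon as the call closes, and generation would resume with the result already in the context.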
| Tool | Input | Output | Fixes |
|---|---|---|---|
| Calculator | Math expression | Numeric result | Arithmetic errors |
| Web search | Search query | Top results + snippets | Knowledge cutoff |
| Code interpreter | Python code | stdout + return value | Complex logic, data processing |
| Knowledge base | Structured query | Matching records | Domain-specific lookup |
| Calendar/API | API call parameters | API response | Real-world actions |
Different systems use different formats for tool calls. The trend is toward structured JSON:
OpenAI function calling format
```json
{
  "tool_calls": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "arguments": "{\"location\": \"Tokyo\", \"unit\": \"celsius\"}"
    }
  }]
}
```
Anthropic tool use format
```json
{
  "type": "tool_use",
  "name": "calculator",
  "input": {"expression": "347 * 892"}
}
```
The model is trained (via SFT and RLHF) to produce these structured outputs when appropriate. The key engineering decisions:
Tool descriptions: Each tool comes with a schema (name, description, parameters, types). The model sees these descriptions in its system prompt and learns to match tasks to available tools.
Parallel vs. sequential calls: Some systems allow the model to request multiple tool calls in a single turn (e.g., search for two things simultaneously). Others require one call at a time with observation between calls.
Error handling: When a tool call fails (invalid arguments, timeout, API error), the error message is returned as the observation. Well-trained models learn to retry with corrected arguments or try a different approach.
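A minimal sketch of how an orchestrator might combine tool descriptions with error handling: validate the call against a schema, run it, and return any failure as plain text the model can read. The `TOOLS` registry, its simplified type-only schema, and the stubbed weather function are all hypothetical. Production systems typically use full JSON Schema.

```python
import json

# Hypothetical registry: tool name -> required argument types and implementation.
TOOLS = {
    "get_weather": {
        "params": {"location": str, "unit": str},
        "run": lambda location, unit: f"18 degrees {unit} in {location} (stub value)",
    },
}

def execute_tool_call(name: str, raw_arguments: str) -> str:
    """Validate a tool call and run it; errors come back as observation text."""
    tool = TOOLS.get(name)
    if tool is None:
        return f"Error: unknown tool {name!r}. Available: {sorted(TOOLS)}"
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return f"Error: arguments are not valid JSON ({exc.msg})"
    if not isinstance(args, dict):
        return "Error: arguments must be a JSON object"
    for param, expected_type in tool["params"].items():
        if param not in args:
            return f"Error: missing argument {param!r}"
        if not isinstance(args[param], expected_type):
            return f"Error: argument {param!r} must be {expected_type.__name__}"
    unexpected = set(args) - set(tool["params"])
    if unexpected:
        return f"Error: unexpected arguments {sorted(unexpected)}"
    try:
        return tool["run"](**args)
    except Exception as exc:  # tool failures also become observations
        return f"Error: {exc}"

print(execute_tool_call("get_weather", '{"location": "Tokyo", "unit": "celsius"}'))
print(execute_tool_call("get_weather", '{"location": "Tokyo"}'))
```

Because every failure is returned as a string rather than raised, the model sees the error in its next turn and can retry with corrected arguments.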
What if the model figured out on its own when tools would help, with no human labels? Toolformer (Schick et al., 2023) does exactly this: it uses the language model's own loss to decide where tool calls should be inserted into training data, then fine-tunes the model on this self-annotated data.
The key insight: if inserting a tool call at a particular position in the text reduces the model's perplexity on subsequent tokens, that tool call is useful. If it doesn't help (or hurts), discard it. The LM loss itself becomes the label.
Toolformer works in five steps, all automated:
Step 1: Sample candidate positions. For each position in the training text, ask the model: "Would a tool call be helpful here?" Use a few-shot prompt to generate candidate tool calls at promising positions. Not every position gets a candidate — only positions where the model, prompted with examples of tool usage, generates a tool call.
Step 2: Generate candidate calls. At each sampled position, generate several candidate tool calls using the LLM itself. For a calculator tool, this might produce [CALC(15 * 7.3)]. For a search tool, [SEARCH("population of France")].
Step 3: Execute the calls. Run each candidate tool call to get the result. [CALC(15 * 7.3)] → 109.5.
Step 4: Compare perplexity. For each candidate, measure the model's perplexity on the tokens after the tool call position in two conditions: (a) with the tool call and result inserted, and (b) without. If condition (a) has significantly lower perplexity, the tool call is useful.
Step 5: Keep or discard. Only tool calls that reduce perplexity by at least a threshold τ are kept. These are inserted into the training data. The model is then fine-tuned on this augmented data, learning to generate tool calls at appropriate positions during normal text generation.
Formally, let $x$ be the text, $i$ be the candidate position, $c$ be the tool call, $r$ be the tool result, and $L_i$ be the loss on tokens after position $i$. The filtering criterion is:
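Following Schick et al., write $e(c, r)$ for the call with its result inserted as a prefix, $e(c, \varepsilon)$ for the call with the result withheld, and $\varepsilon$ for no call at all. Then define

$$
L_i^+ = L_i\big(e(c, r)\big), \qquad L_i^- = \min\Big(L_i(\varepsilon),\; L_i\big(e(c, \varepsilon)\big)\Big)
$$

and keep the call only if

$$
L_i^- - L_i^+ \ge \tau .
$$

In the paper, $L_i$ is a weighted cross-entropy over the tokens from position $i$ onward, so a lower value means the model predicts the continuation better.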
We compare against the better of two baselines: no tool call at all, and an "empty call" (the API call with its result withheld). The second baseline ensures the benefit comes from the tool's result, not from the tool call syntax acting as a formatting cue.
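As code, the keep-or-discard decision is a one-line comparison once the three losses have been measured. The function below is a sketch with hypothetical argument names. Computing the losses themselves requires running the language model and is not shown.

```python
def keep_tool_call(loss_with_result: float,
                   loss_without_call: float,
                   loss_without_result: float,
                   tau: float) -> bool:
    """Toolformer-style filter: keep a call only if its result lowers the loss.

    All three losses are measured on the tokens after the candidate position.
    The baseline is the better (lower) of "no call" and "call without result".
    """
    baseline = min(loss_without_call, loss_without_result)
    return baseline - loss_with_result >= tau

# The result cuts the loss from 2.2 to 1.0: keep (threshold 1.0).
print(keep_tool_call(1.0, 2.5, 2.2, tau=1.0))
# The result barely helps (2.2 to 2.0): discard.
print(keep_tool_call(2.0, 2.5, 2.2, tau=1.0))
```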
Schick et al. tested Toolformer with five tools on a 6.7B parameter GPT-J model:
| Tool | Task | Without Tool | With Toolformer | Improvement (percentage points) |
|---|---|---|---|---|
| Calculator | Math (ASDiv) | 43.3% | 76.6% | +33.3 |
| Wikipedia search | QA (TempLAMA) | 21.0% | 40.5% | +19.5 |
| Calendar | Date tasks | 36.0% | 86.4% | +50.4 |
| Machine translation | MLQA | 26.2% | 32.1% | +5.9 |
| QA system | Knowledge QA | 34.1% | 40.8% | +6.7 |
The calculator and calendar tools provide the largest improvements because they address tasks where the LLM is fundamentally limited: exact arithmetic and temporal reasoning. Search and QA tools help less because the model already has substantial factual knowledge from pretraining — tools mainly help for facts that are rare or post-cutoff.
Toolformer has important constraints. It can only use tools that can be expressed as text-in, text-out APIs. It can only use one tool per position (no chaining). And the self-supervised filtering is noisy — some useful tool calls are discarded, and some useless ones slip through. Modern systems (GPT-4, Claude) use a different approach: tool use is baked into the RLHF training process with human feedback on when tools should be called.
You give the agent a question. Watch it think, search, calculate, and arrive at the answer. This is the showcase chapter — a full agent simulator that combines everything from the previous chapters: retrieval, tool use, and the ReAct reasoning loop.
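A real agent orchestrator works like this: the LLM receives a system prompt listing its available tools (search, calculator, Python, knowledge base), a set of few-shot examples showing the Thought/Action/Observation format, and the user's question. It then enters a loop: generate a Thought and an Action, execute the requested tool, append the result as an Observation, and repeat until the model emits Finish or a step limit is reached.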
Real production agents (like those powering ChatGPT, Claude, and Gemini) use a more sophisticated version of this loop. Key engineering decisions:
Tool selection: The model sees a schema of all available tools in its system prompt. It must select the right tool, provide correctly-formatted arguments, and handle errors gracefully. This is trained during RLHF — the model learns from human preferences about which tool calls are appropriate.
Context management: As the T/A/O trace grows, the context window fills up. Production systems manage this by summarizing earlier steps, truncating irrelevant observations, or using sliding-window approaches.
Safety guardrails: Before executing a tool call, the system checks: Is this action safe? Does it access sensitive data? Could it have irreversible side effects? High-risk actions (sending emails, modifying databases) may require user confirmation.
Max iterations: A hard limit on loop iterations prevents infinite loops. If the agent hasn't produced a Finish action after N steps (typically 5-15), it's forced to produce a best-effort answer from what it has.
```python
# Simplified agent loop.
# Assumes llm, parse_action, SYSTEM_PROMPT, and FEW_SHOT_EXAMPLES are defined elsewhere.
def agent_loop(question, tools, max_steps=10):
    context = SYSTEM_PROMPT + FEW_SHOT_EXAMPLES
    context += f"\nQuestion: {question}\n"

    for step in range(max_steps):
        # Generate next thought + action; stop before the model invents an observation
        response = llm.generate(context, stop=["Observation:"])
        context += response

        # Parse action
        action = parse_action(response)
        if action.tool == "Finish":
            return action.args  # final answer

        # Execute tool; unknown tools and tool errors become observations
        if action.tool not in tools:
            result = f"Error: unknown tool {action.tool!r}"
        else:
            try:
                result = tools[action.tool].execute(action.args)
            except Exception as e:
                result = f"Error: {e}"

        # Append observation
        context += f"\nObservation: {result}\n"

    return "Failed to reach answer in max steps."
```
The agent looks impressive. But what happens when it hallucinates a tool call that doesn't exist? What happens when it enters an infinite loop of searches that never converge? What happens when it calls the right tool with the wrong arguments? Agent failures are LLM failures amplified — because the model can now act, its mistakes have real consequences.
Five failure modes are especially common:
1. Hallucinated tool calls. The model generates a call to a tool that doesn't exist, like WeatherAPI("Tokyo") when only Search and Calculator are available. The model has seen API calls in its training data and hallucinates plausible-looking but nonexistent ones. This is especially common when the model is used with a different tool set than it was trained with.
2. Infinite loops. The model repeatedly searches for the same thing with slightly different wording, never making progress. "Search[population of France]" → observation doesn't contain the answer → "Search[France population 2024]" → same result → "Search[how many people in France]" → forever. Without a max-iteration limit, this runs until the context window fills up.
3. Wrong tool selection. The model uses a calculator when it should search, or searches when it should calculate. Example: "What is 15% of the US national debt?" — the model searches for "15% of national debt" instead of first searching for the debt amount, then calculating 15% of it.
4. Wrong result interpretation. The tool returns correct results, but the model misreads or misinterprets them. A search returns a Wikipedia article about Tokyo's population, but the model reads a number from the wrong row (metropolitan area vs. city proper) and produces an incorrect answer with high confidence.
5. Unsafe actions. The model calls tools with dangerous side effects. "Delete all files matching pattern *" or "Send email to all-company@..." or "Execute SQL: DROP TABLE users". In production, every tool call should be sandboxed and audited, with destructive actions requiring human confirmation.
| Failure Mode | Mitigation | Trade-off |
|---|---|---|
| Hallucinated APIs | Tool allowlist + schema validation | Limits flexibility; can't use new tools |
| Infinite loops | Max iteration limit + loop detection | May terminate before finding the answer |
| Wrong tool | Better tool descriptions + few-shot examples | Longer prompts consume context |
| Bad interpretation | Self-verification step ("Does this answer the question?") | Extra LLM call = latency + cost |
| Unsafe actions | Sandboxing + human-in-the-loop for writes | Slower; breaks autonomous operation |
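Several of these mitigations fit in one pre-execution check. The sketch below combines a tool allowlist, a simple repeat detector, and a confirmation gate for tools with side effects. The tool names, the `confirm` callback, and the `max_repeats` default are all hypothetical. The repeat detector only catches calls that are identical after normalizing case and whitespace. Catching reworded searches would need a fuzzier comparison.

```python
from typing import Callable, Optional

ALLOWED_TOOLS = {"Search", "Calculator", "SendEmail"}  # hypothetical tool names
NEEDS_CONFIRMATION = {"SendEmail"}                      # tools with side effects

def normalize(args: str) -> str:
    """Collapse case and whitespace so trivially repeated calls compare equal."""
    return " ".join(args.lower().split())

def check_action(tool: str, args: str, history: list[tuple[str, str]],
                 confirm: Callable[[str, str], bool],
                 max_repeats: int = 2) -> Optional[str]:
    """Return None if the call may run, else an error string for the agent."""
    if tool not in ALLOWED_TOOLS:
        return f"Error: unknown tool {tool!r}. Available: {sorted(ALLOWED_TOOLS)}"
    key = (tool, normalize(args))
    if history.count(key) >= max_repeats:
        return "Error: this call was already made; try a different approach."
    if tool in NEEDS_CONFIRMATION and not confirm(tool, args):
        return "Error: the user declined this action."
    history.append(key)
    return None

history: list[tuple[str, str]] = []
always_yes = lambda tool, args: True
print(check_action("WeatherAPI", "Tokyo", history, always_yes))
print(check_action("Search", "population of France", history, always_yes))
```

Returning the error as text keeps the agent in the loop: it sees why the call was blocked and can choose a different action.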
Current agents work well on clean, well-structured tasks where the right tool and the right query are obvious. They struggle on tasks requiring:
Ambiguous queries: "Tell me about the bridge in that city" — which bridge? Which city? The model must ask clarifying questions rather than guessing.
Multi-step planning: Tasks requiring 10+ steps with dependencies between steps have rapidly compounding error rates. If each step has 90% accuracy, 10 steps yield 35% end-to-end accuracy.
Novel tool combinations: Using tools in creative ways the model hasn't seen in training. "Use the calculator output as a search query" or "Write code that calls the search API."
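The compounding-error figure under multi-step planning is easy to check. Assuming each step succeeds independently with the same probability (a simplification, since real agent steps are correlated and can sometimes recover from mistakes), end-to-end success is the per-step accuracy raised to the number of steps:

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps

for steps in (1, 5, 10, 20):
    print(steps, round(end_to_end_success(0.9, steps), 3))
```

At 90% per step, 10 steps give roughly 0.35 and 20 steps roughly 0.12, which is why long-horizon tasks demand either much higher per-step reliability or explicit error recovery.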
These challenges are active research areas. The field is moving toward better planning algorithms (tree search over action sequences), self-reflection (the model critiques its own trace), and learned verifiers (a separate model checks the agent's work).
Agents sit at the top of the LLM capability stack. Everything we've covered in earlier lectures builds toward this: pretraining gives the model language understanding (L07), post-training aligns it with human preferences and teaches it to follow instructions (L08), efficient adaptation methods let us customize it for specific domains (L09), and now agents give it the ability to act in the world.
How each lecture layer builds on the previous ones. Agents require all layers below.
| Paper | Year | Contribution |
|---|---|---|
| RAG (Lewis et al.) | 2020 | Joint retriever-generator with end-to-end training. RAG-Token vs RAG-Sequence marginalization. |
| ReAct (Yao et al.) | 2022 | Interleaved reasoning and acting. Thought/Action/Observation traces. No fine-tuning needed. |
| Toolformer (Schick et al.) | 2023 | Self-supervised tool learning via perplexity filtering. Zero human labels. |
| DPR (Karpukhin et al.) | 2020 | Dense passage retrieval with dual BERT encoders. Foundation for neural retrieval. |
| WebGPT (Nakano et al.) | 2021 | LLM that browses the web. Trained with RLHF on browsing trajectories. |
| Lesson | Connection |
|---|---|
| L08: Post-training | RLHF teaches tool selection preferences. SFT teaches tool call format. |
| L09: PEFT | LoRA lets you fine-tune agents for specific tool sets without full retraining. |
| L11: Evaluation | Agent evaluation is uniquely hard: must assess tool selection, reasoning quality, and final answer correctness. |
The field is evolving rapidly. Current research directions include:
Multi-agent systems: Multiple specialized agents collaborating on a task. One agent plans, another searches, another writes code. Coordination is the hard problem.
Long-horizon planning: Agents that can plan and execute tasks spanning hours or days, maintaining state across sessions. This requires persistent memory beyond the context window.
Self-improvement: Agents that evaluate their own performance and improve their tool-use strategies over time. The agent writes better prompts for itself based on past failures.
Multimodal agents: Agents that can see (vision), hear (audio), and manipulate (robotics). Tool use extends from text APIs to physical actuators.