A chatbot can only talk. An agent can do things — search the web, run code, call APIs, write files. This is how language models become general-purpose workers.
You ask a chatbot: "What's the weather in Tokyo right now?" It says: "I'm sorry, I don't have access to real-time data." You ask it to send an email. Same answer. You ask it to check your database. Same answer. A chatbot is a very sophisticated text predictor — but it lives in a cage. It can describe the world but it cannot touch it.
Now imagine giving it a phone. Not just words — actual phone calls it can make. "Call the weather API." It calls it. Gets the response. Reads it. Incorporates it. Answers you. That's an agent: a language model equipped with tools — functions it can invoke — that let it act in the real world.
This is not science fiction. Every major AI API (OpenAI, Anthropic, Google, Mistral) now has built-in tool-use support. Agents are used today for: customer support bots that actually look up orders, coding assistants that run tests, research assistants that search and synthesize, and infrastructure bots that query and update databases.
Before we get to the machinery, let's feel the difference viscerally. Below: a chatbot vs. an agent faced with the same task. Watch what happens.
Click Run Task to watch both handle "What is 847 × 293 and is the result a prime number?" The chatbot guesses. The agent acts.
The chatbot might get 847 × 293 wrong (LLMs are notoriously unreliable at multi-digit arithmetic). The agent calls a calculator tool, gets the exact answer (248,171), then calls an is_prime tool, gets false, and answers confidently. The task is simple, but the lesson is profound: tools compensate for the model's weaknesses.
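The two tools in that demo are small enough to sketch. The function names multiply and is_prime below are illustrative assumptions, not the demo's actual implementation; the point is only that plain deterministic code gives the exact answer the model can merely guess at.

```python
def multiply(a: int, b: int) -> int:
    """Hypothetical calculator tool: exact integer multiplication."""
    return a * b


def is_prime(n: int) -> bool:
    """Hypothetical primality tool using trial division up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


product = multiply(847, 293)
print(product, is_prime(product))  # prints: 248171 False
```

The result cannot be prime, since it was built as a product of two numbers greater than 1, and the tool confirms it.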
The model cannot directly call a function. It doesn't run code — it predicts tokens. So how does tool use work? Simple: the model outputs a specially formatted chunk of text that the host software recognizes as a function call. The host runs the function, gets the result, and feeds it back as a new message. The model never left the text domain — the host software bridges the gap.
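The host side of that bridge can be sketched in a few lines. This is a minimal illustration, not any vendor's wire format: the JSON shape, the TOOLS registry, and the handle_model_output helper are all assumptions made up for the example.

```python
import json


def get_weather(city: str) -> dict:
    # Stand-in for a real weather API call.
    return {"city": city, "temp": 22, "condition": "cloudy"}


# Hypothetical tool registry: maps tool names to ordinary Python functions.
TOOLS = {"get_weather": get_weather}


def handle_model_output(text: str) -> dict:
    """Treat model output as a tool call if it parses as one; otherwise as plain text."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return {"kind": "text", "content": text}
    if not isinstance(payload, dict) or payload.get("type") != "tool_use":
        return {"kind": "text", "content": text}
    name = payload.get("name")
    fn = TOOLS.get(name) if isinstance(name, str) else None
    if fn is None:
        return {"kind": "tool_result", "content": {"error": f"unknown tool: {name}"}}
    # The host, not the model, actually runs the function.
    return {"kind": "tool_result", "content": fn(**payload.get("input", {}))}


model_output = '{"type": "tool_use", "name": "get_weather", "input": {"city": "Tokyo"}}'
print(handle_model_output(model_output))
print(handle_model_output("Hello! How can I help?"))
```

Whatever comes back with kind "tool_result" is what the host would append to the conversation as a new message for the model to read.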
Modern APIs make this seamless. You pass a list of tool definitions alongside your prompt. The model can either respond normally (just text) or respond with a tool call — a structured JSON object naming the function and its arguments. The API handles parsing; you never see raw JSON in normal use.
A tool call looks like this: {"type": "tool_use", "name": "get_weather", "input": {"city": "Tokyo"}}. Three things: what type of response (tool_use), which function (name), and what to pass it (input). That's it. The model fills these in; you execute them. In the Anthropic API the block also carries an id, which links the result you send back to the call that requested it.
Let's trace a complete round-trip with the Anthropic API. The user asks "What's the weather in Tokyo?" The model responds not with text but with a tool call. We execute get_weather("Tokyo"), get {"temp": 22, "condition": "cloudy", "humidity": 68}, and send that back as a tool_result message. The model then generates a natural-language answer incorporating the real data.
```python
import json

import anthropic

client = anthropic.Anthropic()

# Define what tools are available
tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"]
    }
}]

# Turn 1: user asks, model responds with a tool call
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
)
# response.stop_reason == "tool_use": the model wants to call a tool.
# A text block may come first, so pick the tool_use block by type
# instead of assuming it sits at index 0.
tool_call = next(b for b in response.content if b.type == "tool_use")
# tool_call.name == "get_weather"
# tool_call.input == {"city": "Tokyo"}

# We execute the function ourselves
def get_weather(city):
    # In reality: call a weather API
    return {"temp": 22, "condition": "cloudy", "humidity": 68}

result = get_weather(**tool_call.input)

# Turn 2: feed result back, model produces final answer
final = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"},
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": [{
            "type": "tool_result",
            "tool_use_id": tool_call.id,
            "content": json.dumps(result)  # send valid JSON, not a Python repr
        }]}
    ]
)
# final.content[0].text might read:
# "Tokyo is currently 22°C and cloudy with 68% humidity."
```
When stop_reason == "end_turn", the model is done talking. When stop_reason == "tool_use", it's waiting for you to run a function and report back. Your agent loop checks this flag on every response.
Watch the messages flow between the user, model, and tool executor. Click Step to advance one message at a time.
When stop_reason == "tool_use", what should your code do next?
The model can only call tools it knows about. You introduce tools via their definitions: a JSON schema describing the tool's name, what it does, and what parameters it accepts. The model reads these definitions as part of its context and decides when and how to use them.
Here is the surprising truth: the description matters more than the code. The model never sees your Python implementation. It only sees the name, description, and parameter schema. If your description is vague, the model won't know when to call the tool or what to pass. If it's precise, the model will use the tool exactly as intended.
Consider a tool named search with the description "searches things": the model doesn't know what it searches, when to use it vs. a database query, or what format the query should be in. The best implementation in the world won't help if the model can't figure out when to reach for it.
A complete tool definition has four parts: (1) name: a short, unique identifier the model uses to reference the tool; (2) description: a paragraph explaining what the tool does, when to use it, and what it returns; (3) input_schema: a JSON Schema defining each parameter's type, description, and whether it's required; (4) implicitly, the output contract: describe in the tool description what the return value looks like.
```python
# POOR tool definition: the model will misuse this
bad_tool = {
    "name": "search",
    "description": "searches things",
    "input_schema": {
        "type": "object",
        "properties": {"q": {"type": "string"}},
        "required": ["q"]
    }
}

# GOOD tool definition: precise contract
good_tool = {
    "name": "web_search",
    "description": (
        "Search the web for current information not in your training data. "
        "Use for: recent news, current prices, live sports scores, today's weather, "
        "or any fact that may have changed after 2024. Returns a list of up to 5 "
        "results, each with 'title', 'url', and 'snippet' fields. "
        "Prefer specific queries over vague ones."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The search query. Use natural language or keywords."
            },
            "num_results": {
                "type": "integer",
                "description": "Number of results to return (1-5). Default 3.",
                "default": 3
            }
        },
        "required": ["query"]
    }
}
```
Parameter schemas follow JSON Schema. The types you'll use 90% of the time: string, integer, number, boolean, array, object. Mark required parameters in the required array. Use enum to constrain to specific values.
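To make required and enum concrete, here is a sketch of a tool definition that uses both, plus a deliberately tiny argument checker. The tool name get_forecast, its parameters, and the check_args helper are hypothetical; a real project would more likely use a full JSON Schema validator than this hand-rolled subset.

```python
# Hypothetical tool definition showing `required` and `enum`.
forecast_tool = {
    "name": "get_forecast",
    "description": "Get a weather forecast for a city. Returns a dict with 'temp' and 'condition'.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Tokyo'"},
            "units": {"type": "string", "enum": ["celsius", "fahrenheit"],
                      "description": "Temperature units"},
            "days": {"type": "integer", "description": "Days ahead (1-7)"},
        },
        "required": ["city", "units"],
    },
}

# Maps the JSON Schema type names used above to Python types.
PY_TYPES = {"string": str, "integer": int, "number": (int, float),
            "boolean": bool, "array": list, "object": dict}


def check_args(schema: dict, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the arguments pass this subset check."""
    problems = []
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required parameter: {name}")
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unknown parameter: {name}")
            continue
        # bool is a subclass of int in Python, so reject it explicitly for non-boolean types.
        wrong_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if wrong_bool or not isinstance(value, PY_TYPES[spec["type"]]):
            problems.append(f"{name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: must be one of {spec['enum']}")
    return problems


print(check_args(forecast_tool["input_schema"], {"city": "Tokyo", "units": "kelvin"}))
print(check_args(forecast_tool["input_schema"], {"city": "Tokyo", "units": "celsius", "days": 3}))
```

Returning the problems as text, rather than raising, makes it easy to send them back to the model as a tool result so it can correct the call.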
Explore different tool quality levels. See how a vague vs. precise definition changes what the model understands about the tool.
| Element | Purpose | Common mistake |
|---|---|---|
| name | Model uses this to call the tool | Too generic (tool1), or inconsistent naming style |
| description | Tells the model WHEN and WHY to use it | Too short, missing return value format |
| param description | Tells the model WHAT to pass | Omitted entirely, so the model has to guess |
| required | Prevents model from omitting critical args | All params optional, so the model skips them |
| enum | Constrains to valid values | Free string when only 3 options exist |
An agent that just calls tools is a fancy function dispatcher. What makes agents powerful is reasoning between actions. The ReAct pattern (Reasoning + Acting) alternates explicit thought steps with tool calls: the model first writes what it's thinking, then decides what to do, then observes the result, then thinks again.
Why does this help? Two reasons. First, thinking out loud helps the model plan — it can notice when a previous step's result changes what it should do next. Second, you (the developer) can read the thoughts and debug what went wrong. Pure tool-calling with no reasoning is a black box. ReAct gives you a trace.
Here is a complete ReAct trace for the task: "How many days until the next US federal holiday from today (May 22, 2026)?"
```python
# ReAct-style loop: the system prompt asks the model to reason in plain text
# before each tool call, so its "thoughts" arrive as text blocks ahead of the
# tool_use blocks. (This code does not enable Anthropic's extended thinking
# feature; that requires a separate request parameter.)
import anthropic

client = anthropic.Anthropic()

def react_agent(task, tools, tool_fns, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for step in range(max_steps):
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=4096,
            tools=tools,
            messages=messages,
            # System prompt instructs the model to think before acting
            system="Think step by step before calling any tool. "
                   "After each tool result, reason about what you learned."
        )
        # Append model's response to history
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            # "end_turn" (or a cutoff such as "max_tokens"): nothing to execute,
            # so return whatever text the model produced
            return "".join(b.text for b in response.content if b.type == "text")

        # Execute all tool calls in this turn
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                fn = tool_fns[block.name]
                result = fn(**block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(result)
                })
        messages.append({"role": "user", "content": tool_results})
    return "Max steps reached"
```
Animate a ReAct trace. Each step shows the Thought, Action, and Observation cycle. Click Next Step to advance.
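Without the interactive widget, the Thought / Action / Observation cycle for the holiday task can still be shown offline. The sketch below replaces the model with a hard-coded script, so it demonstrates the shape of a trace rather than real model reasoning. The tools next_federal_holiday and days_between are hypothetical, and the holiday tool only knows Memorial Day (the last Monday in May), which is enough for this date.

```python
from datetime import date, timedelta


def next_federal_holiday(today: str) -> dict:
    """Hypothetical tool. Only knows Memorial Day (last Monday in May) of the given year."""
    d = date.fromisoformat(today)
    day = date(d.year, 5, 31)
    while day.weekday() != 0:  # 0 == Monday
        day -= timedelta(days=1)
    if day < d:
        return {"error": "no known holiday after this date"}
    return {"name": "Memorial Day", "date": day.isoformat()}


def days_between(start: str, end: str) -> int:
    """Hypothetical tool: whole days from start to end (ISO dates)."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


TOOLS = {"next_federal_holiday": next_federal_holiday, "days_between": days_between}

# Scripted stand-in for the model: each step is (thought, tool name, function that
# builds the tool arguments from the observations gathered so far).
SCRIPT = [
    ("I need the next holiday on or after today.", "next_federal_holiday",
     lambda obs: {"today": "2026-05-22"}),
    ("Now I can count the days between today and that date.", "days_between",
     lambda obs: {"start": "2026-05-22", "end": obs[0]["date"]}),
]


def run_trace() -> int:
    observations = []
    for thought, tool, build_args in SCRIPT:
        args = build_args(observations)
        result = TOOLS[tool](**args)
        observations.append(result)
        print(f"Thought: {thought}")
        print(f"Action: {tool}({args})")
        print(f"Observation: {result}")
    return observations[-1]


answer = run_trace()
print(f"Final answer: {answer} days")
```

Under these assumptions the trace ends with an answer of 3 days (May 22 to May 25, 2026). Note how the second action is built from the first observation; that dependency is what separates a ReAct loop from a fixed pipeline.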
A single tool call solves simple tasks. Real tasks require chains: search → read → extract → compute → write. Each step depends on the previous. The model decides at each juncture what to do next based on what it learned. This is the agent loop.
The agent loop has exactly four steps, repeated until done: (1) Generate: call the model with the current message history; (2) Parse: check if the response contains a tool call; (3) Execute: if yes, run the tool; (4) Feed back: append the tool result to history and go to step 1. Stop when the model's stop_reason is end_turn and it has produced a final text answer.
Two hard problems arise in multi-step agents. First: when to stop. The model might keep calling tools unnecessarily — "just one more search" — consuming tokens and time. Set a max_steps limit and enforce it. Second: token budget. Every tool result gets appended to the conversation. After 10 tool calls, you might have 40,000 tokens of history. Plan for this.
```python
import anthropic
from typing import Callable

def agent_loop(
    task: str,
    tools: list,
    tool_fns: dict[str, Callable],
    max_steps: int = 15,
    max_tokens_per_step: int = 4096,
) -> str:
    """Complete agent loop. Returns final text answer."""
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": task}]
    step = 0
    total_tokens = 0

    while step < max_steps:
        step += 1
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=max_tokens_per_step,
            tools=tools,
            messages=messages,
        )
        total_tokens += response.usage.input_tokens + response.usage.output_tokens
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            # "end_turn" (or a cutoff such as "max_tokens"): nothing to execute,
            # so return whatever text the model produced
            return "".join(b.text for b in response.content if b.type == "text")

        # Collect all tool calls in this turn (model may call multiple)
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            if block.name not in tool_fns:
                result = {"error": f"Unknown tool: {block.name}"}
            else:
                try:
                    result = tool_fns[block.name](**block.input)
                except Exception as e:
                    # Report the failure to the model instead of crashing the loop
                    result = {"error": str(e)}
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(result)
            })
        messages.append({"role": "user", "content": tool_results})

        # Log token budget
        print(f"Step {step}: {total_tokens} tokens used so far")

    return "Max steps reached without final answer."
```
Watch the agent loop execute. The token counter grows with each step. Set the step limit and see what happens when it's too low.
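The budget problem is easier to plan for with a little arithmetic. Because every call resends the whole history, billed input tokens grow roughly quadratically with the number of steps even though the history itself grows linearly. The sketch below uses made-up numbers: a 500-token task prompt and 4,000 tokens added per step (tool call plus result). Real sizes vary widely.

```python
def history_sizes(prompt_tokens: int, tokens_per_step: int, calls: int) -> list[int]:
    """Size of the history sent on each model call (the first call sends only the prompt)."""
    return [prompt_tokens + tokens_per_step * i for i in range(calls)]


def total_input_tokens(prompt_tokens: int, tokens_per_step: int, calls: int) -> int:
    """Input tokens billed across all calls, since each call resends the full history."""
    return sum(history_sizes(prompt_tokens, tokens_per_step, calls))


sizes = history_sizes(prompt_tokens=500, tokens_per_step=4000, calls=11)
print(sizes[-1])                          # 40500: history on the 11th call, after 10 tool steps
print(total_input_tokens(500, 4000, 11))  # 225500: input tokens billed across all 11 calls
```

So a history of about 40,000 tokens after 10 tool calls corresponds to more than five times that many input tokens billed in total, which is why trimming history early pays off.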
Every call to the model is stateless. The model has no memory between API calls — it only knows what you put in the messages array right now. State is your job. You manage it. You decide what to include, what to drop, and what to summarize when the context fills up.
There are three kinds of state in an agent: (1) Conversation history — the raw messages array. Grows linearly with every turn. Simple but gets expensive fast. (2) Working memory — key facts extracted from tool results and stored separately, injected into the system prompt. Controlled size, but requires curation. (3) External memory — a database or vector store the agent can query. Unlimited capacity, but requires a retrieval tool.
When the conversation grows too long, you have three options: (1) Truncate — drop the oldest messages. Simple, but you lose information. (2) Summarize — call the model to compress old messages into a summary, replace them. More tokens upfront, saves more later. (3) Retrieve — store facts in a vector DB, inject only what's relevant to the current step. Best quality, most engineering.
```python
# Strategy 1: Simple truncation (keep roughly the last N messages)
def truncate_history(messages, keep_last=20):
    if len(messages) <= keep_last:
        return messages
    tail = messages[-(keep_last - 1):]
    # Start the kept tail at an assistant message so that no tool_result
    # survives without the tool_use that produced it
    while tail and tail[0]["role"] != "assistant":
        tail = tail[1:]
    # Always keep the first message (original task)
    return [messages[0]] + tail

# Strategy 2: Summarize old history
def summarize_history(messages, client, keep_recent=6):
    if len(messages) <= keep_recent + 2:
        return messages
    # Split so the recent part starts at an assistant message, which keeps
    # tool_use / tool_result pairs together
    split = len(messages) - keep_recent
    while split < len(messages) and messages[split]["role"] != "assistant":
        split += 1
    old_messages = messages[1:split]
    recent_messages = messages[split:]

    summary_response = client.messages.create(
        model="claude-haiku-4-5",  # Use cheap model for summarization
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Summarize what happened in this agent conversation:\n{old_messages}"
        }]
    )
    summary = summary_response.content[0].text

    # Replace old history with summary + keep recent
    return [
        messages[0],  # original task
        {"role": "user", "content": f"[Previous steps summary]: {summary}"},
        *recent_messages
    ]

# Strategy 3: Working memory dict (injected into system prompt)
working_memory = {}

def update_memory(key, value):
    working_memory[key] = value

def build_system_prompt():
    mem_str = "\n".join(f"- {k}: {v}" for k, v in working_memory.items())
    return f"You are a research agent.\n\nKnown facts:\n{mem_str}"
```
Watch how the context window fills across agent steps. Toggle strategies to see how truncation vs. summarization affects what's retained.
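The code above covers truncation, summarization, and working memory, but not the third option, retrieval. As a rough sketch of the idea, the example below stores facts and returns only those most relevant to the current step. It uses plain word overlap as a stand-in for embedding similarity, so it is a toy built on that assumption rather than a vector database; FactStore and its methods are hypothetical names.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class FactStore:
    """Hypothetical external memory with keyword-overlap retrieval."""

    def __init__(self) -> None:
        self.facts: list[str] = []

    def add(self, fact: str) -> None:
        self.facts.append(fact)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Return up to k facts sharing at least one word with the query, best match first."""
        q = _words(query)
        scored = [(len(q & _words(f)), f) for f in self.facts]
        scored = [(s, f) for s, f in scored if s > 0]
        # Python's sort is stable, so ties keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [f for _, f in scored[:k]]


store = FactStore()
store.add("Order 1042 shipped on May 3 via courier")
store.add("The customer prefers email contact")
store.add("Order 1042 contains two keyboards")

relevant = store.retrieve("When did order 1042 ship?")
print(relevant)
```

Only the retrieved facts would be injected into the prompt for the current step, which keeps the context small no matter how much the store holds.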
Tools fail. The weather API times out. The database returns an empty result. The search engine returns a CAPTCHA page. A file doesn't exist. A calculation overflows. A robust agent handles all of these gracefully — it doesn't crash, it doesn't silently produce wrong answers, and it doesn't get stuck in a loop retrying the same broken call forever.
Errors fall into four categories: (1) Transient — the tool would succeed on retry (network timeout, rate limit). Retry with exponential backoff. (2) Permanent — the tool will never succeed with these inputs (file not found, invalid argument). Don't retry; try a different approach. (3) Partial — the tool returned something but it's incomplete or unexpected (empty result set, truncated response). The model should acknowledge and adapt. (4) Ambiguous — you're not sure which category it is. Ask for clarification or use a fallback.
```python
import time

import requests

# Transient error: retry with exponential backoff
def robust_web_search(query: str, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        try:
            response = requests.get(
                "https://api.search.example/search",
                params={"q": query},
                timeout=5
            )
            response.raise_for_status()
            return response.json()
        except requests.Timeout:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # back off 1s, then 2s, ...
            else:
                return {"error": f"Search timed out after {max_retries} attempts",
                        "retry_after": 30}
        except requests.HTTPError as e:
            if e.response.status_code == 429:  # rate limited
                time.sleep(10)
            else:
                return {"error": f"HTTP {e.response.status_code}: permanent failure"}
    # Every attempt was rate limited: report that instead of returning None
    return {"error": f"Rate limited on all {max_retries} attempts", "retry_after": 30}

# The model sees structured errors and adapts.
# GOOD error message to send back to the model:
good_error = {
    "error": "file_not_found",
    "path": "/data/report.csv",
    "suggestion": "File does not exist. Available files: report_2025.csv, report_2024.csv"
}

# BAD error message (raw Python exception text):
# "FileNotFoundError: [Errno 2] No such file or directory: '/data/report.csv'"
# The model can technically parse this, but structured errors are cleaner
```
Returning a dict with error, reason, and suggestion keys is trivial.
Inject different errors and watch how the agent responds. Toggle error injection to see retry logic and fallback strategies.
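One way to get structured errors consistently is to convert exceptions at the boundary where tools are executed. The sketch below maps a few built-in exception types onto the error categories described above. The mapping, the key names, and the to_tool_error and safe_call helpers are assumptions for illustration; a real agent would tune them to its own tools.

```python
def to_tool_error(exc: Exception) -> dict:
    """Convert an exception into a structured error the model can act on."""
    if isinstance(exc, TimeoutError):
        return {"error": "timeout", "category": "transient",
                "reason": str(exc) or "the tool did not respond in time",
                "suggestion": "Retry once; if it fails again, try another source."}
    if isinstance(exc, FileNotFoundError):
        return {"error": "file_not_found", "category": "permanent",
                "reason": f"no such file: {exc.filename}",
                "suggestion": "List the directory and choose an existing file."}
    if isinstance(exc, (ValueError, TypeError)):
        return {"error": "invalid_argument", "category": "permanent",
                "reason": str(exc),
                "suggestion": "Check the parameter schema and call again with valid arguments."}
    return {"error": "unknown", "category": "ambiguous",
            "reason": f"{type(exc).__name__}: {exc}",
            "suggestion": "Do not retry blindly; report the problem or use a fallback."}


def safe_call(fn, **kwargs) -> dict:
    """Run a tool and always return a dict: either its result or a structured error."""
    try:
        return {"ok": True, "result": fn(**kwargs)}
    except Exception as exc:
        return {"ok": False, **to_tool_error(exc)}


def read_report(path: str) -> str:
    """Example tool that fails when the file is missing."""
    with open(path, encoding="utf-8") as f:
        return f.read()


print(safe_call(read_report, path="/nonexistent/report.csv"))
```

Because safe_call never raises, the agent loop can pass its return value straight back to the model as the tool result.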
An agent that can do anything will eventually do the wrong thing. Agents have failed in production by: deleting files they weren't supposed to touch, spending unlimited API budget on a runaway loop, sending emails to the wrong recipients, and leaking data to unauthorized endpoints. Guardrails prevent these failures before they happen.
The four pillars of agent safety: (1) Loop prevention — maximum step count, detection of repeated identical tool calls. (2) Budget limits — cap on total API cost, number of tool calls, time elapsed. (3) Human-in-the-loop — for dangerous actions (send, delete, pay), require explicit confirmation. (4) Sandboxing — run agent code in isolated environments; restrict filesystem access, outbound network calls, and command execution.
```python
import json
import time
from dataclasses import dataclass, field

@dataclass
class AgentGuardrails:
    max_steps: int = 15
    max_tool_calls: int = 50
    max_cost_usd: float = 1.0
    timeout_seconds: float = 120
    dangerous_tools: set = field(default_factory=lambda: {
        'send_email', 'delete_file', 'execute_sql_write', 'charge_card'
    })
    require_confirmation: bool = True

    # Internal counters
    _steps: int = 0
    _tool_calls: int = 0
    _cost: float = 0.0
    _start_time: float = field(default_factory=time.time)
    _last_calls: list = field(default_factory=list)

    def record_step(self, tool_calls: int = 0, cost_usd: float = 0.0) -> None:
        """Update the counters; the agent loop must call this after every model call."""
        self._steps += 1
        self._tool_calls += tool_calls
        self._cost += cost_usd

    def check(self) -> str | None:
        """Returns error string if any limit exceeded, else None."""
        if self._steps >= self.max_steps:
            return f"Max steps ({self.max_steps}) reached"
        if self._tool_calls >= self.max_tool_calls:
            return f"Max tool calls ({self.max_tool_calls}) reached"
        if self._cost >= self.max_cost_usd:
            return f"Cost limit (${self.max_cost_usd}) reached"
        if time.time() - self._start_time > self.timeout_seconds:
            return f"Timeout ({self.timeout_seconds}s) exceeded"
        return None

    def detect_loop(self, tool_name: str, inputs: dict) -> bool:
        """Detect if the model is calling the same thing over and over."""
        # Sort keys so the same arguments in a different order still match
        call_sig = (tool_name, json.dumps(inputs, sort_keys=True, default=str))
        self._last_calls.append(call_sig)
        if len(self._last_calls) > 5:
            self._last_calls.pop(0)
        # If the last 3 calls are identical, it's a loop
        return len(self._last_calls) >= 3 and len(set(self._last_calls[-3:])) == 1

    def requires_confirmation(self, tool_name: str) -> bool:
        return self.require_confirmation and tool_name in self.dangerous_tools
```
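The human-in-the-loop pillar is simple to wire in at the point where tools are executed. This sketch is independent of the class above: the DANGEROUS_TOOLS set, the execute_tool wrapper, and the confirm callback are hypothetical, and in a real system confirm would prompt a person rather than run a lambda.

```python
from typing import Callable

DANGEROUS_TOOLS = {"send_email", "delete_file"}  # hypothetical names


def execute_tool(name: str, args: dict, tool_fns: dict[str, Callable],
                 confirm: Callable[[str, dict], bool]) -> dict:
    """Run a tool, asking for confirmation first if it is marked dangerous."""
    if name in DANGEROUS_TOOLS and not confirm(name, args):
        # Tell the model what happened so it can stop or pick another approach.
        return {"error": "denied", "reason": f"{name} was not approved by the user"}
    return {"result": tool_fns[name](**args)}


sent = []  # records emails "sent" by the stub tool below
tool_fns = {
    "send_email": lambda to, body: sent.append((to, body)) or "sent",
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
}

# Safe tools run without asking; dangerous ones go through the callback.
print(execute_tool("lookup_order", {"order_id": 7}, tool_fns, confirm=lambda n, a: False))
print(execute_tool("send_email", {"to": "a@example.com", "body": "hi"}, tool_fns,
                   confirm=lambda n, a: False))
```

Returning a denial as an ordinary tool result, instead of raising, lets the model explain to the user why the action did not happen.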
Sandboxing means the agent code cannot escape its designated environment. Concretely: (1) File access is limited to a designated working directory — no access to /etc, ~/.ssh, or credentials files. (2) Network access only to an allow-listed set of domains. (3) Code execution runs in a container with no persistent state and a CPU/memory limit. (4) API keys for external services have scoped permissions — read-only where possible.
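Two of those checks can be approximated in application code, shown below as a sketch. The WORKDIR path and ALLOWED_DOMAINS set are made-up values, and checks like these complement rather than replace OS-level isolation such as containers.

```python
from pathlib import Path
from urllib.parse import urlparse

WORKDIR = Path("/srv/agent/workdir")       # hypothetical working directory
ALLOWED_DOMAINS = {"api.search.example"}   # hypothetical allow-list


def is_path_allowed(user_path: str, workdir: Path = WORKDIR) -> bool:
    """True only if the path, after resolving '..' and symlinks, stays inside workdir."""
    root = workdir.resolve()
    # Joining with an absolute path discards root, so "/etc/passwd" is rejected below.
    target = (root / user_path).resolve()
    return target == root or root in target.parents


def is_url_allowed(url: str) -> bool:
    """True only for https URLs whose host is exactly on the allow-list."""
    parts = urlparse(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_DOMAINS


print(is_path_allowed("notes/report.csv"))       # True
print(is_path_allowed("../../etc/passwd"))       # False
print(is_url_allowed("https://evil.example/"))   # False
```

Resolving the path before comparing is the important step: a plain string-prefix check would let ../ sequences escape the working directory.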
Watch an agent loop run and see which guardrails trip. Adjust the limits and observe the difference.
An agent calls send_email repeatedly because its previous call seemed to fail. What guardrail would catch this?
This is the payoff: a fully interactive agent running in your browser, no API key needed. The agent has three tools: a calculator, a web search (simulated), and a file reader (reads from a mini in-memory filesystem). Choose a task, watch it reason step by step, inject errors, and cap the tool budget.
You've built the mental model: function calling → tool definitions → ReAct → multi-step loops → state management → error handling → guardrails. Now let's zoom out and see how this lesson connects to the wider landscape of agent technology.
The tool definitions you write today are vendor-specific: each API has its own JSON format. MCP (the Model Context Protocol, an open standard introduced by Anthropic) standardizes this: tools are described once and work across any MCP-compatible client. Think of it as USB for agent tools. Instead of writing custom integrations for Claude, GPT-4, and Gemini separately, you write an MCP server once. This is the direction the industry is heading.
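To see what per-vendor formats mean in practice, here is a sketch that converts the Anthropic-style definition used in this lesson into the function-tool shape used by OpenAI's Chat Completions API. The to_openai_tool helper is hypothetical, and the target layout should be treated as an assumption to verify against the current API reference.

```python
def to_openai_tool(anthropic_tool: dict) -> dict:
    """Convert an Anthropic-style tool definition to an OpenAI-style function tool."""
    return {
        "type": "function",
        "function": {
            "name": anthropic_tool["name"],
            "description": anthropic_tool.get("description", ""),
            # Both formats describe parameters with JSON Schema; only the key differs.
            "parameters": anthropic_tool["input_schema"],
        },
    }


weather_tool = {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}

converted = to_openai_tool(weather_tool)
print(converted["function"]["name"], list(converted["function"]["parameters"]["properties"]))
```

The schema itself carries over unchanged; what differs is the envelope around it. That envelope is the part a shared protocol aims to make uniform.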
RAG is agents applied to knowledge retrieval. The "retrieve" step is just a tool call to a vector database. The model asks: "Search my docs for information about X." The retrieval tool returns relevant chunks. The model synthesizes. What makes it an agent pattern — not just a two-step pipeline — is that a ReAct agent can retrieve multiple times, follow up on what it finds, and know when it has enough context to answer.
Autonomous workflows run without a human in the loop — the agent reads a task queue, executes tasks, writes results, and moves to the next. This requires robust error handling, comprehensive logging, and clear stopping criteria. The engineering is identical to what you've learned — the difference is operational: nobody is watching, so every failure mode must be handled in code, not handed to a human.
The most powerful systems have multiple agents collaborating. An orchestrator agent breaks down a complex task and delegates subtasks to worker agents, each with specialized tool sets. The orchestrator collects results and synthesizes. This is closely analogous to a distributed system: each agent is a service, the orchestrator plays the role of an API gateway, and the message history is the shared data store.
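The orchestrator pattern can be sketched without any model at all. In the example below the plan is hard-coded and the workers are ordinary functions, both assumptions standing in for model-driven agents; the names research_worker, math_worker, and orchestrate are hypothetical.

```python
from typing import Callable


def research_worker(subtask: str) -> str:
    """Stand-in for an agent with search tools."""
    return f"[research] notes on: {subtask}"


def math_worker(subtask: str) -> str:
    """Stand-in for an agent with a calculator tool; expects 'a * b'."""
    a, b = (int(x) for x in subtask.split("*"))
    return f"[math] {a} * {b} = {a * b}"


WORKERS: dict[str, Callable[[str], str]] = {
    "research": research_worker,
    "math": math_worker,
}


def orchestrate(plan: list[tuple[str, str]]) -> str:
    """Delegate each (worker, subtask) pair, then combine the results in order."""
    results = []
    for worker_name, subtask in plan:
        worker = WORKERS.get(worker_name)
        if worker is None:
            results.append(f"[error] no worker named {worker_name}")
            continue
        results.append(worker(subtask))
    return "\n".join(results)


report = orchestrate([("research", "Tokyo weather patterns"), ("math", "847 * 293")])
print(report)
```

In a real system the plan would come from the orchestrator model and each worker would run its own agent loop, but the delegate-then-combine structure stays the same.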
Click a concept to see how it connects to the agent fundamentals from this lesson.
| Concept | What you learned here | What's next |
|---|---|---|
| Function calling | Single tool round-trip | Evaluating agent accuracy |
| Tool definitions | JSON Schema for tools | MCP server spec, OpenAPI |
| ReAct loop | Thought/Action/Observation | Tree-of-thought, MCTS planning |
| State management | Truncation + summarization | Long-term memory, vector stores |
| Guardrails | Loop detection, budgets | Constitutional AI, policy models |