AI Agents & Tool Use

Build an Agent From Scratch

A chatbot can only talk. An agent can do things — search the web, run code, call APIs, write files. This is how language models become general-purpose workers.

Prerequisites: Basic Python + have used a chat API. That's it.
10 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Agents?

You ask a chatbot: "What's the weather in Tokyo right now?" It says: "I'm sorry, I don't have access to real-time data." You ask it to send an email. Same answer. You ask it to check your database. Same answer. A chatbot is a very sophisticated text predictor — but it lives in a cage. It can describe the world but it cannot touch it.

Now imagine giving it a phone. Not just words — actual phone calls it can make. "Call the weather API." It calls it. Gets the response. Reads it. Incorporates it. Answers you. That's an agent: a language model equipped with tools — functions it can invoke — that let it act in the real world.

The key shift: A chatbot's output is text. An agent's output is actions. The text is still there — but now some of that text is structured commands that the host software interprets and executes. The results come back, and the model continues. Chat becomes a control loop.

This is not science fiction. Every major AI API (OpenAI, Anthropic, Google, Mistral) now has built-in tool-use support. Agents are used today for: customer support bots that actually look up orders, coding assistants that run tests, research assistants that search and synthesize, and infrastructure bots that query and update databases.

Before we get to the machinery, let's feel the difference viscerally. Below: a chatbot vs. an agent faced with the same task. Watch what happens.

Chatbot vs. Agent — Same Task

Click Run Task to watch both handle "What is 847 × 293 and is the result a prime number?" The chatbot guesses. The agent acts.

The chatbot might get 847 × 293 wrong (LLMs are notoriously bad at arithmetic). The agent calls a calculator tool, gets the exact answer (248,171), then calls an is_prime tool, gets false, and answers confidently. The task is simple, but the lesson is profound: tools compensate for the model's weaknesses.
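A minimal sketch of the two tools this example assumes. The names `multiply` and `is_prime` are illustrative, not part of any API; the point is that the host computes the answer exactly instead of the model predicting digits.

```python
def multiply(a: int, b: int) -> int:
    """Exact integer multiplication (Python ints do not overflow)."""
    return a * b


def is_prime(n: int) -> bool:
    """Trial division up to sqrt(n); adequate for small inputs like this one."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


product = multiply(847, 293)
print(product, is_prime(product))  # 248171 False
```

The product of two integers greater than 1 is always composite, so the false result is what we should expect here.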

The four superpowers tools give a model: (1) Real-time data — news, prices, weather, APIs. (2) Exact computation — arithmetic, code execution, databases. (3) Persistent memory — file reads/writes, vector stores. (4) Side effects — sending emails, calling services, triggering workflows. Without tools, a model is read-only. With tools, it can change the world.
Why does an agent reliably succeed at arithmetic while a chatbot often fails?

Chapter 1: Function Calling — How Models Issue Commands

The model cannot directly call a function. It doesn't run code — it predicts tokens. So how does tool use work? Simple: the model outputs a specially formatted chunk of text that the host software recognizes as a function call. The host runs the function, gets the result, and feeds it back as a new message. The model never left the text domain — the host software bridges the gap.

Modern APIs make this seamless. You pass a list of tool definitions alongside your prompt. The model can either respond normally (just text) or respond with a tool call — a structured JSON object naming the function and its arguments. The API handles parsing; you never see raw JSON in normal use.

The anatomy of a tool call: {"type": "tool_use", "name": "get_weather", "input": {"city": "Tokyo"}}. Three things: what type of response (tool_use), which function (name), and what to pass it (input). In the Anthropic API each call also carries an id, which you echo back as tool_use_id when you return the result. The model fills these in; you execute them.
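On the host side, executing a tool call amounts to a dictionary lookup plus a function call. The sketch below assumes the call arrives as a plain dict in the shape shown above; `TOOL_REGISTRY` and `execute_tool_call` are hypothetical names, and `get_weather` returns canned data rather than calling a real service.

```python
import json


# Hypothetical tool implementation; a real one would call a weather API.
def get_weather(city: str) -> dict:
    return {"city": city, "temp": 22, "condition": "cloudy"}


# Maps tool names (as the model sees them) to host functions.
TOOL_REGISTRY = {"get_weather": get_weather}


def execute_tool_call(call: dict) -> str:
    """Run one tool call and return the result as a JSON string."""
    if call.get("type") != "tool_use":
        raise ValueError("not a tool call")
    fn = TOOL_REGISTRY.get(call["name"])
    if fn is None:
        # Report the problem as data so it can be fed back to the model
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    return json.dumps(fn(**call["input"]))


call = {"type": "tool_use", "name": "get_weather", "input": {"city": "Tokyo"}}
print(execute_tool_call(call))
```

Returning unknown-tool errors as data rather than raising keeps the loop alive, so the model gets a chance to correct itself.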

Let's trace a complete round-trip with the Anthropic API. The user asks "What's the weather in Tokyo?" The model responds not with text but with a tool call. We execute get_weather("Tokyo"), get {"temp": 22, "condition": "cloudy"}, and send that back as a tool_result message. The model then generates a natural-language answer incorporating the real data.

python
import anthropic

client = anthropic.Anthropic()

# Define what tools are available
tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"]
    }
}]

# Turn 1: User asks, model responds with a tool call
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
)

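# response.stop_reason == "tool_use": the model wants to call a tool.
# The response may begin with a text block, so find the tool_use block by type.
tool_call = next(b for b in response.content if b.type == "tool_use")
# tool_call.name == "get_weather"
# tool_call.input == {"city": "Tokyo"}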

# We execute the function ourselves
def get_weather(city):
    # In reality: call a weather API
    return {"temp": 22, "condition": "cloudy", "humidity": 68}

result = get_weather(**tool_call.input)

# Turn 2: feed result back, model produces final answer
final = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"},
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": [{
            "type": "tool_result",
            "tool_use_id": tool_call.id,
            "content": str(result)
        }]}
    ]
)
# final.content[0].text == "Tokyo is currently 22°C and cloudy with 68% humidity."
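stop_reason is the signal: When stop_reason == "end_turn", the model is done talking. When stop_reason == "tool_use", it's waiting for you to run a function and report back. Other values exist too (for example "max_tokens" when the response was cut off), so treat anything that is not "tool_use" as the end of the loop. Your agent loop checks this flag on every response.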
Function Call Round-Trip Visualizer

Watch the messages flow between the user, model, and tool executor. Click Step to advance one message at a time.

When a model's response has stop_reason == "tool_use", what should your code do next?

Chapter 2: Tool Definitions — Teaching the Model What It Can Do

The model can only call tools it knows about. You introduce tools via their definitions — a JSON schema describing the tool's name, what it does, and what parameters it accepts. The model reads these definitions as part of its context and decides when and how to use them.

Here is the surprising truth: the description matters more than the code. The model never sees your Python implementation. It only sees the name, description, and parameter schema. If your description is vague, the model won't know when to call the tool or what to pass. If it's precise, the model will use the tool exactly as intended.

Bad description, good code: A tool named search with description "searches things" — the model doesn't know what it searches, when to use it vs. a database query, or what format the query should be in. The best implementation in the world won't help if the model can't figure out when to reach for it.

A complete tool definition has four parts: (1) name — a short, unique identifier the model uses to reference the tool; (2) description — a paragraph explaining what the tool does, when to use it, and what it returns; (3) input_schema — a JSON Schema defining each parameter's type, description, and whether it's required; (4) implicitly: the output contract — describe in the tool description what the return value looks like.

python
# POOR tool definition — the model will misuse this
bad_tool = {
    "name": "search",
    "description": "searches things",
    "input_schema": {
        "type": "object",
        "properties": {"q": {"type": "string"}},
        "required": ["q"]
    }
}

# GOOD tool definition — precise contract
good_tool = {
    "name": "web_search",
    "description": (
        "Search the web for current information not in your training data. "
        "Use for: recent news, current prices, live sports scores, today's weather, "
        "or any fact that may have changed after 2024. Returns a list of up to 5 "
        "results, each with 'title', 'url', and 'snippet' fields. "
        "Prefer specific queries over vague ones."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The search query. Use natural language or keywords."
            },
            "num_results": {
                "type": "integer",
                "description": "Number of results to return (1-5). Default 3.",
                "default": 3
            }
        },
        "required": ["query"]
    }
}
The rule of thumb: Write tool descriptions as if explaining to a smart intern who has never seen your codebase. They need to know: what does this do? when should I reach for it? what should I pass? what will I get back? Four questions. Four answers. One good description.

Parameter schemas follow JSON Schema. The types you'll use 90% of the time: string, integer, number, boolean, array, object. Mark required parameters in the required array. Use enum to constrain to specific values.
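To make `required` and `enum` concrete, here is a hand-rolled check covering only a small subset of JSON Schema. It is a sketch for illustration: the `validate_input` helper and the `units` parameter are assumptions, and a real project would more likely use a full JSON Schema validation library.

```python
def validate_input(schema: dict, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the input is acceptable."""
    type_map = {
        "string": str, "integer": int, "number": (int, float),
        "boolean": bool, "array": list, "object": dict,
    }
    problems = []
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required parameter: {name}")
    for name, value in args.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            problems.append(f"unknown parameter: {name}")
            continue
        expected = type_map.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it for numeric types
        bool_as_number = isinstance(value, bool) and spec.get("type") in ("integer", "number")
        if expected and (not isinstance(value, expected) or bool_as_number):
            problems.append(f"{name}: expected {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: must be one of {spec['enum']}")
    return problems


schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}
print(validate_input(schema, {"city": "Tokyo", "units": "celsius"}))  # []
print(validate_input(schema, {"units": "kelvin"}))
```

Running a check like this before executing a tool lets you return a precise error to the model instead of a Python traceback.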

Tool Definition Quality Scorer

Explore different tool quality levels. See how a vague vs. precise definition changes what the model understands about the tool.

Element | Purpose | Common mistake
name | Model uses this to call the tool | Too generic (tool1), or inconsistent snake_case naming
description | Tells the model WHEN and WHY to use it | Too short, missing return value format
param description | Tells the model WHAT to pass | Omitted entirely, so the model has to guess
required | Prevents model from omitting critical args | All params optional, so the model skips them
enum | Constrains to valid values | Free string when only 3 options exist
Your tool description says "gets data." The model keeps calling it with wrong arguments. What's the most likely fix?

Chapter 3: The ReAct Pattern — Thinking and Doing Together

An agent that just calls tools is a fancy function dispatcher. What makes agents powerful is reasoning between actions. The ReAct pattern (Reasoning + Acting) alternates explicit thought steps with tool calls: the model first writes what it's thinking, then decides what to do, then observes the result, then thinks again.

Why does this help? Two reasons. First, thinking out loud helps the model plan — it can notice when a previous step's result changes what it should do next. Second, you (the developer) can read the thoughts and debug what went wrong. Pure tool-calling with no reasoning is a black box. ReAct gives you a trace.

ReAct cycle: Thought → what do I know, what do I need? → Action → call this tool with these args → Observation → here's what the tool returned → repeat until done → Answer.

Here is a complete ReAct trace for the task: "How many days until the next US federal holiday from today (May 22, 2026)?"

TASK: "How many days until the next US federal holiday?"
Thought: I need today's date and the list of upcoming federal holidays. I'll get today's date first.
Action: get_current_date()
Observation: {"date": "2026-05-22", "day_of_week": "Friday"}
Thought: Today is May 22, 2026. Memorial Day is always the last Monday of May. In 2026, May has 31 days. The last Monday would be May 25. Let me verify.
Action: get_federal_holidays(year=2026)
Observation: [{"name": "Memorial Day", "date": "2026-05-25"}, {"name": "Juneteenth", "date": "2026-06-19"}, ...]
Thought: Memorial Day is 2026-05-25. From May 22 to May 25 is 3 days. That's the next holiday.
Answer: Memorial Day is in 3 days, on Monday May 25, 2026.
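The date reasoning in this trace can be checked with the standard library. This is only a verification sketch; `last_monday_of_may` is a hypothetical helper, not one of the agent's tools.

```python
from datetime import date, timedelta


def last_monday_of_may(year: int) -> date:
    """Memorial Day rule: walk back from May 31 to the nearest Monday."""
    day = date(year, 5, 31)
    while day.weekday() != 0:  # Monday is 0
        day -= timedelta(days=1)
    return day


today = date(2026, 5, 22)
memorial_day = last_monday_of_may(2026)
print(today.strftime("%A"), memorial_day, (memorial_day - today).days)
# Friday 2026-05-25 3
```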
python
# ReAct via a system prompt (extended thinking is not enabled here).
# The model's "thoughts" appear as ordinary text blocks before its tool calls.

import anthropic

client = anthropic.Anthropic()

def react_agent(task, tools, tool_fns, max_steps=10):
    messages = [{"role": "user", "content": task}]

    for step in range(max_steps):
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=4096,
            tools=tools,
            messages=messages,
            # system prompt instructs model to think before acting
            system="Think step by step before calling any tool. "
                   "After each tool result, reason about what you learned."
        )

        # Append model's response to history
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            # Done (end_turn) or cut off (e.g. max_tokens): return any text produced
            return next((b.text for b in response.content
                         if hasattr(b, 'text')), "")

        # Execute all tool calls in this turn
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                fn = tool_fns[block.name]
                result = fn(**block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(result)
                })

        messages.append({"role": "user", "content": tool_results})

    return "Max steps reached"
ReAct Step Tracer

Animate a ReAct trace. Each step shows the Thought, Action, and Observation cycle. Click Next Step to advance.

What advantage does ReAct have over a simple "call tools then answer" approach?

Chapter 4: Multi-Step Agents — Chains of Tool Calls

A single tool call solves simple tasks. Real tasks require chains: search → read → extract → compute → write. Each step depends on the previous. The model decides at each juncture what to do next based on what it learned. This is the agent loop.

The agent loop has exactly four steps, repeated until done: (1) Generate: call the model with the current message history; (2) Parse: check if the response contains a tool call; (3) Execute: if yes, run the tool; (4) Feed back: append the tool result to history and go to step 1. Stop when the model's stop_reason is end_turn and it has produced a final text answer.

Generate: call the model with the full message history and get a response
Parse: is stop_reason "tool_use"? If yes, extract the tool name and inputs; if no, done
Execute: run the function, capture the result or error
Feed back: append the tool_result message, then loop back to Generate
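The four steps can be exercised offline with a scripted stand-in for the model. Everything here is an assumption made for illustration: `ScriptedModel`, `run_loop`, and the simplified response dicts are not the real API shapes, only a way to see the control flow without a network call.

```python
class ScriptedModel:
    """Returns canned responses in order, one per call."""

    def __init__(self, script):
        self._responses = iter(script)

    def generate(self, messages: list) -> dict:
        return next(self._responses)


def run_loop(model, task: str, tool_fns: dict, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = model.generate(messages)                       # 1. Generate
        messages.append({"role": "assistant", "content": response})
        if response["stop_reason"] != "tool_use":                 # 2. Parse
            return response.get("text", "")
        results = []
        for call in response["tool_calls"]:
            output = tool_fns[call["name"]](**call["input"])      # 3. Execute
            results.append({"tool_use_id": call["id"], "content": str(output)})
        messages.append({"role": "user", "content": results})     # 4. Feed back
    return "Max steps reached"


script = [
    {"stop_reason": "tool_use",
     "tool_calls": [{"id": "call_1", "name": "add", "input": {"a": 2, "b": 3}}]},
    {"stop_reason": "end_turn", "text": "2 + 3 = 5"},
]
answer = run_loop(ScriptedModel(script), "What is 2 + 3?", {"add": lambda a, b: a + b})
print(answer)  # 2 + 3 = 5
```

A scripted model like this is also handy for unit-testing loop logic such as step limits and error handling.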

Two hard problems arise in multi-step agents. First: when to stop. The model might keep calling tools unnecessarily — "just one more search" — consuming tokens and time. Set a max_steps limit and enforce it. Second: token budget. Every tool result gets appended to the conversation. After 10 tool calls, you might have 40,000 tokens of history. Plan for this.

python
import anthropic
from typing import Callable

def agent_loop(
    task: str,
    tools: list,
    tool_fns: dict[str, Callable],
    max_steps: int = 15,
    max_tokens_per_step: int = 4096,
) -> str:
    """Complete agent loop. Returns final text answer."""
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": task}]
    step = 0
    total_tokens = 0

    while step < max_steps:
        step += 1
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=max_tokens_per_step,
            tools=tools,
            messages=messages,
        )
        total_tokens += response.usage.input_tokens + response.usage.output_tokens
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            # end_turn, or a cut-off such as max_tokens: return any text produced
            return next(
                (b.text for b in response.content if hasattr(b, 'text')), ""
            )

        # Collect all tool calls in this turn (model may call multiple)
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            if block.name not in tool_fns:
                result = {"error": f"Unknown tool: {block.name}"}
            else:
                try:
                    result = tool_fns[block.name](**block.input)
                except Exception as e:
                    result = {"error": str(e)}

            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(result)
            })

        messages.append({"role": "user", "content": tool_results})
        # Log token budget
        print(f"Step {step}: {total_tokens} tokens used so far")

    return "Max steps reached without final answer."
Token budget math: Assume each tool result averages 500 tokens and each model response is 1000 tokens. At step 10 your input context includes 10 × 1500 = 15,000 tokens of history plus the original task. At 15 steps you're at 22,500 tokens. Claude's 200k context gives headroom, but cost adds up: assuming an output price of $15/MTok, 15 steps × 1000 output tokens = 15,000 tokens ≈ $0.22 per task. Multiply by 1,000 tasks/day and that is roughly $220/day for output alone. Input tokens are billed too, and the whole history is re-sent on every step. Budget matters.
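Because the full history is re-sent on every step, cumulative input tokens grow roughly quadratically with step count. The sketch below works through that arithmetic using the same 500 and 1000 token figures; the 100-token task size and the $3/MTok input price are assumptions chosen only for illustration, not quoted rates.

```python
def estimate_tokens(steps: int, result_tokens: int = 500, response_tokens: int = 1000,
                    task_tokens: int = 100) -> tuple[int, int]:
    """Return (total_input_tokens, total_output_tokens) summed over all steps."""
    total_input = 0
    history = task_tokens
    for _ in range(steps):
        total_input += history                      # whole history sent as input
        history += response_tokens + result_tokens  # grows by one round per step
    return total_input, steps * response_tokens


def estimate_cost(steps: int, input_price: float, output_price: float) -> float:
    """Prices are USD per million tokens (assumed values)."""
    total_in, total_out = estimate_tokens(steps)
    return (total_in * input_price + total_out * output_price) / 1_000_000


total_in, total_out = estimate_tokens(15)
print(total_in, total_out)  # 159000 15000
print(round(estimate_cost(15, input_price=3.0, output_price=15.0), 4))
```

Under these assumptions input tokens outnumber output tokens by about ten to one, which is why trimming history matters as much as limiting steps.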
Agent Loop Animator

Watch the agent loop execute. The token counter grows with each step. Set the step limit and see what happens when it's too low.

In the agent loop, what happens right after a tool call is executed?

Chapter 5: State Management — Memory Across Turns

Every call to the model is stateless. The model has no memory between API calls — it only knows what you put in the messages array right now. State is your job. You manage it. You decide what to include, what to drop, and what to summarize when the context fills up.

There are three kinds of state in an agent: (1) Conversation history — the raw messages array. Grows linearly with every turn. Simple but gets expensive fast. (2) Working memory — key facts extracted from tool results and stored separately, injected into the system prompt. Controlled size, but requires curation. (3) External memory — a database or vector store the agent can query. Unlimited capacity, but requires a retrieval tool.

The context window is a sliding spotlight. Claude has a 200k token context. That sounds enormous, but history accumulates: a 10-step agent with 1000-token responses and 500-token tool results carries 10 × (1000 + 500) = 15,000 tokens, and all of it is re-sent and billed on every call. At 100 parallel tasks with 20 steps each, you want every token to count. Stale early history is wasted context.

When the conversation grows too long, you have three options: (1) Truncate — drop the oldest messages. Simple, but you lose information. (2) Summarize — call the model to compress old messages into a summary, replace them. More tokens upfront, saves more later. (3) Retrieve — store facts in a vector DB, inject only what's relevant to the current step. Best quality, most engineering.

python
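# Strategy 1: Simple truncation (keep last N turns)
def truncate_history(messages, keep_last=20):
    # Always keep the first message (original task)
    if len(messages) <= keep_last:
        return messages
    tail = messages[-(keep_last - 1):]
    # Start the tail on an assistant message; a leading user message would be
    # a tool_result whose matching tool_use was just dropped
    while tail and tail[0]["role"] != "assistant":
        tail = tail[1:]
    return [messages[0]] + tail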

# Strategy 2: Summarize old history
def summarize_history(messages, client, keep_recent=6):
    if len(messages) <= keep_recent + 2:
        return messages

    old_messages = messages[1:-keep_recent]
    recent_messages = messages[-keep_recent:]

    summary_response = client.messages.create(
        model="claude-haiku-4-5",  # Use cheap model for summarization
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Summarize what happened in this agent conversation:\n{old_messages}"
        }]
    )
    summary = summary_response.content[0].text

    # Replace old history with summary + keep recent
    return [
        messages[0],  # original task
        {"role": "user", "content": f"[Previous steps summary]: {summary}"},
        *recent_messages
    ]

# Strategy 3: Working memory dict (injected into system prompt)
working_memory = {}

def update_memory(key, value):
    working_memory[key] = value

def build_system_prompt():
    mem_str = "\n".join(f"- {k}: {v}" for k, v in working_memory.items())
    return f"You are a research agent.\n\nKnown facts:\n{mem_str}"
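The third option, retrieval, is not covered by the strategies above. Here is a toy sketch: `KeywordMemory` is a hypothetical stand-in for a vector store that scores facts by word overlap instead of embeddings, just to show the store and retrieve shape an agent tool would wrap.

```python
import re


class KeywordMemory:
    """Toy external memory: stores facts, retrieves by word overlap."""

    def __init__(self):
        self._facts: list[str] = []

    def store(self, fact: str) -> None:
        self._facts.append(fact)

    def retrieve(self, query: str, top_k: int = 2) -> list[str]:
        query_words = set(re.findall(r"\w+", query.lower()))
        scored = []
        for fact in self._facts:
            overlap = len(query_words & set(re.findall(r"\w+", fact.lower())))
            if overlap:
                scored.append((overlap, fact))
        # Highest overlap first; the sort is stable, so ties keep insertion order
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for _, fact in scored[:top_k]]


memory = KeywordMemory()
memory.store("Order 1042 shipped on May 3")
memory.store("Customer prefers email contact")
memory.store("Order 1042 contains two keyboards")
print(memory.retrieve("When did order 1042 ship?"))
```

Only the retrieved facts are injected into the prompt, so context size stays bounded no matter how much the agent has stored.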
Context Window Fill Rate

Watch how the context window fills across agent steps. Toggle strategies to see how truncation vs. summarization affects what's retained.

Why can't the model remember facts from a previous API call automatically?

Chapter 6: Error Recovery — When Tools Fail

Tools fail. The weather API times out. The database returns an empty result. The search engine returns a CAPTCHA page. A file doesn't exist. A calculation overflows. A robust agent handles all of these gracefully — it doesn't crash, it doesn't silently produce wrong answers, and it doesn't get stuck in a loop retrying the same broken call forever.

Errors fall into four categories: (1) Transient — the tool would succeed on retry (network timeout, rate limit). Retry with exponential backoff. (2) Permanent — the tool will never succeed with these inputs (file not found, invalid argument). Don't retry; try a different approach. (3) Partial — the tool returned something but it's incomplete or unexpected (empty result set, truncated response). The model should acknowledge and adapt. (4) Ambiguous — you're not sure which category it is. Ask for clarification or use a fallback.
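One way to make the four categories actionable is to map exceptions onto them in a single place. The mapping below is an assumption for illustration: which exception types count as transient or permanent depends on your tools, and `ErrorKind` and `classify` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class ErrorKind(Enum):
    TRANSIENT = "retry with backoff"
    PERMANENT = "do not retry; try a different approach"
    PARTIAL = "report what is missing and adapt"
    AMBIGUOUS = "fall back or ask for clarification"


def classify(exc: Optional[Exception] = None, result=None) -> Optional[ErrorKind]:
    """Map an exception (or a suspicious result) to one of the four categories."""
    if exc is None:
        # No exception, but an empty result is treated as a partial failure
        if result in (None, [], {}, ""):
            return ErrorKind.PARTIAL
        return None
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorKind.TRANSIENT
    if isinstance(exc, (FileNotFoundError, ValueError, KeyError, TypeError)):
        return ErrorKind.PERMANENT
    return ErrorKind.AMBIGUOUS


for case in (TimeoutError(), FileNotFoundError(), RuntimeError()):
    kind = classify(case)
    print(type(case).__name__, "->", kind.name, "->", kind.value)
```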

python
import time
import requests
from typing import Any

# Transient error: retry with exponential backoff
def robust_web_search(query: str, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        try:
            response = requests.get(
                "https://api.search.example/search",
                params={"q": query},
                timeout=5
            )
            response.raise_for_status()
            return response.json()
        except requests.Timeout:
            # Transient: back off 1s, 2s, 4s between attempts
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
        except requests.HTTPError as e:
            if e.response.status_code == 429:  # rate limited: transient
                if attempt < max_retries - 1:
                    time.sleep(10)
            else:
                return {"error": f"HTTP {e.response.status_code}: permanent failure"}
    # Every attempt hit a transient error; report it instead of returning None
    return {"error": f"Search failed after {max_retries} attempts",
            "retry_after": 30}

# The model sees structured errors and adapts
# GOOD error message to send back to model:
good_error = {
    "error": "file_not_found",
    "path": "/data/report.csv",
    "suggestion": "File does not exist. Available files: report_2025.csv, report_2024.csv"
}
# BAD error message (Python exception raw text):
# "FileNotFoundError: [Errno 2] No such file or directory: '/data/report.csv'"
# Model can technically parse this but structured errors are cleaner
Error messages are model prompts. When you return an error to the model as a tool_result, the model reads that text and decides what to do next. A good error message tells the model what failed, why it failed, and ideally what alternatives exist. A raw Python stack trace is noisy and hard to act on; a structured JSON object with keys such as error, path, and suggestion is trivial to parse.
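A sketch of a tool that produces this kind of structured error itself. `read_file_tool` is a hypothetical file-reading tool; listing sibling files as the suggestion is one possible design, assumed here because it gives the model a concrete alternative to try.

```python
import json
import os
import tempfile


def read_file_tool(path: str) -> str:
    """Read a file; on failure return a structured error the model can act on."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.dumps({"content": handle.read()})
    except FileNotFoundError:
        directory = os.path.dirname(path) or "."
        available = sorted(os.listdir(directory)) if os.path.isdir(directory) else []
        return json.dumps({
            "error": "file_not_found",
            "path": path,
            "suggestion": f"File does not exist. Available files: {', '.join(available) or 'none'}",
        })


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as workdir:
    with open(os.path.join(workdir, "report_2025.csv"), "w", encoding="utf-8") as f:
        f.write("q1,q2\n")
    reply = json.loads(read_file_tool(os.path.join(workdir, "report.csv")))
    print(reply["error"], "|", reply["suggestion"])
```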
Error Recovery Simulator

Inject different errors and watch how the agent responds. Toggle error injection to see retry logic and fallback strategies.

A file-not-found error occurs. The model retries the same file path 5 times. What went wrong?

Chapter 7: Guardrails — Keeping Agents Safe

An agent that can do anything will eventually do the wrong thing. Agents have failed in production by: deleting files they weren't supposed to touch, spending unlimited API budget on a runaway loop, sending emails to the wrong recipients, and leaking data to unauthorized endpoints. Guardrails prevent these failures before they happen.

The four pillars of agent safety: (1) Loop prevention — maximum step count, detection of repeated identical tool calls. (2) Budget limits — cap on total API cost, number of tool calls, time elapsed. (3) Human-in-the-loop — for dangerous actions (send, delete, pay), require explicit confirmation. (4) Sandboxing — run agent code in isolated environments; restrict filesystem access, outbound network calls, and command execution.

python
from dataclasses import dataclass, field
from typing import Callable
import time

@dataclass
class AgentGuardrails:
    max_steps: int = 15
    max_tool_calls: int = 50
    max_cost_usd: float = 1.0
    timeout_seconds: float = 120
    dangerous_tools: set = field(default_factory=lambda: {
        'send_email', 'delete_file', 'execute_sql_write', 'charge_card'
    })
    require_confirmation: bool = True

    # Internal counters
    _steps: int = 0
    _tool_calls: int = 0
    _cost: float = 0.0
    _start_time: float = field(default_factory=time.time)
    _last_calls: list = field(default_factory=list)

    def check(self) -> str | None:
        """Returns error string if any limit exceeded, else None."""
        if self._steps >= self.max_steps:
            return f"Max steps ({self.max_steps}) reached"
        if self._tool_calls >= self.max_tool_calls:
            return f"Max tool calls ({self.max_tool_calls}) reached"
        if self._cost >= self.max_cost_usd:
            return f"Cost limit (${self.max_cost_usd}) reached"
        if time.time() - self._start_time > self.timeout_seconds:
            return f"Timeout ({self.timeout_seconds}s) exceeded"
        return None

    def detect_loop(self, tool_name: str, inputs: dict) -> bool:
        """Detect if the model is calling the same thing over and over."""
        call_sig = (tool_name, str(inputs))
        self._last_calls.append(call_sig)
        if len(self._last_calls) > 5:
            self._last_calls.pop(0)
        # If last 3 calls are identical — it's a loop
        return len(self._last_calls) >= 3 and len(set(self._last_calls[-3:])) == 1

    def requires_confirmation(self, tool_name: str) -> bool:
        return self.require_confirmation and tool_name in self.dangerous_tools
The human-in-the-loop pattern: Before executing any dangerous tool (delete, send, pay), pause the loop and present the proposed action to a human. Only proceed on explicit approval. This is not optional for production agents that touch real data. A runaway agent that sends 10,000 emails is not a hypothetical — it has happened.
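The confirmation gate can be sketched as a thin wrapper around tool execution. The `approve` callback is an assumption: in production it would prompt a person through a UI or chat message, while here lambdas stand in so the example runs unattended. `run_with_confirmation` is a hypothetical name.

```python
from typing import Callable

DANGEROUS_TOOLS = {"send_email", "delete_file", "execute_sql_write", "charge_card"}


def run_with_confirmation(
    tool_name: str,
    tool_input: dict,
    tool_fn: Callable[..., object],
    approve: Callable[[str, dict], bool],
) -> dict:
    """Execute a tool, pausing for approval first if it is on the dangerous list."""
    if tool_name in DANGEROUS_TOOLS and not approve(tool_name, tool_input):
        # Tell the model the action was declined so it can stop or re-plan
        return {"error": "rejected_by_human", "tool": tool_name}
    return {"result": tool_fn(**tool_input)}


sent = []


def send_email(to: str, body: str) -> str:
    sent.append(to)  # stand-in for actually sending mail
    return "sent"


denied = run_with_confirmation("send_email", {"to": "a@example.com", "body": "hi"},
                               send_email, approve=lambda name, args: False)
allowed = run_with_confirmation("send_email", {"to": "b@example.com", "body": "hi"},
                                send_email, approve=lambda name, args: True)
print(denied, allowed, sent)
```

Returning the rejection as a tool result, rather than raising, lets the model explain to the user why the action did not happen.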

Sandboxing means the agent code cannot escape its designated environment. Concretely: (1) File access is limited to a designated working directory — no access to /etc, ~/.ssh, or credentials files. (2) Network access only to an allow-listed set of domains. (3) Code execution runs in a container with no persistent state and a CPU/memory limit. (4) API keys for external services have scoped permissions — read-only where possible.
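Point (1) can be enforced in code before any file tool touches the disk. This is a minimal sketch covering path confinement only; `resolve_in_sandbox` is a hypothetical helper, and network allow-lists, containers, and scoped keys need controls outside the Python process.

```python
from pathlib import Path
import tempfile


def resolve_in_sandbox(workdir: str, requested: str) -> Path:
    """Return the resolved path if it stays inside workdir, else raise."""
    root = Path(workdir).resolve()
    # resolve() collapses ".." and follows symlinks before the containment check
    candidate = (root / requested).resolve()
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"path escapes sandbox: {requested}")
    return candidate


with tempfile.TemporaryDirectory() as workdir:
    print(resolve_in_sandbox(workdir, "data/report.csv").name)  # report.csv
    for attempt in ("../secrets.txt", "/etc/passwd"):
        try:
            resolve_in_sandbox(workdir, attempt)
        except PermissionError as err:
            print("blocked:", err)
```

Checking the resolved path rather than the raw string is what defeats `..` segments and absolute paths.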

Guardrail Trip Visualizer

Watch an agent loop run and see which guardrails trip. Adjust the limits and observe the difference.

An agent calls send_email repeatedly because its previous call seemed to fail. What guardrail would catch this?

Chapter 8: Interactive Agent Simulator

This is the payoff. A fully interactive agent running in your browser — no API key needed. The agent has three tools: a calculator, a web search (simulated), and a file reader (reads from a mini in-memory filesystem). Choose a task, watch it reason step by step, inject errors, and cap the tool budget.

Agent Simulator

Chapter 9: Connections — Where Agents Fit

You've built the mental model: function calling → tool definitions → ReAct → multi-step loops → state management → error handling → guardrails. Now let's zoom out and see how this lesson connects to the wider landscape of agent technology.

Agents → MCP (Model Context Protocol)

The tool definitions you write today are vendor-specific: each API has its own JSON format. MCP (the Model Context Protocol, an open standard introduced by Anthropic) standardizes this: tools are described once and work across any MCP-compatible client. Think of it as USB for agent tools. Instead of writing custom integrations for Claude, GPT-4, and Gemini separately, you write an MCP server once. This is the direction the industry is heading.

Agents → RAG (Retrieval-Augmented Generation)

RAG is agents applied to knowledge retrieval. The "retrieve" step is just a tool call to a vector database. The model asks: "Search my docs for information about X." The retrieval tool returns relevant chunks. The model synthesizes. What makes it an agent pattern — not just a two-step pipeline — is that a ReAct agent can retrieve multiple times, follow up on what it finds, and know when it has enough context to answer.

Agents → Autonomous Workflows

Autonomous workflows run without a human in the loop — the agent reads a task queue, executes tasks, writes results, and moves to the next. This requires robust error handling, comprehensive logging, and clear stopping criteria. The engineering is identical to what you've learned — the difference is operational: nobody is watching, so every failure mode must be handled in code, not handed to a human.

Agents → Multi-Agent Systems

The most powerful systems have multiple agents collaborating. An orchestrator agent breaks down a complex task and delegates subtasks to worker agents, each with specialized tool sets. The orchestrator collects results and synthesizes. This is architecturally analogous to a distributed system: each agent is a service, the orchestrator plays the role of an API gateway, and the message history is the shared data store.

Agent Ecosystem Map

Click a concept to see how it connects to the agent fundamentals from this lesson.

Concept | What you learned here | What's next
Function calling | Single tool round-trip | Evaluating agent accuracy
Tool definitions | JSON Schema for tools | MCP server spec, OpenAPI
ReAct loop | Thought/Action/Observation | Tree-of-thought, MCTS planning
State management | Truncation + summarization | Long-term memory, vector stores
Guardrails | Loop detection, budgets | Constitutional AI, policy models
"What I cannot create, I do not understand." — You now understand agents well enough to build one. Start with a single tool (a calculator or a web search stub). Add the loop. Add one more tool. Each layer you add deepens the understanding. The agent you build for yourself teaches more than any article.
MCP (Model Context Protocol) addresses which limitation of current tool use?