Language Models Can Teach Themselves to Use Tools — a self-supervised pipeline where the LLM annotates its own training data with API calls, executes them, and keeps only the calls that reduce perplexity.
Ask a large language model "What is 3,847 times 291?" and it will confidently produce a number. Often the wrong number. Ask it "Who won the 2023 Super Bowl?" and it will either hallucinate an answer or give you 2021's winner — because its training data has a cutoff date. Ask it to translate a rare language pair and it'll stumble where a dedicated translator would succeed.
These aren't bugs in the model. They're category errors. We're asking a pattern-matching machine to do things that pattern-matching is fundamentally bad at:
| Task | Why LLMs Fail | What Humans Do |
|---|---|---|
| Arithmetic | Tokens are subwords, not numbers. Depending on the tokenizer, "3847" may be split into pieces such as ["38","47"]. The model has no ALU. | Use a calculator |
| Current facts | Training data has a cutoff. The model literally doesn't know what happened after that date. | Search the web |
| Rare translations | Low-resource languages have too few training examples for reliable generation. | Use a translation service |
| Today's date | The model was trained months or years ago. It has no clock. | Check a calendar |
| Specific facts | The model may have seen the answer once in training but can't reliably retrieve it. | Look it up in an encyclopedia |
Humans solved this problem thousands of years ago. We don't do arithmetic in our heads for large numbers — we use tools. We don't memorize every fact — we look things up. We don't translate by guessing — we consult dictionaries. Tool use is arguably what makes human intelligence so powerful: we offload tasks that our brains are bad at to external systems that are purpose-built for them.
Before Toolformer, the few approaches to LLM tool use all required human annotation. Someone had to manually label when the model should call a calculator, or write hand-crafted prompts, or build tool-use datasets. This doesn't scale. There are too many tools, too many contexts, and too many edge cases for human annotators to cover.
Toolformer's breakthrough is a fully self-supervised pipeline: the model generates its own training data for tool use, executes the tool calls to get results, and filters out unhelpful calls — all without any human labeling. The only human input is a handful of examples showing the API call format (about 5 per tool).
Before the model can learn to use tools, we need a way to represent tool calls inside regular text. Toolformer solves this with a simple, elegant format: special tokens that bracket API calls, embedded inline in the text wherever the model decides a tool would help.
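The format uses two special tokens: <API> marks the start of an API call, and </API> marks the end. Between them sits the tool name, its input, an arrow separator, and optionally the result returned by the tool. Written with the brackets and arrow used throughout this article, a call has the form [ToolName(input) → result].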
Let's see this in action with concrete examples for each of the five tools Toolformer supports:
    # Calculator: for arithmetic the model can't do reliably
    The mass of Earth is roughly [Calculator(5.97e24 * 1000) → 5.97e27] grams.

    # Question Answering: for factual recall
    The Eiffel Tower is in [QA("Where is the Eiffel Tower?") → Paris, France], built in 1889.

    # Wikipedia Search: for detailed information
    Photosynthesis [WikiSearch("photosynthesis") → ...converts light energy...] is how plants make food.

    # Machine Translation: for text in other languages
    The French word "chat" means [MT("chat", French, English) → cat].

    # Calendar: for time-relative questions
    Today is [Calendar() → February 1, 2023], a Wednesday.
Two critical design decisions make this format work:
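In practice, the markers are written as the ordinary character sequences [, ] and →, which stand in for <API>, </API> and the result separator, so the model's vocabulary does not have to change. During fine-tuning, the model learns to generate these sequences at appropriate positions. During inference, decoding proceeds normally until the model emits the → of a call; at that point we pause generation, execute the API call, splice the result and the closing bracket in, and then resume generation.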
Not every sentence needs a tool call. The model should call a calculator for "3847 times 291" but not for "one plus one." It should search Wikipedia for obscure facts but not for "the sky is blue." The position in the text matters too — calling a calculator before stating a number is useful (it provides the answer), but calling it after is pointless (the number is already in the text).
This is where the self-supervised pipeline comes in. Instead of having humans label good call positions, the model figures this out by itself. The key insight: a good tool call is one that makes the model's predictions better. If inserting a calculator call before a number helps the model predict that number more accurately, it's a good call. If it doesn't help — or makes things worse — discard it.
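The pause, execute, splice, resume loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the scripted list of chunks stands in for a real decoder, `decode_with_tools` and `toy_multiply` are hypothetical names, and the regular expression assumes tool arguments never contain square brackets.

```python
import re
from typing import Callable, Dict, Iterable

# An unfinished call at the very end of the text, e.g. "[Calculator(3847 * 291) →".
# Assumption: arguments containing square brackets are not handled in this sketch.
PENDING_CALL = re.compile(r"\[(\w+)\(([^\[\]]*)\) →$")


def decode_with_tools(chunks: Iterable[str], tools: Dict[str, Callable[[str], str]]) -> str:
    """Accumulate generated text, executing a tool whenever a call ends in '→'."""
    text = ""
    for chunk in chunks:  # each chunk stands in for one decoding step
        text += chunk
        match = PENDING_CALL.search(text)
        if match:
            name, argument = match.groups()
            result = tools[name](argument)  # decoding is paused while the tool runs
            text += f" {result}]"           # splice in the result and close the call
    return text


def toy_multiply(expression: str) -> str:
    """Hypothetical stand-in tool: handles only 'a * b' with integers."""
    left, right = expression.split("*")
    return str(int(left) * int(right))


# A scripted "model": in a real system each chunk would be sampled from the LM,
# conditioned on the text so far (including any spliced-in results).
scripted_model = ["The product is ", "[Calculator(3847 * 291) →", " units."]
annotated = decode_with_tools(scripted_model, {"Calculator": toy_multiply})
print(annotated)
```

In a real decoder the text after the spliced result would be generated by the model with the result in its context, which is exactly what lets the tool output influence the following tokens.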
Here's the challenge: we want to fine-tune a language model to insert API calls at useful positions in text. But we have no labeled data. No human has gone through millions of sentences marking "calculator call here" or "search here." How do we get training data?
Toolformer's answer: make the model label itself. The pipeline has three stages, and each one is clever in its own right.
For each text x in the dataset (a subset of CCNet), and for each position i in that text, we ask: should the model insert a tool call here? To do this, we prompt the pretrained model with a handful of examples (about 5 per tool) showing how API calls are used in context:
    # Few-shot prompt for Calculator tool call generation
    Your task is to add calls to a Calculator API. Here are examples of how to use it:

    Input: The number in the next row is [Calculator(expression)].
    Output: The number in the next row is [Calculator(expression) → result].

    Input: <text with position i marked>
    Output:
The model generates up to k = 5 candidate API calls per position using nucleus sampling (top-p = 0.1 for focused candidates). Not every position gets candidates — the model often declines to insert a call, which is correct behavior for positions where no tool would help.
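To make the sampling setting concrete, here is a small sketch of nucleus (top-p) filtering over a toy next-token distribution. The token strings and probabilities are made up for illustration, and `nucleus_filter` and `sample_token` are hypothetical helper names. With a value as small as top-p = 0.1, the nucleus usually collapses to the single most likely token, which is why the candidates are so focused.

```python
import random
from typing import Dict


def nucleus_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of most-likely tokens whose total mass reaches top_p."""
    kept: Dict[str, float] = {}
    total = 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        total += p
        if total >= top_p:
            break
    # Renormalise so the kept probabilities sum to 1.
    return {token: p / total for token, p in kept.items()}


def sample_token(probs: Dict[str, float], top_p: float, rng: random.Random) -> str:
    """Draw one token from the renormalised nucleus."""
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up distribution over the next token at some position i.
toy_probs = {"[Calculator(": 0.6, " the": 0.3, " a": 0.1}
focused = nucleus_filter(toy_probs, top_p=0.1)
broader = nucleus_filter(toy_probs, top_p=0.8)
print(focused, broader)
```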
Each candidate API call is actually executed. If the model generated [Calculator(3847 * 291)], we run that calculation and get 1,119,477. If it generated [WikiSearch("photosynthesis")], we query Wikipedia and get the first sentence of the most relevant article. The result is spliced into the text at position i.
This produces three versions of each text around position i:
| Version | What It Contains |
|---|---|
| With result | ...text_before [Calculator(3847*291) → 1119477] text_after... |
| Without result | ...text_before [Calculator(3847*291) → ] text_after... |
| Original | ...text_before text_after... (no API call at all) |
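The three versions in the table can be produced with simple string splicing. The sketch below is an illustration under stated assumptions: `build_versions` is a hypothetical helper, and it splits on a character offset for readability, whereas a real pipeline would work on token positions.

```python
from typing import Dict


def build_versions(text: str, position: int, call: str, result: str) -> Dict[str, str]:
    """Return the three variants of `text` compared during filtering."""
    before, after = text[:position], text[position:]
    return {
        "with_result": f"{before}[{call} → {result}] {after}",
        "without_result": f"{before}[{call} → ] {after}",
        "original": text,
    }


sentence = "The product is 1119477 units."
position = len("The product is ")  # insert the call right before the number
versions = build_versions(sentence, position, "Calculator(3847*291)", "1119477")
for name, variant in versions.items():
    print(f"{name:>15}: {variant}")
```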
Now comes the filtering step — the most important part of the entire pipeline.
Schick et al. ran this pipeline on a subset of CCNet (about 10% of C4). For each of the five tools, they sampled candidates across millions of text positions. After filtering (covered in the next chapter), roughly 25,000 API calls survived per tool — about 100,000 total annotated calls across the entire dataset. These were mixed back into the original (unannotated) data for fine-tuning.
The model used for both candidate generation and fine-tuning was GPT-J (6.7B parameters). The entire pipeline — sampling, executing, filtering, fine-tuning — runs without any human labeling. The only human input is those ~25 few-shot examples (5 per tool).
Stage 2 gave us thousands of candidate API calls, each executed with a real result. But many of these calls are useless. The model might insert a calculator call for "1+1" (unnecessary — the model already knows it's 2). It might search Wikipedia for "water is wet" (no new information). It might insert a call in the wrong position (after the answer is already stated).
We need an automatic way to distinguish helpful calls from useless ones. Toolformer's answer is brilliantly simple: a tool call is useful if and only if it reduces the model's perplexity on the tokens that follow it.
For each candidate API call at position i with input c and result r, we compute two losses. Write e(c, r) for the text of the call with its result, e(c, ε) for the call with an empty result, and L(ε) for the loss with no call at all.
$$L^{+} = L\big(e(c, r)\big)$$
This is the model's cross-entropy loss on the tokens following the API call with the result included. If the result is useful, the model should predict the following tokens more accurately.
$$L^{-} = \min\big(L(\varepsilon),\; L(e(c, \varepsilon))\big)$$
This is the better of two baselines: (1) the loss with no API call at all, and (2) the loss with the API call but an empty result. We take the minimum because we want to be conservative: the tool call must beat the best alternative.
The filtering rule: keep the call if and only if
$$L^{-} - L^{+} \geq \tau$$
where τ is a threshold. The paper uses τ = 1.0 as the default, meaning the tool call must reduce the loss by at least 1 nat. This is a high bar: the tool result must make the model substantially more confident about what comes next.
The second baseline — L(e(c, ε)), loss with the API call but no result — catches a subtle failure mode. Sometimes the act of inserting a call itself helps, even without a result. For example, inserting [Calculator(3847*291) → ] might prime the model to guess a plausible number. We don't want to keep calls that help only because of this priming effect — the tool must contribute actual information.
By requiring the call-with-result to beat the call-without-result, we ensure that the result specifically is what provides value, not just the format of the call.
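The filtering rule is easy to express in code. The following is a minimal sketch with made-up token probabilities: `cross_entropy` and `keep_call` are hypothetical names, and the loss is a plain unweighted sum of negative log-probabilities, which is a simplification (the paper weights the following tokens rather than treating them all equally).

```python
import math
from typing import Sequence


def cross_entropy(token_probs: Sequence[float]) -> float:
    """Sum of negative log-probabilities (in nats) of the tokens after position i."""
    return -sum(math.log(p) for p in token_probs)


def keep_call(loss_with_result: float, loss_empty_result: float,
              loss_no_call: float, tau: float = 1.0) -> bool:
    """Keep the call only if the result beats the better baseline by at least tau."""
    l_plus = loss_with_result
    l_minus = min(loss_no_call, loss_empty_result)
    return l_minus - l_plus >= tau


# Case 1: the result itself makes the next token much more likely.
helpful = keep_call(
    cross_entropy([0.9, 0.8]),   # call with result
    cross_entropy([0.3, 0.8]),   # call with empty result
    cross_entropy([0.2, 0.8]),   # no call at all
)

# Case 2: the empty call already primes the model, so the result adds little.
priming_only = keep_call(
    cross_entropy([0.9]),
    cross_entropy([0.85]),
    cross_entropy([0.2]),
)
print(helpful, priming_only)
```

The second case is the one the empty-result baseline exists for: compared with no call at all the gain looks large, but almost all of it comes from the call format rather than from the result.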
Schick et al. experimented with several values of τ:
| τ | Effect | Calls Kept |
|---|---|---|
| 0.0 | Keep any call that doesn't hurt. Too permissive — floods text with marginal calls. | Many |
| 0.5 | Moderate filtering. Some borderline calls survive. | Medium |
| 1.0 | Default. Only clearly helpful calls survive. Good balance. | ~25K/tool |
| 2.0 | Very strict. Only the most impactful calls survive. Too few training examples. | Few |
The optimal τ balances precision (only helpful calls) against recall (enough training examples). Too low and the model learns to insert unnecessary calls. Too high and it doesn't learn to call tools at all.
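The trade-off can be seen by sweeping τ over a set of candidate calls. The gains below are invented numbers standing in for L− − L+ values, and `surviving` is a hypothetical helper; the point is only how quickly the surviving set shrinks as the threshold rises.

```python
from typing import List

# Made-up values of (L- minus L+) for eight candidate calls.
toy_gains: List[float] = [-0.4, 0.1, 0.3, 0.6, 0.9, 1.2, 1.7, 2.5]


def surviving(gains: List[float], tau: float) -> List[float]:
    """Return the gains of the calls that pass the filter at threshold tau."""
    return [g for g in gains if g >= tau]


counts = {tau: len(surviving(toy_gains, tau)) for tau in (0.0, 0.5, 1.0, 2.0)}
for tau, kept in counts.items():
    print(f"tau={tau}: {kept} of {len(toy_gains)} calls kept")
```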
Toolformer demonstrates its pipeline with five specific tools. Each has a different input format, a different external system behind it, and a different kind of gap it fills. Understanding when the model chooses each one — and when it doesn't — reveals a lot about what LLMs actually struggle with.
The Calculator is the simplest tool: it takes a mathematical expression as a string, evaluates it, and returns the numeric result.
Format:
    [Calculator(expression) → result]
    # Example:
    "The area is [Calculator(3.14 * 12.5 * 12.5) → 490.625] square meters."
The model learns to invoke the calculator for multi-digit multiplication, division, and expressions with multiple operations. It does not learn to call it for trivial arithmetic ("2+2") because those calls fail the perplexity filter — the model already knows the answer.
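A calculator backend of this kind can be sketched with Python's `ast` module, which parses the expression without executing arbitrary code. This is an assumption about how such a tool might be built, not the paper's implementation; the function name `calculator` and the choice to support only the four basic operations are illustrative.

```python
import ast
import operator

# Supported operators; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculator(expression: str) -> float:
    """Evaluate +, -, *, / over numeric literals; raise ValueError otherwise."""

    def evaluate(node: ast.AST):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](evaluate(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    return evaluate(ast.parse(expression, mode="eval").body)


print(calculator("3847 * 291"))
print(calculator("3.14 * 12.5 * 12.5"))
```

Walking the syntax tree instead of calling `eval` matters here because the expression string is produced by a language model and should be treated as untrusted input.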
Question Answering (QA) is backed by Atlas, a retrieval-augmented QA system. It takes a natural language question and returns a short answer.
Format:
    [QA("question") → answer]
    # Example:
    "The first president [QA("Who was the first US president?") → George Washington] served two terms."
The QA tool handles factual recall that the model is uncertain about. The model learns to phrase questions in natural language and to position the call right before it needs the factual information.
WikiSearch queries the Wikipedia API and returns the first sentence of the most relevant article. It is useful for providing context about entities, events, and concepts.
Format:
    [WikiSearch("query") → first_sentence]
    # Example:
    "[WikiSearch("Solvay Conference") → The Solvay Conferences...] brought together the greatest physicists."
Machine Translation (MT) uses a neural MT system (a 600M-parameter model). It takes the source text, source language, and target language.
Format:
    [MT("text", source_lang, target_lang) → translation]
    # Example:
    "The motto [MT("E pluribus unum", Latin, English) → Out of many, one] appears on..."
Calendar has the simplest interface: it takes no arguments and returns today's date. This is the one tool the model literally cannot replicate, because it has zero information about the current date.
Format:
    [Calendar() → Today is Wednesday, February 1, 2023.]
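A Calendar backend is little more than a formatted clock read. The sketch below reproduces the output format shown above; the function name `calendar` and the optional `today` parameter (added so the example is reproducible) are assumptions, and the weekday and month names rely on Python's default English locale.

```python
from datetime import date
from typing import Optional


def calendar(today: Optional[date] = None) -> str:
    """Return the current date in the sentence format used by the Calendar tool."""
    d = today or date.today()
    # d.day is used instead of %d to avoid a zero-padded day ("1", not "01").
    return f"Today is {d:%A}, {d:%B} {d.day}, {d.year}."


pinned = calendar(date(2023, 2, 1))
print(pinned)
print(calendar())  # uses the real system clock
```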
The table below summarizes, for each tool, how many calls were kept for training and the kinds of context in which the fine-tuned model tends to invoke it:
| Tool | Calls Kept (training) | Usage Pattern |
|---|---|---|
| Calculator | ~25K | Numbers, arithmetic, percentages, unit conversions |
| QA | ~25K | Named entities, historical facts, specific claims |
| WikiSearch | ~25K | Obscure entities, event details, definitions |
| MT | ~25K | Foreign language text, etymologies, translations |
| Calendar | ~25K | Date-relative phrases: "today," "this week," "currently" |
Does self-supervised tool learning actually work? The results are striking — and reveal both the power and the limits of the approach.
Toolformer (based on GPT-J, 6.7B parameters) was evaluated on multiple benchmarks and compared against much larger models:
| Benchmark | Task | GPT-J (6.7B) | Toolformer (6.7B) | OPT (66B) | GPT-3 (175B) |
|---|---|---|---|---|---|
| SQuAD | QA | 55.2 | 69.1 | 56.7 | 52.6 |
| GoogleRE | Knowledge | 8.3 | 38.6 | 7.3 | 6.3 |
| T-REx | Knowledge | 37.4 | 53.1 | 39.3 | 38.8 |
| Math (ASDiv) | Arithmetic | 29.5 | 44.0 | 15.6 | 52.9 |
| MLQA | Multilingual QA | 24.7 | 33.5 | 23.8 | 28.4 |
| TempLAMA | Temporal QA | 24.5 | 30.4 | 22.1 | 23.6 |
The pattern is clear. On tasks that benefit from tool use — factual QA, knowledge retrieval, arithmetic, temporal questions — Toolformer (6.7B) outperforms OPT (66B), a model 10x its size. On some benchmarks, it's competitive with GPT-3 (175B), a model 26x its size.
The gains are not uniform. Tools help most when the task requires information the model doesn't have:
Large gains (tools help a lot):
Modest or no gains (tools don't help):
A crucial concern: by fine-tuning the model to insert API calls, does its general language modeling ability suffer? Schick et al. checked this by measuring perplexity on standard language modeling benchmarks. The answer: no degradation. Toolformer's perplexity on standard text is essentially unchanged from GPT-J. The model learns tool use as an additional capability, not a replacement for existing ones.
This happens because the filtered API calls are mixed back into the original (unmodified) training data. The vast majority of training examples have no API calls — the model sees tool-augmented text only where tools are genuinely helpful. This prevents catastrophic forgetting of general language capabilities.
Toolformer is a proof of concept, not the final word. Its limitations reveal the hardest unsolved problems in LLM tool use — problems that subsequent work (like Gorilla, ToolLLM, and modern agent frameworks) is still working to solve.
Toolformer can insert one API call at a time. It cannot chain calls — using the output of one tool as the input to another. Real-world tool use often requires composition:
    # What Toolformer can't do: multi-step reasoning requiring chained tools
    "How many days until the next US presidential inauguration?"
    # This requires:
    #   1. [Calendar() → Today is Feb 1, 2023]
    #   2. [QA("When is the next US inauguration?") → January 20, 2025]
    #   3. [Calculator(days_between(Feb 1 2023, Jan 20 2025)) → 719]
    # Toolformer can only do ONE of these, not chain them.
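For contrast, here is what the chain looks like when ordinary code does the orchestration, with each tool's output feeding the next call. All three tool functions are hypothetical stubs: the calendar is pinned to the date used in the example and the QA answer is canned, so the only real computation is the date arithmetic.

```python
from datetime import date


def calendar_tool() -> date:
    """Stub calendar, pinned to the date used in the example above."""
    return date(2023, 2, 1)


def qa_tool(question: str) -> date:
    """Stub QA system returning a canned answer for the one question we ask."""
    answers = {"When is the next US inauguration?": date(2025, 1, 20)}
    return answers[question]


def days_between(start: date, end: date) -> int:
    """Calculator step: number of days from start to end."""
    return (end - start).days


# Each step consumes the output of the previous ones: this is the chaining
# that a single inline call cannot express.
today = calendar_tool()
inauguration = qa_tool("When is the next US inauguration?")
days_left = days_between(today, inauguration)
print(days_left)
```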
All five tools have simple input/output formats: a string in, a string out. Real-world APIs are far more complex: they have authentication, multiple endpoints, structured JSON responses, pagination, error handling, rate limits, and stateful sessions. Toolformer's format can't represent:
| Real-World Complexity | Why Toolformer Can't Handle It |
|---|---|
| Multi-turn API sessions | Each call is independent, no state between calls |
| Structured JSON outputs | Results are flattened to a single text string |
| Error handling and retries | No mechanism to detect and recover from API failures |
| Authentication | No concept of API keys, OAuth, or tokens |
| Conditional logic | "If search returns X, then calculate Y" is impossible |
Toolformer decides to use tools in a single pass during text generation. It cannot use tools interactively: examining a result, deciding whether to try a different query, or refining its approach based on what the tool returns. Each tool call is fire-and-forget.
Adding a new tool requires re-running the entire annotation pipeline: writing few-shot examples, sampling candidates, executing calls, filtering, and re-fine-tuning. You can't add tools dynamically at inference time — the model must be trained on examples of each tool's usage.
The paper's main results use GPT-J (6.7B). Its scaling analysis covers only smaller GPT-2 models, where the ability to benefit from tools emerges at around 775M parameters, so it remains unclear how the approach scales upward: do larger models need fewer tool call examples? Do they learn more complex tool usage patterns? Do they spontaneously learn to chain calls? These questions were left unanswered by the paper (though subsequent work has partially addressed them).
Toolformer sits at a pivotal moment in the evolution of LLM tool use. It proved that self-supervised tool learning is possible. Everything that came after either builds on or reacts to its ideas.
| System | Year | Relation to Toolformer |
|---|---|---|
| ReAct | 2023 | Interleaves reasoning and tool calls. Solves the "no chaining" limitation via prompting, not self-supervised learning. |
| Gorilla | 2023 | Scales tool learning to 1,600+ real APIs using a retriever to find relevant API docs. Addresses the "fixed tool set" limitation. |
| ToolLLM | 2023 | 16,000+ real-world APIs with multi-step planning. Uses a decision tree (DFSDT) for tool call chains. |
| GPT-4 Function Calling | 2023 | Production-grade tool use. Dynamic tool definitions at inference time, structured JSON I/O, parallel calls. Likely trained with supervised data, not self-supervised. |
| Claude Tool Use | 2024 | Similar to GPT-4 function calling. Tool schemas defined per-request. No pre-training pipeline needed. |
| LangChain / LlamaIndex Agents | 2023+ | Orchestration frameworks that solve chaining, error handling, and multi-tool composition at the application layer — above the model. |
Tool use in LLMs has evolved along two parallel tracks:
Track 1: Model-level tool learning (Toolformer's approach)
Teach the model itself to generate tool calls. The model decides when, which, and how to call tools. The learning happens at training time, so new tools require retraining, and the supervision signal is self-supervised (perplexity reduction).
Track 2: System-level tool orchestration (Modern agents)
Keep the model as a general reasoner. Define tools as external schemas, let the model see tool descriptions at inference time, and have it generate structured calls. No retraining is needed for new tools, but the interfaces are designed by humans.
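A rough sketch of the Track 2 pattern: tools are described by a schema that is shown to the model at inference time, and the model's structured output is dispatched by ordinary code. The registry layout, the `multiply` tool, and the JSON call shape below are all hypothetical and do not correspond to any specific vendor's function-calling API.

```python
import json
from typing import Any, Dict

# Hypothetical registry: tool name -> description, parameter types, and handler.
TOOLS: Dict[str, Dict[str, Any]] = {
    "multiply": {
        "description": "Multiply two numbers.",
        "parameters": {"a": "number", "b": "number"},
        "handler": lambda a, b: a * b,
    },
}


def describe_tools(registry: Dict[str, Dict[str, Any]]) -> str:
    """Serialise everything except the handlers, for inclusion in a prompt."""
    public = {
        name: {key: value for key, value in spec.items() if key != "handler"}
        for name, spec in registry.items()
    }
    return json.dumps(public, indent=2)


def dispatch(registry: Dict[str, Dict[str, Any]], call_json: str) -> Any:
    """Validate and execute a structured call emitted by the model."""
    call = json.loads(call_json)
    spec = registry[call["tool"]]  # raises KeyError for an unknown tool
    arguments = call["arguments"]
    if set(arguments) != set(spec["parameters"]):
        raise ValueError(f"arguments do not match schema for {call['tool']!r}")
    return spec["handler"](**arguments)


# Stand-in for model output: a structured call rather than inline text.
model_output = '{"tool": "multiply", "arguments": {"a": 3847, "b": 291}}'
print(describe_tools(TOOLS))
print(dispatch(TOOLS, model_output))
```

Adding a tool here means adding a registry entry, with no retraining step, which is the practical difference from the training-time approach of Track 1.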
Track 2 has won commercially — it's what GPT-4, Claude, and Gemini use. But Track 1's insight remains powerful: a model that can discover when tools are useful, without being told, has a deeper kind of tool competence. The ideal system would combine both: self-supervised discovery of when tools help (Track 1) with flexible, dynamic tool definition at inference time (Track 2).
The problems Toolformer couldn't solve — chaining, dynamic tools, interactive refinement, complex I/O — are precisely the problems that define the current frontier of LLM agents. Understanding Toolformer's limitations is understanding the roadmap for modern AI agent research.
"The real power of a mind is not what it can do, but what it knows to delegate."