Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom (Meta AI) — 2023

Toolformer: Self-Taught Tool Use

Language Models Can Teach Themselves to Use Tools — a self-supervised pipeline where the LLM annotates its own training data with API calls, executes them, and keeps only the calls that reduce perplexity.

Prerequisites: Autoregressive language modeling + Perplexity + Basic probability. That's it.
9 chapters · 8+ simulations · 0 assumed knowledge

Chapter 0: The Tool Gap

Ask a large language model "What is 3,847 times 291?" and it will confidently produce a number. Often the wrong number. Ask it "Who won the 2023 Super Bowl?" and it will either hallucinate an answer or give you 2021's winner — because its training data has a cutoff date. Ask it to translate a rare language pair and it'll stumble where a dedicated translator would succeed.

These aren't bugs in the model. They're category errors. We're asking a pattern-matching machine to do things that pattern-matching is fundamentally bad at:

Task | Why LLMs Fail | What Humans Do
Arithmetic | Tokens are subwords, not numbers. A number like "3847" may be split into pieces such as ["38", "47"]. The model has no ALU. | Use a calculator
Current facts | Training data has a cutoff. The model literally doesn't know what happened after that date. | Search the web
Rare translations | Low-resource languages have too few training examples for reliable generation. | Use a translation service
Today's date | The model was trained months or years ago. It has no clock. | Check a calendar
Specific facts | The model may have seen the answer once in training but can't reliably retrieve it. | Look it up in an encyclopedia

Humans solved this problem thousands of years ago. We don't do arithmetic in our heads for large numbers — we use tools. We don't memorize every fact — we look things up. We don't translate by guessing — we consult dictionaries. Tool use is arguably what makes human intelligence so powerful: we offload tasks that our brains are bad at to external systems that are purpose-built for them.

The core insight of Toolformer: Instead of trying to make LLMs better at arithmetic, retrieval, and real-time knowledge (which is fighting the architecture), teach them to call external tools for these tasks. And crucially: teach them to do this without human supervision. The model itself decides when and where to insert tool calls, using nothing but perplexity as a signal.

Before Toolformer, the few approaches to LLM tool use all required human annotation. Someone had to manually label when the model should call a calculator, or write hand-crafted prompts, or build tool-use datasets. This doesn't scale. There are too many tools, too many contexts, and too many edge cases for human annotators to cover.

Toolformer's breakthrough is a fully self-supervised pipeline: the model generates its own training data for tool use, executes the tool calls to get results, and filters out unhelpful calls — all without any human labeling. The only human input is a handful of examples showing the API call format (about 5 per tool).

The Tool Gap

Click "Generate" to see an LLM attempt tasks it's bad at. Then click "With Tools" to see how tool calls fix each failure. The gap between the two is the tool gap.

Why can't you fix an LLM's arithmetic failures just by training it on more math data?

Chapter 1: API Call Format

Before the model can learn to use tools, we need a way to represent tool calls inside regular text. Toolformer solves this with a simple, elegant format: special tokens that bracket API calls, embedded inline in the text wherever the model decides a tool would help.

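The format uses three special tokens: <API> marks the start of an API call, </API> marks the end, and → separates the call from its result. In practice the paper writes these as the ordinary token sequences "[", "]" and "->", which is why the examples here use square brackets. Between the brackets sits the tool name, its input, the arrow, and optionally the result returned by the tool: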

text_before [API_name(input) → result] text_after

Let's see this in action with concrete examples for each of the five tools Toolformer supports:

examples
# Calculator: for arithmetic the model can't do reliably
The mass of Earth is roughly [Calculator(5.97e24 * 1000) → 5.97e27] grams.

# Question Answering: for factual recall
The Eiffel Tower is in [QA("Where is the Eiffel Tower?") → Paris, France], built in 1889.

# Wikipedia Search: for detailed information
Photosynthesis [WikiSearch("photosynthesis") → ...converts light energy...] is how plants make food.

# Machine Translation: for text in other languages
The French word "chat" means [MT("chat", French, English) → cat].

# Calendar: for time-relative questions
Today is [Calendar() → February 1, 2023], a Wednesday.
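As a rough sketch of how this inline format can be handled in code, the Python below renders a call as text and recovers calls from an annotated sentence. The `ApiCall` type, the regular expression, and the function names are illustrative assumptions, not something the paper defines, and the pattern is only meant for simple calls like the ones above.

```python
import re
from typing import List, NamedTuple, Optional


class ApiCall(NamedTuple):
    tool: str
    args: str
    result: Optional[str]  # None when the call has not been executed yet


# [Tool(args)] or [Tool(args) → result]; a hypothetical pattern for simple calls.
CALL_RE = re.compile(r"\[(\w+)\((.*?)\)(?:\s*→\s*(.*?))?\]")


def linearize(call: ApiCall) -> str:
    """Render a call as inline text, with or without its result."""
    if call.result is None:
        return f"[{call.tool}({call.args})]"
    return f"[{call.tool}({call.args}) → {call.result}]"


def find_calls(text: str) -> List[ApiCall]:
    """Extract every bracketed call from a piece of annotated text."""
    # re.findall yields "" for a group that did not participate in the match.
    return [ApiCall(tool, args, result or None)
            for tool, args, result in CALL_RE.findall(text)]


sentence = "The mass of Earth is roughly [Calculator(5.97e24 * 1000) → 5.97e27] grams."
calls = find_calls(sentence)
print(calls[0].tool, "|", calls[0].args, "|", calls[0].result)
print(linearize(ApiCall("Calendar", "", None)))
```

Keeping the result optional mirrors the two forms a call takes in the pipeline: before execution it has no result, and after execution the result is spliced in behind the arrow.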

Two critical design decisions make this format work:

Decision 1: Inline placement. The API call is inserted exactly at the position in the text where the tool is needed — not at the beginning or end. This means the model must learn where in a sentence a tool would help, not just whether to call one. The model sees context before the call and must generate context after the result.
Decision 2: The result is embedded in the text. After the arrow (→), the tool's actual output is spliced into the token stream. This means the model can condition its next-token predictions on the tool's result. The result becomes part of the "text" the model sees — it's not in some separate channel.

The paper represents <API>, </API> and → with existing token sequences ("[", "]" and "->"), so the model's vocabulary does not have to change. During fine-tuning, the model learns to generate these markers at appropriate positions. During inference, decoding runs as usual until the model emits the arrow, which signals that it expects a tool response. At that point we pause generation, execute the API call, splice in the result and the closing bracket, and then resume generation.
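The pause-execute-resume loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `next_token` is a hypothetical stand-in for one decoding step of a language model, `scripted_lm` is a toy replacement that replays fixed tokens, and the calculator is a stub that returns one hard-coded product.

```python
import re
from typing import Callable, Dict, Optional

# Matches an unfinished call at the very end of the text, e.g. "[Calculator(2 * 3) →".
PENDING_CALL = re.compile(r"\[(\w+)\(([^\[\]]*)\) →$")


def decode_with_tools(
    next_token: Callable[[str], Optional[str]],
    tools: Dict[str, Callable[[str], str]],
    prompt: str,
    max_steps: int = 50,
) -> str:
    """Generate text, pausing to run a tool whenever the arrow is emitted."""
    text = prompt
    for _ in range(max_steps):
        token = next_token(text)  # hypothetical stand-in for one LM decoding step
        if token is None:         # the stand-in signals end of generation
            break
        text += token
        pending = PENDING_CALL.search(text)
        if pending:
            tool_name, argument = pending.groups()
            result = tools[tool_name](argument)  # execute the external tool
            text += f" {result}]"                # splice in result, close the call
    return text


def scripted_lm(script):
    """Toy 'model' that replays a fixed list of tokens and ignores the context."""
    tokens = iter(script)
    return lambda text: next(tokens, None)


tools = {"Calculator": lambda expr: str(3847 * 291)}  # stub tool for the demo
script = [" [Calculator(", "3847 * 291", ")", " →", " seconds."]
output = decode_with_tools(scripted_lm(script), tools, "That is")
print(output)
```

The tokens that follow the call are generated with the tool result already in the context, which is the whole point of splicing it into the token stream.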

What makes a good call position?

Not every sentence needs a tool call. The model should call a calculator for "3847 times 291" but not for "one plus one." It should search Wikipedia for obscure facts but not for "the sky is blue." The position in the text matters too — calling a calculator before stating a number is useful (it provides the answer), but calling it after is pointless (the number is already in the text).

This is where the self-supervised pipeline comes in. Instead of having humans label good call positions, the model figures this out by itself. The key insight: a good tool call is one that makes the model's predictions better. If inserting a calculator call before a number helps the model predict that number more accurately, it's a good call. If it doesn't help — or makes things worse — discard it.

API Call Anatomy

Click on different parts of the API call to see what each component does. Use the dropdown to switch between tool types.

Why are Toolformer API calls placed inline at a specific position in the text, rather than appended at the end?

Chapter 2: The Annotation Pipeline

Here's the challenge: we want to fine-tune a language model to insert API calls at useful positions in text. But we have no labeled data. No human has gone through millions of sentences marking "calculator call here" or "search here." How do we get training data?

Toolformer's answer: make the model label itself. The pipeline has three stages, and each one is clever in its own right.

Stage 1: Sample Candidates
Prompt the model with a few examples of API call usage. For each position in a text, ask: "Would a tool call help here?" Generate candidate API calls.
Stage 2: Execute Calls
Actually run each candidate API call against the real tool. Get the result. Splice it into the text.
Stage 3: Filter by Perplexity
Compare model perplexity with vs. without the API call result. Keep only calls that reduce perplexity by more than a threshold. Discard the rest.
Fine-tune
Train the original model on the filtered dataset — text with useful API calls embedded. The model learns when and how to invoke tools.

Stage 1: Sampling candidates

For each text x in the dataset (a subset of CCNet), and for each position i in that text, we ask: should the model insert a tool call here? To do this, we prompt the pretrained model with a handful of examples (about 5 per tool) showing how API calls are used in context:

prompt
# Illustrative few-shot prompt for Calculator call generation (wording simplified)
Your task is to add calls to a Calculator API to a piece of text.
You can call the API by writing "[Calculator(expression)]".
Here is an example of an API call:

Input: The number in the next term is 18 + 12 x 3 = 54.
Output: The number in the next term is 18 + 12 x 3 = [Calculator(18 + 12 * 3)] 54.

Input: <text x>
Output: <text up to position i, where a call is sampled>

The model first scores every position i by the probability it assigns to opening an API call there. Only positions whose probability exceeds a sampling threshold are kept, capped at the k = 5 most likely positions per text, and for each kept position the model samples up to m = 5 candidate calls. Most positions get no candidates at all, which is correct behavior for positions where no tool would help.
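One way to picture why most positions end up with no candidates: score each position by the probability the model assigns to opening an API call there, then keep only the few highest-scoring positions above a cutoff. The sketch below assumes those probabilities are already available as a list; the function name, the cutoff of 0.05, and the cap of 5 are illustrative assumptions.

```python
from typing import List, Sequence


def candidate_positions(
    api_start_probs: Sequence[float],
    threshold: float = 0.05,  # assumed cutoff on the probability of opening a call
    k: int = 5,               # assumed cap on positions kept per text
) -> List[int]:
    """Return up to k positions where opening an API call is likely enough."""
    above = [(prob, i) for i, prob in enumerate(api_start_probs) if prob > threshold]
    above.sort(key=lambda item: (-item[0], item[1]))  # most probable first
    return sorted(i for _, i in above[:k])


# Made-up probabilities for a nine-token text.
probs = [0.01, 0.30, 0.04, 0.07, 0.50, 0.06, 0.20, 0.09, 0.02]
print(candidate_positions(probs))       # positions kept with the default cap
print(candidate_positions(probs, k=2))  # a stricter cap keeps only the top two
```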

Why few-shot prompting works here: We're not teaching the model how to use tools from scratch. The pretrained model already "knows" about calculators, search engines, and APIs from its training data. The few-shot examples just teach it the format — the bracketing syntax and the specific tool names. Five examples per tool is enough because the model is already primed to understand tool usage conceptually.

Stage 2: Executing the calls

Each candidate API call is actually executed. If the model generated [Calculator(3847 * 291)], we run that calculation and get 1,119,477. If it generated [WikiSearch("photosynthesis")], we query the search tool and get back a short Wikipedia snippet. The result is spliced into the text at position i.

This produces three versions of each text around position i:

Version | What It Contains
With result | ...text_before [Calculator(3847*291) → 1119477] text_after...
Without result | ...text_before [Calculator(3847*291) → ] text_after...
Original | ...text_before text_after... (no API call at all)
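The versions in the table can be built mechanically once a position, a call, and a result are known. The helper below is a minimal sketch; the function name, the dictionary keys, and the character-offset convention for the position are assumptions made for illustration.

```python
from typing import Dict


def build_versions(text: str, position: int, call: str, result: str) -> Dict[str, str]:
    """Build the three text variants compared by the filter.

    `position` is a character offset into `text` (an illustrative convention).
    """
    before, after = text[:position], text[position:]
    return {
        "with_result": f"{before}[{call} → {result}] {after}",
        "without_result": f"{before}[{call} → ] {after}",
        "original": text,
    }


text = "The product is 1119477."
versions = build_versions(text, len("The product is "), "Calculator(3847*291)", "1119477")
for name, variant in versions.items():
    print(f"{name}: {variant}")
```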

Now comes the filtering step — the most important part of the entire pipeline.

Self-Supervised Annotation Pipeline

Step through the annotation pipeline. Click "Next Step" to see each stage: candidate generation, execution, filtering, and the final annotated text.

The numbers

Schick et al. ran this pipeline on a subset of CCNet. For each of the five tools, they sampled candidates across a large number of text positions. After filtering (covered in the next chapter), the number of surviving calls differed sharply by tool: retrieval-style tools such as Wikipedia Search and QA kept many more calls than Calculator and Machine Translation, whose calls proved useful far less often. The annotated texts were then used for fine-tuning alongside the unannotated ones.

The model used for both candidate generation and fine-tuning was GPT-J (6.7B parameters). The entire pipeline — sampling, executing, filtering, fine-tuning — runs without any human labeling. The only human input is those ~25 few-shot examples (5 per tool).

In Toolformer's annotation pipeline, what is the role of the pretrained model during Stage 1 (candidate sampling)?

Chapter 3: The Perplexity Filter

Stage 2 gave us thousands of candidate API calls, each executed with a real result. But many of these calls are useless. The model might insert a calculator call for "1+1" (unnecessary — the model already knows it's 2). It might search Wikipedia for "water is wet" (no new information). It might insert a call in the wrong position (after the answer is already stated).

We need an automatic way to distinguish helpful calls from useless ones. Toolformer's answer is brilliantly simple: a tool call is useful if and only if it reduces the model's perplexity on the tokens that follow it.

The filtering criterion

For each candidate API call at position i with input c and result r, we compute two losses:

L⁺ = L(e(c, r)) = loss on tokens after [API(c) → r]

This is the model's cross-entropy loss on the tokens following the API call with the result included. If the result is useful, the model should predict the following tokens more accurately.

L⁻ = min( L(ε), L(e(c, ε)) )

This is the better of two baselines: (1) the loss with no API call at all, and (2) the loss with the API call but an empty result. We take the minimum because we want to be conservative — the tool call must beat the best alternative.
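For readers who want the loss spelled out, a sketch in the paper's style of notation follows. The indexing is reconstructed rather than quoted, so treat it as an assumption: z is the prefix given to the model (empty, the call alone, or the call with its result), x_1 ... x_n are the tokens of the text, p_M is the model's next-token distribution, and w is a sequence of weights that emphasizes tokens close to position i.

```latex
L_i(\mathbf{z}) = -\sum_{j=i}^{n} w_{j-i} \, \log p_M\!\left(x_j \mid \mathbf{z},\, x_{1:j-1}\right)
\qquad
L_i^{+} = L_i\big(e(c, r)\big),
\qquad
L_i^{-} = \min\big(L_i(\varepsilon),\; L_i(e(c, \varepsilon))\big)
```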

The filtering rule: keep the call if and only if

L⁻ − L⁺ ≥ τ

where τ is a threshold. The paper uses τ = 1.0 as the default, meaning the tool call must reduce log-loss by at least 1 nat. This is a high bar — it means the tool result must make the model substantially more confident about what comes next.
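The rule is easy to express in code once per-token probabilities are available. The sketch below assumes the probabilities of the true next tokens under each of the three contexts have already been computed by a model; the weights, the function names, and the example numbers are all made up for illustration.

```python
import math
from typing import Sequence


def weighted_loss(token_probs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted negative log-likelihood of the tokens that follow the call position."""
    return -sum(w * math.log(p) for w, p in zip(weights, token_probs))


def filter_gain(with_result, no_call, empty_result, weights) -> float:
    """Loss reduction from the tool result relative to the better of the two baselines."""
    loss_plus = weighted_loss(with_result, weights)
    loss_minus = min(weighted_loss(no_call, weights),
                     weighted_loss(empty_result, weights))
    return loss_minus - loss_plus


def keep_call(with_result, no_call, empty_result, weights, tau: float = 1.0) -> bool:
    return filter_gain(with_result, no_call, empty_result, weights) >= tau


weights = [0.5, 0.3, 0.2]  # assumed weights, heaviest on the first following token

# Hard multiplication: the result makes the next tokens far more predictable.
helpful = keep_call([0.9, 0.9, 0.9], [0.05, 0.5, 0.8], [0.1, 0.5, 0.8], weights)

# Trivial sum: the model was already confident, so the result adds almost nothing.
useless = keep_call([0.96, 0.9, 0.9], [0.95, 0.9, 0.9], [0.95, 0.9, 0.9], weights)

print(helpful, useless)
```

Taking the minimum over the two baselines inside `filter_gain` is what makes the test conservative: the result has to beat whichever baseline is stronger.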

How the rule plays out: Consider a calculator call for "1+1=2." The model already knows the answer without any tool, so L(ε) (no call) is already low. The calculator returns "2", which adds nothing: L⁺ ≈ L(ε), so the gap between the baseline L⁻ (the minimum of the two baseline losses above) and L⁺ is roughly 0, below τ, and the call is filtered out. Now consider "3847 × 291." The model is uncertain about the answer, so both baselines are high. The calculator returns 1,119,477, making the model very confident: L⁺ ≪ L⁻, so L⁻ − L⁺ ≫ τ and the call is kept.

Why the empty-result baseline matters

The second baseline — L(e(c, ε)), loss with the API call but no result — catches a subtle failure mode. Sometimes the act of inserting a call itself helps, even without a result. For example, inserting [Calculator(3847*291) → ] might prime the model to guess a plausible number. We don't want to keep calls that help only because of this priming effect — the tool must contribute actual information.

By requiring the call-with-result to beat the call-without-result, we ensure that the result specifically is what provides value, not just the format of the call.

The threshold τ

Schick et al. experimented with several values of τ:

τ | Effect | Calls Kept
0.0 | Keep any call that doesn't hurt. Too permissive: floods text with marginal calls. | Many
0.5 | Moderate filtering. Some borderline calls survive. | Medium
1.0 | Default. Only clearly helpful calls survive. Good balance. | Fewer (varies widely by tool)
2.0 | Very strict. Only the most impactful calls survive. Too few training examples. | Few
The optimal τ balances precision (only helpful calls) against recall (enough training examples). Too low and the model learns to insert unnecessary calls. Too high and it doesn't learn to call tools at all.
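The trade-off is easy to see on a toy example. The gains below are invented numbers, not measurements from the paper; the sketch only shows how raising the threshold shrinks the set of surviving calls.

```python
from typing import Dict, List


def surviving_calls(gains: Dict[str, float], tau: float) -> List[str]:
    """Return the candidate calls whose loss reduction meets the threshold."""
    return [call for call, gain in gains.items() if gain >= tau]


# Hypothetical loss reductions for five candidate calls.
toy_gains = {
    "Calculator(3847*291)": 3.2,
    "QA(first US president)": 1.4,
    "WikiSearch(water)": 0.6,
    "Calculator(1+1)": 0.0,
    "Calendar() after the date": -0.3,
}

counts = {tau: len(surviving_calls(toy_gains, tau)) for tau in (0.0, 0.5, 1.0, 2.0)}
for tau, count in counts.items():
    print(f"tau = {tau}: {count} of {len(toy_gains)} calls kept")
```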

Perplexity Filter Simulator

Drag the τ threshold slider and watch which API calls survive filtering. Each bar shows the loss reduction L⁻ − L⁺ for a candidate call. Calls above the threshold (green) are kept; calls below (red) are discarded.

A candidate [Calculator(1+1) → 2] is inserted before the text "equals 2." The model already predicts "2" with 95% confidence without any call. What happens during filtering?

Chapter 4: The Five Tools

Toolformer demonstrates its pipeline with five specific tools. Each has a different input format, a different external system behind it, and a different kind of gap it fills. Understanding when the model chooses each one — and when it doesn't — reveals a lot about what LLMs actually struggle with.

1. Calculator

The simplest tool. Takes a mathematical expression as a string, evaluates it, returns the numeric result.

format
[Calculator(expression) → result]

# Example: "The area is [Calculator(3.14 * 12.5 * 12.5) → 490.625] square meters."

The model learns to invoke the calculator for multi-digit multiplication, division, and expressions with multiple operations. It does not learn to call it for trivial arithmetic ("2+2") because those calls fail the perplexity filter — the model already knows the answer.

Emergent behavior: Toolformer learns to call the calculator before the number it will produce, not after. This makes sense — the call is useful only if the result arrives in time for the model to condition on it. Calls inserted after the number are filtered out because they don't reduce perplexity on subsequent tokens.
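A calculator tool of this kind does not need much code. The sketch below is one possible implementation, not the paper's: it parses the expression with Python's `ast` module and evaluates only the four basic operations, so arbitrary code in the expression is rejected. The function name and the rounding to three decimals are assumptions.

```python
import ast
import operator

_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str, ndigits: int = 3) -> str:
    """Evaluate +, -, *, / on numeric literals and return the result as text."""

    def evaluate(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
            return _OPERATORS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    value = evaluate(ast.parse(expression, mode="eval").body)
    return str(round(value, ndigits))


print(calculator("3847 * 291"))
print(calculator("3.14 * 12.5 * 12.5"))
```

Walking the syntax tree instead of calling `eval` keeps the tool from executing anything other than arithmetic, which matters when the expressions come from model output.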

2. Question Answering (Atlas)

Backed by Atlas, a retrieval-augmented QA system. Takes a natural language question, returns a short answer.

format
[QA("question") → answer]

# Example: "The first president [QA("Who was the first US president?") → George Washington] served two terms."

The QA tool handles factual recall that the model is uncertain about. The model learns to phrase questions in natural language and to position the call right before it needs the factual information.

3. Wikipedia Search

Backed by a retrieval system over a Wikipedia dump, this tool takes a search term and returns short text snippets from matching articles. Useful for providing context about entities, events, and concepts.

format
[WikiSearch("query") → snippet]

# Example: "[WikiSearch("Solvay Conference") → The Solvay Conferences...] brought together the greatest physicists."

4. Machine Translation

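Uses a neural MT system (600M parameter model). In this guide the call is written with the source text, source language, and target language; in the paper the API takes only the text, detects the source language automatically, and always translates into English.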

format
[MT("text", source_lang, target_lang) → translation]

# Example: "The motto [MT("E pluribus unum", Latin, English) → Out of many, one] appears on..."

5. Calendar

The most minimal API: takes no arguments, returns today's date. This is the one tool the model cannot replicate from its weights, because it has no information about the current date.

format
[Calendar() → Today is Wednesday, February 1, 2023.]
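A calendar tool in this style is close to a one-liner. The sketch below assumes an English locale for weekday and month names; the function name and the optional `today` argument (useful for testing) are illustrative.

```python
from datetime import date
from typing import Optional


def calendar(today: Optional[date] = None) -> str:
    """Return the current date as a sentence the model can read."""
    today = today or date.today()
    # Build the day and year by hand to avoid the zero padding of %d.
    return f"Today is {today:%A}, {today:%B} {today.day}, {today.year}."


print(calendar(date(2023, 2, 1)))
```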

Tool usage patterns the model discovers: After fine-tuning, the model uses tools in predictable patterns. Calculator calls cluster around numerical text. QA calls appear before factual claims about specific entities. Wikipedia searches precede passages about obscure topics. Translation calls appear around foreign-language text. Calendar calls appear in date-sensitive contexts ("this week," "currently," "today"). The model never had to be told these patterns; it discovered them through the perplexity filter.

Where does the model call each tool?

The table summarizes the contexts in which each tool's calls tend to appear. The number of calls that survive filtering is far from uniform: Calculator and Machine Translation calls turn out to be useful in comparatively few cases.

Tool | Usage Pattern
Calculator | Numbers, arithmetic, percentages, unit conversions
QA | Named entities, historical facts, specific claims
WikiSearch | Obscure entities, event details, definitions
MT | Foreign language text, etymologies, translations
Calendar | Date-relative phrases: "today," "this week," "currently"

Tool Selection Patterns

Click on each text snippet to see which tool Toolformer would invoke. The highlight shows the position where the API call is inserted, and the tool badge shows which API is called.

The model generates: "In the 2022 midterms, approximately [Calculator(435-213)] seats went to Republicans." Why does the perplexity filter keep this Calculator call?

Chapter 5: Results

Does self-supervised tool learning actually work? The results are striking — and reveal both the power and the limits of the approach.

Downstream task performance

Toolformer (based on GPT-J, 6.7B parameters) was evaluated on multiple benchmarks and compared against much larger models:

Benchmark | Task | GPT-J (6.7B) | Toolformer (6.7B) | OPT (66B) | GPT-3 (175B)
SQuAD | QA | 55.2 | 69.1 | 56.7 | 52.6
GoogleRE | Knowledge | 8.3 | 38.6 | 7.3 | 6.3
T-REx | Knowledge | 37.4 | 53.1 | 39.3 | 38.8
Math (ASDiv) | Arithmetic | 29.5 | 44.0 | 15.6 | 52.9
MLQA | Multilingual QA | 24.7 | 33.5 | 23.8 | 28.4
TempLAMA | Temporal QA | 24.5 | 30.4 | 22.1 | 23.6

The pattern is clear. On tasks that benefit from tool use — factual QA, knowledge retrieval, arithmetic, temporal questions — Toolformer (6.7B) outperforms OPT (66B), a model 10x its size. On some benchmarks, it's competitive with GPT-3 (175B), a model 26x its size.

A 6.7B model with tools beats a 66B model without them. This is the headline result. It demonstrates that tool use is not just a nice-to-have — it's a multiplicative capability. Rather than scaling model parameters to memorize more facts (expensive, diminishing returns), you can keep the model small and give it access to external knowledge. The tools effectively act as infinite additional "parameters" that are always up-to-date.

Where tools help most

The gains are not uniform. Tools help most when the task requires information the model doesn't have:

Large gains (tools help a lot):

  • Factual recall of specific entities (GoogleRE: +30.3 points)
  • Arithmetic with large numbers (ASDiv: +14.5 points)
  • Temporal/date-sensitive questions (TempLAMA: +5.9 points)
  • Multilingual QA requiring translation (MLQA: +8.8 points)

Modest or no gains (tools don't help):

  • Commonsense reasoning (no tool for "common sense")
  • Reading comprehension of provided text (answer is already in context)
  • Tasks requiring multi-step reasoning (only single calls supported)
  • Creative generation (no tool for creativity)
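The point gains quoted in the first list follow directly from the GPT-J and Toolformer columns of the results table in this chapter. The short check below recomputes them from those figures, along with the size ratios mentioned earlier; the variable names are illustrative.

```python
# (GPT-J score, Toolformer score) as listed in this chapter's results table.
scores = {
    "GoogleRE": (8.3, 38.6),
    "ASDiv": (29.5, 44.0),
    "TempLAMA": (24.5, 30.4),
    "MLQA": (24.7, 33.5),
}

gains = {name: round(toolformer - gpt_j, 1)
         for name, (gpt_j, toolformer) in scores.items()}
for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name}: +{gain} points")

# Parameter-count ratios relative to the 6.7B model.
opt_ratio = round(66 / 6.7, 1)
gpt3_ratio = round(175 / 6.7, 1)
print(f"OPT is about {opt_ratio}x larger, GPT-3 about {gpt3_ratio}x larger")
```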

Does language modeling quality degrade?

A crucial concern: by fine-tuning the model to insert API calls, does its general language modeling ability suffer? Schick et al. checked this by measuring perplexity on standard language modeling benchmarks. The answer: no degradation. Toolformer's perplexity on standard text is essentially unchanged from GPT-J. The model learns tool use as an additional capability, not a replacement for existing ones.

This happens because the filtered API calls are mixed back into the original (unmodified) training data. The vast majority of training examples have no API calls — the model sees tool-augmented text only where tools are genuinely helpful. This prevents catastrophic forgetting of general language capabilities.
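The mixing step can be pictured as a simple substitution: texts for which a call survived filtering are swapped for their annotated versions, and everything else is kept unchanged. The sketch below uses a made-up three-sentence corpus; the function name and the index-based mapping are assumptions for illustration.

```python
from typing import Dict, List


def build_finetuning_set(texts: List[str], annotated: Dict[int, str]) -> List[str]:
    """Use the annotated version of a text where one exists, else the original."""
    return [annotated.get(i, text) for i, text in enumerate(texts)]


texts = [
    "The sky is blue.",
    "3847 times 291 is 1119477.",
    "Cats are mammals.",
]
# Only the second text had a call that survived the filter (hypothetical).
annotated = {1: "3847 times 291 is [Calculator(3847 * 291) → 1119477] 1119477."}

dataset = build_finetuning_set(texts, annotated)
share_with_calls = sum("[" in text for text in dataset) / len(dataset)
print(f"{share_with_calls:.0%} of training texts contain an API call")
```

Because the dataset has the same texts as before, with calls added only where they helped, most of what the model sees during fine-tuning is ordinary language modeling data.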

Performance Comparison

Compare Toolformer (6.7B) against larger models across benchmarks. Toggle models on/off to see the comparison. Notice how a small model with tools competes with models 10-26x larger.

Why does Toolformer (6.7B) outperform OPT (66B) on factual QA despite being 10x smaller?

Chapter 6: Limitations

Toolformer is a proof of concept, not the final word. Its limitations reveal the hardest unsolved problems in LLM tool use — problems that subsequent work (like Gorilla, ToolLLM, and modern agent frameworks) is still working to solve.

Limitation 1: Single, non-chained calls

Toolformer can insert one API call at a time. It cannot chain calls — using the output of one tool as the input to another. Real-world tool use often requires composition:

what Toolformer can't do
# Multi-step reasoning requiring chained tools:
"How many days until the next US presidential inauguration?"

# This requires:
# 1. [Calendar() → Today is Feb 1, 2023]
# 2. [QA("When is the next US inauguration?") → January 20, 2025]
# 3. [Calculator(days_between(Feb 1 2023, Jan 20 2025)) → 719]
# Toolformer can only do ONE of these, not chain them.

Why chaining is hard: The self-supervised pipeline generates and filters calls independently. To learn chaining, the model would need to generate sequences of calls, execute them in order, and evaluate the combined perplexity improvement. The search space explodes: with 5 tools and 3 steps, there are 5³ = 125 possible chains per position, each requiring execution and evaluation.
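Both numbers in this section can be checked with the standard library. The sketch below enumerates every ordered three-step chain over the five tool names and recomputes the day count from the inauguration example; it counts tool orderings only, before any choice of arguments.

```python
from datetime import date
from itertools import product

TOOLS = ["Calculator", "QA", "WikiSearch", "MT", "Calendar"]

# Every ordered sequence of three tool choices, repeats allowed.
chains = list(product(TOOLS, repeat=3))
print(len(chains), "possible three-step chains, e.g.", chains[0])

# The final step of the inauguration example: days from Feb 1, 2023 to Jan 20, 2025.
days_until = (date(2025, 1, 20) - date(2023, 2, 1)).days
print(days_until, "days")
```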

Limitation 2: Simple tool interfaces

All five tools have simple input/output formats: a string in, a string out. Real-world APIs are far more complex: they have authentication, multiple endpoints, structured JSON responses, pagination, error handling, rate limits, and stateful sessions. Toolformer's format can't represent:

Real-World Complexity | Why Toolformer Can't Handle It
Multi-turn API sessions | Each call is independent, no state between calls
Structured JSON outputs | Results are flattened to a single text string
Error handling and retries | No mechanism to detect and recover from API failures
Authentication | No concept of API keys, OAuth, or tokens
Conditional logic | "If search returns X, then calculate Y" is impossible

Limitation 3: No interactive tool use

Toolformer decides to use tools during text generation. It cannot use tools interactively — pausing generation, examining a result, deciding whether to try a different query, or refining its approach based on what the tool returns. Each tool call is fire-and-forget.

Limitation 4: Fixed tool set

Adding a new tool requires re-running the entire annotation pipeline: writing few-shot examples, sampling candidates, executing calls, filtering, and re-fine-tuning. You can't add tools dynamically at inference time — the model must be trained on examples of each tool's usage.

Limitation 5: Only evaluated on GPT-J

The paper only evaluates with GPT-J (6.7B). It remains unclear how the approach scales — do larger models need fewer tool call examples? Do they learn more complex tool usage patterns? Do they spontaneously learn to chain calls? These questions were unanswered by the paper (though subsequent work has partially addressed them).

The gap between Toolformer and modern agents: Today's tool-using systems (GPT-4 function calling, Claude tool use, LangChain agents) solve several of these limitations. They support dynamic tool sets, chained multi-step calls, structured I/O, and interactive refinement. But they achieve this through prompting and API design, not through self-supervised learning. The dream of Toolformer — a model that teaches itself which tools to use — remains only partially realized.

Single vs. Chained Tool Calls

See how a single-call system fails on multi-step problems. Click through steps to trace a chained query that requires Calculator + QA + Calendar. The red path shows what Toolformer can do; the green path shows what chaining would enable.

A user asks: "What's the population of France divided by the area of Germany?" Why can't Toolformer answer this?

Chapter 7: Toolformer Lab

Now it's your turn. This interactive lab lets you walk through Toolformer's complete pipeline: write a sentence, see where the model proposes API calls, watch the calls execute, and see which survive the perplexity filter.

Toolformer Pipeline Simulator

Select a text scenario, then step through the full Toolformer pipeline: candidate generation, execution, filtering, and final annotated text. Adjust the τ threshold to see how it changes which calls survive.

Tool Call Success Rate

This chart shows the fraction of candidate API calls that survive the perplexity filter for each tool, at different τ values. Drag the threshold to see how filtering aggressiveness affects each tool differently.

Chapter 8: Connections

Toolformer sits at a pivotal moment in the evolution of LLM tool use. It proved that self-supervised tool learning is possible. Everything that came after either builds on or reacts to its ideas.

What Toolformer started

System | Year | Relation to Toolformer
ReAct | 2022 | Interleaves reasoning and tool calls; first posted shortly before Toolformer. Handles multi-step tool use via prompting, not self-supervised learning.
Gorilla | 2023 | Scales tool learning to 1,600+ real APIs using a retriever to find relevant API docs. Addresses the "fixed tool set" limitation.
ToolLLM | 2023 | 16,000+ real-world APIs with multi-step planning. Uses a decision tree (DFSDT) for tool call chains.
GPT-4 Function Calling | 2023 | Production-grade tool use. Dynamic tool definitions at inference time, structured JSON I/O, parallel calls. Likely trained with supervised data, not self-supervised.
Claude Tool Use | 2024 | Similar to GPT-4 function calling. Tool schemas defined per-request. No pre-training pipeline needed.
LangChain / LlamaIndex Agents | 2023+ | Orchestration frameworks that solve chaining, error handling, and multi-tool composition at the application layer, above the model.

The big picture

Tool use in LLMs has evolved along two parallel tracks:

Track 1: Model-level tool learning (Toolformer's approach)

Teach the model itself to generate tool calls. The model decides when, which, and how to call tools. Training-time learning. Requires re-training for new tools. Self-supervised signal (perplexity).

Track 2: System-level tool orchestration (Modern agents)

Keep the model as a general reasoner. Define tools as external schemas. Let the model see tool descriptions at inference time and generate structured calls. No retraining needed. Human-designed interfaces.
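To make the contrast concrete, here is a minimal sketch of the system-level idea: tools live in a registry outside the model, and a structured call is dispatched by name. The JSON shape, the registry, and the function name are hypothetical and do not reproduce any vendor's actual format.

```python
import json
from typing import Any, Callable, Dict


def run_tool_call(call_json: str, registry: Dict[str, Callable[..., Any]]) -> Any:
    """Parse a structured call and dispatch it to the matching tool."""
    call = json.loads(call_json)
    handler = registry[call["name"]]
    return handler(**call.get("arguments", {}))


# Tools are plain functions registered by name.
registry: Dict[str, Callable[..., Any]] = {"multiply": lambda a, b: a * b}
product_result = run_tool_call(
    '{"name": "multiply", "arguments": {"a": 3847, "b": 291}}', registry
)
print(product_result)

# A new tool can be added at run time, with no retraining step.
registry["shout"] = lambda text: text.upper()
shout_result = run_tool_call('{"name": "shout", "arguments": {"text": "hi"}}', registry)
print(shout_result)
```

The design choice to keep tools outside the model is what removes the retraining step: the registry can change between requests while the model stays fixed.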

Track 2 has won commercially — it's what GPT-4, Claude, and Gemini use. But Track 1's insight remains powerful: a model that can discover when tools are useful, without being told, has a deeper kind of tool competence. The ideal system would combine both: self-supervised discovery of when tools help (Track 1) with flexible, dynamic tool definition at inference time (Track 2).

Toolformer's lasting contribution: It's not the specific tools, the GPT-J base model, or even the pipeline. It's the idea that perplexity improvement is a universal, unsupervised signal for tool usefulness. Any time you can measure whether an external call made the model's predictions better, you can self-supervise tool learning. This principle extends far beyond the five tools in the paper.

Limitations revisited

The problems Toolformer couldn't solve — chaining, dynamic tools, interactive refinement, complex I/O — are precisely the problems that define the current frontier of LLM agents. Understanding Toolformer's limitations is understanding the roadmap for modern AI agent research.

"The real power of a mind is not what it can do, but what it knows to delegate."

What is the key difference between Toolformer's approach (Track 1) and modern function-calling systems like GPT-4 (Track 2)?