Siru Ouyang, Jun Yan, Yanfei Chen, Rujun Han, Zifeng Wang et al. (UIUC + Google Cloud AI Research + MIT) — arXiv 2605.06614, 2026

SkillOS: Learning Skill Curation for Self-Evolving Agents

LLM agents solve tasks one at a time, throwing away everything they learned. SkillOS trains a dedicated curator to build, refine, and prune a reusable skill library via RL — turning one-off problem solvers into agents that get better with every interaction.

Prerequisites: LLM agent basics (ReAct, tool calls) + RL intuition (rewards, policy optimization) + Basic probability. That's it.
10 chapters · 10+ simulations · 0 assumed knowledge

Chapter 0: The One-Off Problem

Imagine you're a new hire at a help desk. Your first week, a customer asks how to reset their password. You figure it out from scratch — reading docs, trying things, eventually solving it. Twenty minutes later, another customer asks the same question. You start from scratch again. No notes, no checklist, no memory of what just worked.

That is how virtually every LLM agent operates today. Each task arrives. The agent reasons, takes actions, maybe succeeds. Then the interaction ends, the context window is discarded, and the next task begins with a completely blank slate.

This is not a resource limitation — it is an architectural choice. Current agents are stateless by design. They have no mechanism to extract lessons from Task 1 and apply them to Task 47. Every problem is solved as if it were the very first problem the agent has ever encountered.

The cost of starting from scratch

This one-off pattern wastes effort in three concrete ways:

Redundant exploration. If an agent discovers that "navigate to desk, then examine mug under desklamp" is the correct strategy for inspection tasks in a household environment, that insight evaporates. The next inspection task triggers the same trial-and-error search. In ALFWorld (a text-based household benchmark), agents without memory take 21.1 interaction steps on average per task. With SkillOS's learned skills, that drops to 18.9 — a 10.4% reduction.

Repeated failures. Worse than re-exploring, stateless agents repeat the same mistakes. If a particular approach fails on Task 5, nothing prevents the agent from trying the identical approach on Task 32. There is no procedural memory to encode "when NOT to do X."

No capability growth. A human help desk agent after 1,000 tickets is dramatically better than after their first. A stateless LLM agent after 1,000 tasks is identical to the one that started. The curve is flat.

Simulation: The One-Off Agent: No Memory Persists. An agent solves sequential tasks, starting from scratch each time. The blue bar shows current knowledge and resets to zero between tasks.

The canvas shows the tragedy clearly. Every colored block is exploration effort. Every time the bar resets to zero, that effort is lost. A self-evolving agent would carry forward the insight from each task, building a rising curve of capability instead of a flat line.

What would a self-evolving agent look like?

The ideal is an agent that maintains a growing library of reusable skills — Markdown files containing workflows, constraints, and heuristics extracted from past experience. When a new task arrives, the agent retrieves relevant skills from this library and executes more efficiently. After each task, it updates the library with new lessons.

The paper by Ouyang et al. (2026) proposes SkillOS, an RL training recipe that teaches an 8B-parameter model to perform this skill curation — deciding what to insert, what to update, and what to delete from the skill library. The trained curator outperforms even Gemini-2.5-Pro used directly as a curator, demonstrating that targeted training of a small model can beat raw scale.

The core insight: The bottleneck for self-evolving agents is not skill execution (applying skills to tasks) — it is skill curation (deciding what to extract, when to update, and when to delete). SkillOS isolates this curation responsibility into a trainable module and uses RL to optimize it.
An LLM agent solves 100 sequential tasks without any memory mechanism. On task #100, how does its performance compare to task #1?

Chapter 1: Skills as Memory

If we want agents to accumulate experience, we need a format for storing it. Not raw trajectories — those are too long and too specific. Not abstract summaries — those lose the actionable detail. We need something in between: structured, reusable, retrievable.

SkillOS follows a design inspired by Anthropic's SKILL.md format: each skill is a single Markdown file stored in an external repository. The file has two parts:

1. YAML frontmatter — specifies the skill name and a natural-language description of when to use it. This is what the retrieval system matches against.

2. Markdown body — contains the executable knowledge: workflows, constraints, prerequisites, and heuristics. The paper suggests three sections as a starting point, but allows the curator to create additional sections as it learns.

Anatomy of a skill file

Here is a real skill that SkillOS's trained curator produced for ALFWorld inspection tasks:

```markdown
---
name: Use light source to examine
description: Ensure object is examined
  under proper light source by navigating
  to the correct lamp location first
---

# Workflow
1. Navigate to the light source (desklamp,
   floorlamp) location first
2. Pick up the target object
3. Use the "examine" action with the light
   source, not the object

# When NOT to Use
- If the light source is not in the
  current room
- If the object is already being examined

# Prerequisite Constraints
- Agent must have free hands
- Light source must be turned on
```

Notice three critical properties of this format:

Retrievable. The YAML frontmatter contains a description that can be matched against incoming tasks using BM25 (a standard text retrieval algorithm). When a new task says "examine the mug under desklamp," BM25 matches it against "Use light source to examine" and retrieves this skill.

Actionable. The workflow section gives step-by-step instructions the executor can follow directly. This is not abstract wisdom ("inspection is important") — it is a concrete recipe ("navigate to lamp first, then pick up, then examine").

Guarded. The "When NOT to Use" section prevents misapplication. This is crucial: a skill that fires on the wrong task actively hurts performance by misleading the executor.

Why Markdown? LLMs already understand Markdown natively. No special parsing, no embedding conversion, no schema definition. The skill files are injected directly into the executor's prompt context. The model reads them like instructions from a colleague.
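To make the retrieval step concrete, here is a minimal sketch of parsing a skill file's frontmatter and scoring skills against a task with a hand-written Okapi BM25. The helper names (`parse_skill`, `bm25_scores`), the parameters `k1` and `b`, the choice to index name plus description, and the second "heat" skill are all illustrative assumptions, not details from the paper.

```python
import math
import re
from collections import Counter

SKILL = """---
name: Use light source to examine
description: Ensure object is examined
  under proper light source by navigating
  to the correct lamp location first
---

# Workflow
1. Navigate to the light source location first
"""

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def parse_skill(markdown_text):
    """Split a skill file into (frontmatter dict, body).

    Handles only flat `key: value` pairs with indented continuation
    lines; this is not a full YAML parser.
    """
    _, front, body = markdown_text.split("---", 2)
    meta, key = {}, None
    for line in front.strip().splitlines():
        if line.startswith(" ") and key is not None:
            meta[key] += " " + line.strip()   # continuation of previous value
        else:
            key, value = line.split(":", 1)
            key = key.strip()
            meta[key] = value.strip()
    return meta, body.strip()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of `query` against each string in `docs`."""
    if not docs:
        return []
    doc_tokens = [tokenize(d) for d in docs]
    n_docs = len(doc_tokens)
    avg_len = sum(len(t) for t in doc_tokens) / n_docs
    # Document frequency: in how many docs each term appears.
    df = Counter(term for toks in doc_tokens for term in set(toks))
    scores = []
    for toks in doc_tokens:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

meta, body = parse_skill(SKILL)
docs = [
    meta["name"] + " " + meta["description"],
    # Hypothetical second skill, only here to give BM25 something to rank.
    "Heat objects using microwave Place the object in the microwave to heat it",
]
scores = bm25_scores("examine the mug under desklamp", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Here `best` points at the light-source skill, because the query shares "examine" and "under" with its description while the heating skill only shares the common word "the".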
Simulation: Skill File Anatomy. A structured skill file with its two components, the YAML frontmatter and the Markdown body.

The SkillRepo as a whole

The complete skill collection is called the SkillRepo, denoted $S_t$ at time step $t$. It is simply a set of $N_t$ Markdown files:

$$S_t = \{ s_1, s_2, \dots, s_{N_t} \}$$

The SkillRepo starts empty ($S_0 = \{\}$) and grows as the curator processes task trajectories. Three operations modify it:

| Operation | Function call | Effect |
| --- | --- | --- |
| Insert | insert_skill(name, content) | Creates a new .md file in the repo |
| Update | update_skill(name, content) | Replaces the content of an existing file |
| Delete | delete_skill(name) | Removes a file from the repo |

These are implemented as function calls — the curator generates structured JSON that specifies the operation, the target file, and (for insert/update) the new content. The system executes them against the SkillRepo, exactly like file I/O operations in an operating system. This is where the name "SkillOS" comes from.
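A minimal sketch of how such calls could be executed against an in-memory repo is shown below. The JSON shape (`{"name": ..., "arguments": {...}}`), the class and function names, and the rule that inserting over an existing skill counts as invalid are assumptions made for illustration; the text only specifies the three operations and their arguments.

```python
import json

class SkillRepo:
    """In-memory stand-in for a directory of skill Markdown files."""

    def __init__(self):
        self.files = {}  # skill name -> Markdown content

    def apply(self, call):
        """Execute one function call; return True if it was valid."""
        if not isinstance(call, dict):
            return False
        op = call.get("name")
        args = call.get("arguments")
        if not isinstance(args, dict):
            return False
        name, content = args.get("name"), args.get("content")
        if not isinstance(name, str) or not name:
            return False
        if op == "insert_skill" and name not in self.files and isinstance(content, str):
            self.files[name] = content
        elif op == "update_skill" and name in self.files and isinstance(content, str):
            self.files[name] = content
        elif op == "delete_skill" and name in self.files:
            del self.files[name]
        else:
            return False  # unknown op, missing target, or bad arguments
        return True

def apply_ops(repo, raw_json):
    """Parse the curator's JSON output and apply each call in order.

    Returns one validity flag per call; an unparseable output yields [].
    """
    try:
        calls = json.loads(raw_json)
    except json.JSONDecodeError:
        return []
    if not isinstance(calls, list):
        return []
    return [repo.apply(call) for call in calls]

repo = SkillRepo()
flags = apply_ops(repo, json.dumps([
    {"name": "insert_skill", "arguments": {"name": "heat_objects", "content": "v1"}},
    {"name": "update_skill", "arguments": {"name": "heat_objects", "content": "v2"}},
    {"name": "update_skill", "arguments": {"name": "missing", "content": "x"}},
    {"name": "delete_skill", "arguments": {"name": "heat_objects"}},
]))
```

The per-call flags make it easy to see which operations would fail, such as the update that targets a file that does not exist.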

The OS analogy: Just as an operating system manages files on disk — creating, modifying, deleting — SkillOS manages skill files in a repository. The curator is the process making system calls. The SkillRepo is the filesystem. The executor is the application that reads from it.
A skill file in SkillOS has two main components. What are they?

Chapter 2: The Curation Bottleneck

You might think that once we have the skill format and the three operations (insert, update, delete), the problem is solved. Just prompt the LLM: "Given this task trajectory, produce skill operations." Systems like ReasoningBank and MemP do exactly this — they use heuristic rules or prompted LLMs to manage memory.

It does not work well. The fundamental problem is that curation quality has delayed, indirect feedback.

The delayed feedback problem

Consider a concrete scenario. The agent just completed Task 12 in ALFWorld — "Put a heated egg on the counter." The curator observes the trajectory and decides to insert a skill about heating objects in the microwave. Was this a good decision?

We cannot know until Task 37, when another heating task arrives and the executor retrieves and applies this skill. If Task 37 succeeds faster because of the skill, the insert was good. If the skill's instructions were slightly wrong and caused the executor to fail, the insert was bad. Either way, the feedback arrives 25 tasks later and is mixed with dozens of other confounding factors.

This is fundamentally different from the executor's learning signal. The executor gets reward immediately: "Did you complete the current task? Yes/No." The curator's reward is: "Did the skill you wrote 25 tasks ago help a different task that happened to need it?" That signal is delayed, sparse, and noisy.

The credit assignment problem: When the executor succeeds on Task 37, who gets credit? The executor that took the right actions? The retriever that found the right skill? The curator that wrote the skill 25 tasks ago? All three contributed, but the curator's contribution is the hardest to isolate.

Why heuristics fail

Existing approaches use fixed rules for skill curation. Some examples from the literature:

"Always insert after a successful task." This floods the repo with redundant, overlapping skills. If ten similar tasks all succeed, you get ten nearly-identical skill files that confuse retrieval.

"Delete skills that haven't been used in K tasks." This kills skills that are rare but crucial. A skill for "heating objects in microwave" might only trigger once every 20 tasks, but when it fires, it is essential.

"Update a skill if the executor failed while using it." This conflates skill quality with task difficulty. The executor might have failed because the task was genuinely hard, not because the skill was wrong.

The common thread: heuristics make local decisions without downstream performance feedback. They cannot learn that "inserting a concise skill with a 'When NOT to Use' section leads to 12% higher success on future related tasks" because they never see the downstream outcome.

Insert is easy. Update and delete are hard.

The paper makes an important empirical observation: untrained curators (SkillOS-base) overwhelmingly choose insert. Figure 4 in the paper shows that at the start of training, insert accounts for nearly 100% of all operations. The curator just blindly adds new skills after every task.

This makes intuitive sense. Insert is the "safe" operation — you are adding information, not destroying it. Update requires judging which part of an existing skill is wrong and how to fix it. Delete requires judging that a skill is actively harmful or redundant — a harder call than "this seems useful."

RL training shifts this distribution dramatically. By the end of training, update operations account for a growing fraction (from ~0% to ~25%), and delete begins appearing as well. The curator learns that skill quality matters more than skill quantity.

Simulation: Curation Decision Space. The three curation operations and the failure modes that make each one hard.

The training signal gap: Prior RL work on skills (SkillRL, D2Skill) focuses on teaching agents to use skills, not to curate them. ARISE trains retrieval + execution jointly but uses heuristics for management. SkillOS is the first to train the curation policy end-to-end via RL with long-horizon feedback.
Why is skill curation harder to learn than skill execution?

Chapter 3: SkillOS Architecture

SkillOS's key architectural insight is separation of concerns. Instead of training one monolithic model to do everything — solve tasks, extract lessons, manage the skill library — SkillOS splits the system into two independent modules with a shared data structure between them.

The three components

1. Agent Executor ($\pi_L$): a frozen LLM that solves tasks. Given a task description $x_t$, the current environment observation $o_t$, and a set of retrieved skills $\tilde{S}_t$, the executor produces actions:

$$a \sim \pi_L(\cdot \mid x_t, o_t, \tilde{S}_t)$$

The executor is frozen throughout training — its weights never change. This is deliberate: we want to test whether better skills improve performance, not whether a better executor does. The executor can be any model: Qwen3-8B, Qwen3-32B, or Gemini-2.5-Pro. SkillOS's trained curator generalizes across all of them.

2. Skill Curator ($\pi_S$): a trainable LLM (Qwen3-8B base) that manages the SkillRepo. After the executor completes a task, the curator observes the trajectory $\xi_t$, the binary correctness signal $\mathbb{1}_{\xi_t}$ from the self-judge, and the retrieved skills $\tilde{S}_t$.

It then generates a sequence of curation operations:

$$c_t = (u_1, \dots, u_M) \sim \pi_S(\cdot \mid \xi_t, \mathbb{1}_{\xi_t}, \tilde{S}_t)$$

Each $u_m$ is a function call: insert_skill, update_skill, or delete_skill. These are structured JSON outputs that the system executes against the SkillRepo.

3. SkillRepo ($S_t$): the external skill repository. A collection of Markdown files that grows, changes, and (sometimes) shrinks as the curator operates. Skills are retrieved via BM25 matching against the task description.

Why freeze the executor? If we trained both the executor and curator jointly, we could never tell whether improvements came from better skills or a better executor. By freezing the executor, any performance gain is directly attributable to skill quality. This is the key insight for learning curation: ground curation in downstream executor performance.

The streaming loop

At deployment, SkillOS processes a stream of tasks $D = \{x_1, x_2, \dots, x_T\}$ sequentially. For each task:

1. Retrieve: BM25 matches the task description $x_t$ against the SkillRepo $S_t$ to get relevant skills $\tilde{S}_t$.
2. Execute: the frozen executor $\pi_L$ solves the task using the retrieved skills, producing trajectory $\xi_t$.
3. Curate: the trained curator $\pi_S$ observes the trajectory and its result, then generates insert/update/delete operations.
4. Update: the operations are applied to the SkillRepo: $S_{t+1} = \text{ApplyOps}(S_t, c_t)$.

The loop then repeats with the next task.

This forms a closed loop: the executor's performance depends on the skills the curator produced, and the curator learns from the executor's subsequent performance. The SkillRepo is the shared memory that mediates between them.
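The loop can be sketched in a few lines of Python. Everything below is a toy under stated assumptions: `run_stream` and the `toy_*` functions are hypothetical stand-ins for the retriever, frozen executor, self-judge, curator, and ApplyOps, and the toy judge simply declares success whenever a skill was retrieved.

```python
def run_stream(tasks, retrieve, execute, judge, curate, apply_ops):
    """Process tasks in order, threading the SkillRepo through each step."""
    repo = {}                                       # S_0 starts empty
    outcomes = []
    for task in tasks:
        skills = retrieve(task, repo)               # 1. Retrieve
        trajectory = execute(task, skills)          # 2. Execute (frozen)
        success = judge(trajectory)                 #    self-judged result
        ops = curate(trajectory, success, skills)   # 3. Curate
        repo = apply_ops(repo, ops)                 # 4. Update
        outcomes.append(success)
    return repo, outcomes

# --- Toy stand-ins, purely illustrative ---
def toy_retrieve(task, repo):
    # A skill matches when its name appears as a word in the task.
    return [name for name in repo if name in task.split()]

def toy_execute(task, skills):
    return {"task": task, "used": skills}

def toy_judge(trajectory):
    return bool(trajectory["used"])   # toy rule: succeed only with a skill

def toy_curate(trajectory, success, skills):
    if skills:
        return []                     # a skill already covers this task
    verb = trajectory["task"].split()[0]
    return [("insert", verb, f"How to {verb} an object")]

def toy_apply_ops(repo, ops):
    new_repo = dict(repo)
    for op, name, content in ops:
        if op == "insert":
            new_repo[name] = content
    return new_repo

repo, outcomes = run_stream(
    ["heat egg", "heat mug", "clean mug"],
    toy_retrieve, toy_execute, toy_judge, toy_curate, toy_apply_ops,
)
```

In this toy run the second task succeeds only because the first task left a "heat" skill behind, which is the dependency the closed loop is meant to capture.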

Simulation: SkillOS Streaming Loop. The full SkillOS pipeline processing a stream of tasks, with a "Without Curation" toggle that shows a no-memory baseline.

Data flow: what goes where

Let's trace the exact data flowing through the system for one task:

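| Component | Input | Output | Trainable? |
| --- | --- | --- | --- |
| BM25 Retriever | Task description $x_t$ + SkillRepo $S_t$ | Top-k skills $\tilde{S}_t$ | No (fixed algorithm) |
| Executor $\pi_L$ | $x_t$ + observations + $\tilde{S}_t$ | Trajectory $\xi_t$ | No (frozen) |
| Self-judge | Trajectory $\xi_t$ | Binary correctness $\mathbb{1}_{\xi_t}$ | No (LLM-as-judge) |
| Curator $\pi_S$ | $\xi_t$ + $\mathbb{1}_{\xi_t}$ + $\tilde{S}_t$ | Operations $c_t$ | Yes (RL-trained) |
| ApplyOps | $S_t$ + $c_t$ | $S_{t+1}$ | No (deterministic) |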

Only one component is trainable: the curator. Everything else is fixed. This tight bottleneck means all learning signal flows through a single policy, making optimization tractable.

In SkillOS, which component is trained via RL?

Chapter 4: Grouped Task Streams

We established that curation feedback is delayed and indirect. The curator inserts a skill now, and its value is revealed only when a related future task benefits from it. So how do we construct training data that provides this feedback?

SkillOS's first key design: grouped task streams. Instead of training on random sequences of tasks, SkillOS groups related tasks together and trains on entire groups as single instances.

Step 1: Annotate tasks with skill-relevant tags

For each task $x_i$ in the training set, SkillOS uses Gemini-2.5-Pro to produce a set of tags:

$$Z_i = \{ z_1, z_2, \dots, z_{|Z_i|} \}$$

Each tag z captures a salient aspect of the task — topic, strategy, common pitfall. For ALFWorld, these are the built-in task type annotations (Pick, Clean, Heat, Cool, etc.). For reasoning tasks like MATH, tags might be "algebra," "Fourier transformation," or "inequality manipulation."

Step 2: Partition into groups

Based on tag similarity, SkillOS partitions the full training set $D$ into $M$ groups:

$$D = \{G_1, G_2, \dots, G_M\}, \qquad G_m = \{x_{m,1}, x_{m,2}, \dots, x_{m,|G_m|}\}$$

All tasks within a group share non-trivial skill dependencies — they are the kind of tasks where solving one should help solve the others.
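The text does not say how tag similarity is computed, so the following is only one plausible sketch: a greedy pass that puts each task into the first group whose seed task has a Jaccard tag overlap above a threshold. The function names, the threshold of 0.5, and the example tags are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two tag sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_by_tags(task_tags, threshold=0.5):
    """Greedy grouping: join the first group whose seed tags are similar enough."""
    groups = []  # list of (seed_tags, member_task_ids)
    for task_id, tags in task_tags.items():
        tags = set(tags)
        for seed, members in groups:
            if jaccard(seed, tags) >= threshold:
                members.append(task_id)
                break
        else:
            groups.append((tags, [task_id]))  # no match: start a new group
    return [members for _, members in groups]

# Hypothetical tags for four ALFWorld-style tasks.
task_tags = {
    "heat_egg": {"heat", "microwave"},
    "heat_mug": {"heat", "microwave"},
    "clean_mug": {"clean", "sink"},
    "heat_apple": {"heat", "microwave", "fridge"},
}
groups = group_by_tags(task_tags)
```

A greedy pass depends on task order and on the choice of seed, so a real pipeline might prefer proper clustering; the point here is only that tasks sharing most of their tags end up in the same training group.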

Step 3: Train on groups

Each training step samples one group $G_m$ and starts with an empty SkillRepo. The system then iterates through the group's tasks sequentially, running the same retrieve, execute, curate, and update loop from Chapter 3 on each task.

The first task in each group always uses an empty SkillRepo, so its outcome is independent of curation. The task outcome reward is therefore computed only over tasks 2 through $|G|$:

$$r_{\text{task}} = \frac{1}{|G| - 1} \sum_{i=2}^{|G|} \mathbb{1}(\xi_i)$$

This is the core trick: by grouping related tasks, the paper creates a within-group feedback loop where earlier curation decisions are evaluated by later task outcomes. The curator learns to write skills that help on future related tasks, not just skills that describe the current task.
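As a small sketch of the task outcome reward, the function below averages success over every task except the first. The function name and the example outcome lists are illustrative.

```python
def task_outcome_reward(successes):
    """Mean success over tasks 2..|G|; the first task is excluded
    because it always runs against an empty SkillRepo."""
    if len(successes) < 2:
        raise ValueError("a group needs at least two tasks")
    evaluated = successes[1:]
    return sum(evaluated) / len(evaluated)

# Four-task group: task 1 is ignored, tasks 2-4 are success, fail, success.
r_mixed = task_outcome_reward([False, True, False, True])
# Succeeding only on task 1 earns nothing, since curation played no part.
r_first_only = task_outcome_reward([True, False, False, False])
```

Here `r_mixed` is 2/3 and `r_first_only` is 0, regardless of what happened on the first task.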

Why grouping matters: Without grouping (random task order), there is no guarantee that a skill extracted from Task A will ever be relevant to any future task in the training sequence. Grouping ensures that each training instance contains multiple tasks that share skill dependencies, providing dense, relevant feedback for curation decisions. Ablations show that removing grouping drops ALFWorld success from 61.2% to 57.3%.
Simulation: Grouped Task Streams. Related tasks are clustered into groups. Early tasks (darker) generate skills; later tasks (lighter) evaluate them.

Contrast with prior work

Prior RL-based skill methods like ARISE and UMEM train on short task streams, often just 2 adjacent tasks. This limits the density of feedback: the curator only sees whether a skill helped the immediately next task. SkillOS's longer grouped streams ($|G|$ = 4-8 tasks) expose the curator to multi-hop skill evolution, where a skill inserted after one task can fail on a later task, be updated, and then succeed on a subsequent one.

This feedback arc (insert, fail, update, succeed) cannot be learned from 2-task windows. Grouped streams provide the trajectory length needed to learn update and delete behaviors.

Why does SkillOS exclude Task 1 from the task outcome reward?

Chapter 5: Composite Rewards

Grouped task streams provide the structure for learning curation. But we also need the right reward signal. A single "did the downstream task succeed?" reward is too sparse — the curator makes dozens of micro-decisions (which section to write, how verbose to be, whether to include "When NOT to Use") and needs finer-grained feedback.

SkillOS addresses this with a composite reward that combines four signals, each targeting a different failure mode:

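$$r = r_{\text{task}} + \lambda_f \cdot r_{\text{fc}} + \lambda_u \cdot r_{\text{cnt}} + \lambda_c \cdot r_{\text{comp}}$$

The weights are $\lambda_f = 1.0$, $\lambda_u = 0.1$, and $\lambda_c = 0.05$. Let's examine each component.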

1. Task outcome reward (rtask)

The primary signal. Average success rate over evaluation tasks (tasks 2 through |G|):

$$r_{\text{task}} = \frac{1}{|G| - 1} \sum_{i=2}^{|G|} \mathbb{1}(\xi_i)$$

What it catches: The overall quality of the curated SkillRepo. If the skills are good, downstream tasks succeed more often.

What it misses: Everything about HOW the curator produced those skills. A curator that writes valid, well-structured skills that happen not to be relevant to the evaluation tasks earns no extra $r_{\text{task}}$ for them: the evaluation tasks score the same as if the skills had never been written. We need additional signals to guide learning when task outcomes are uninformative.

2. Function call validity reward (rfc)

Measures whether the curator produces valid, executable function calls:

$$r_{\text{fc}} = \frac{1}{|G|} \sum_{i=1}^{|G|} \text{Valid}(c_i)$$

Where $\text{Valid}(c_i)$ is the fraction of function calls in curation decision $c_i$ that parse correctly and execute successfully. An insert_skill call that references a malformed filename, or an update_skill call targeting a non-existent file, gets a score of 0.

What it catches: Formatting errors, hallucinated filenames, invalid JSON. Without this signal, the curator might spend many early training steps producing outputs that fail to execute at all.
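A minimal sketch of the validity reward, given one list of per-call validity flags for each task in the group. The function name is illustrative, and scoring a curation step with no executable calls as 0 is an assumption; the text does not say how empty steps are treated.

```python
def validity_reward(call_flags_per_task):
    """r_fc: mean over tasks of Valid(c_i), the fraction of valid calls."""
    if not call_flags_per_task:
        raise ValueError("need at least one task")
    fractions = []
    for flags in call_flags_per_task:
        # Assumption: a step with no executable calls scores 0.
        fractions.append(sum(flags) / len(flags) if flags else 0.0)
    return sum(fractions) / len(fractions)

# Four tasks: one valid call, one of two valid, nothing executable, two valid.
r_fc = validity_reward([[True], [True, False], [], [True, True]])
```

The per-task fractions are 1, 0.5, 0, and 1, so `r_fc` comes out to 0.625.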

3. Content quality reward (rcnt)

Uses an external judge (Qwen3-32B) to evaluate whether curated skills are semantically meaningful and likely useful:

$$r_{\text{cnt}} = \frac{1}{|G|} \sum_{i=1}^{|G|} \text{Judge}(c_i)$$

What it catches: Low-quality content. A skill that just copies the raw trajectory verbatim gets a low judge score. A skill that extracts a clean, generalizable workflow gets a high one. This intermediate supervision is critical in a pipelined system where the curator never directly sees downstream task outcomes.

Ablating rcnt drops ALFWorld success from 61.2% to 58.6% — the largest drop among the auxiliary rewards.

4. Compression reward (rcomp)

Discourages verbatim trajectory copying by rewarding concise repository updates:

$$r_{\text{comp}} = \frac{1}{|G|} \sum_{i=1}^{|G|} \left(1 - \frac{|S_i|}{|\chi_i|}\right)$$

Where $|S_i|$ is the token length of the SkillRepo after applying operations at step $i$, and $|\chi_i|$ is the token length of the curator's input context. If the skills are shorter than the input (good: we compressed), the reward is positive. If the skills are longer than the input (bad: we're storing raw trajectories), the reward is negative.

What it catches: Bloated repositories. An important failure mode is the curator copying entire trajectories into skill files instead of distilling them into concise instructions. The compression reward explicitly penalizes this.
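The compression term is simple enough to compute by hand. The sketch below takes precomputed token counts, so the tokenizer is left open; the function name and the example counts are made up for illustration.

```python
def compression_reward(repo_tokens, input_tokens):
    """r_comp: mean over tasks of 1 - |S_i| / |chi_i|.

    repo_tokens[i]  -- token length of the SkillRepo after step i
    input_tokens[i] -- token length of the curator's input at step i
    """
    if not repo_tokens or len(repo_tokens) != len(input_tokens):
        raise ValueError("need two non-empty lists of equal length")
    terms = [1 - s / x for s, x in zip(repo_tokens, input_tokens)]
    return sum(terms) / len(terms)

# Steps 1-2 compress well; step 3 stores more text than the curator read.
r_comp = compression_reward([120, 150, 900], [1000, 1000, 600])
```

The three terms are 0.88, 0.85, and -0.5, giving a mean of 0.41: a single bloated step pulls the reward down sharply.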

Each reward handles a different failure mode: $r_{\text{task}}$ catches skills that do not help downstream tasks, $r_{\text{fc}}$ catches formatting errors, $r_{\text{cnt}}$ catches low-quality or trivial skills, and $r_{\text{comp}}$ catches bloated repositories. Together, they turn a sparse, delayed signal into dense, multi-faceted supervision.
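Simulation: Composite Reward Breakdown. Sliders show how each reward component contributes to the total under the paper's weights ($\lambda_f = 1.0$, $\lambda_u = 0.1$, $\lambda_c = 0.05$). Default slider values: $r_{\text{task}}$ = 0.70, $r_{\text{fc}}$ = 0.90, $r_{\text{cnt}}$ = 0.60, $r_{\text{comp}}$ = 0.50.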

Weight tuning

The paper sets λf = 1.0 (function call validity weighted equally with task outcome), λu = 0.1 (content quality is a soft guide, not a hard constraint), and λc = 0.05 (compression is a gentle nudge). This weighting makes sense: task outcome is the ground truth, function calls must be valid for anything to work, content quality is informative but subjective, and compression is a nice-to-have.
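To see how the weights shape the total, here is a small sketch of the composite reward using the weights stated above. The class and function names are illustrative, and the component values in the example are arbitrary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RewardWeights:
    fc: float = 1.0     # lambda_f, function call validity
    cnt: float = 0.1    # lambda_u, content quality
    comp: float = 0.05  # lambda_c, compression

def composite_reward(r_task, r_fc, r_cnt, r_comp, w=RewardWeights()):
    """r = r_task + lambda_f * r_fc + lambda_u * r_cnt + lambda_c * r_comp"""
    return r_task + w.fc * r_fc + w.cnt * r_cnt + w.comp * r_comp

# Illustrative component values.
total = composite_reward(r_task=0.70, r_fc=0.90, r_cnt=0.60, r_comp=0.50)
```

With these inputs the total is 0.70 + 0.90 + 0.06 + 0.025 = 1.685, which shows how little the content and compression terms move the total compared with task outcome and call validity.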

The ablation study shows that removing which auxiliary reward causes the largest performance drop on ALFWorld?

Chapter 6: GRPO Training

Now we have the training structure (grouped task streams) and the reward signal (composite reward). How do we actually optimize the curator policy? SkillOS uses Group Relative Policy Optimization (GRPO), an RL algorithm originally developed for DeepSeek-Math.

Why GRPO?

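Standard policy gradient methods like PPO require a separate critic network: a value function that estimates expected future reward from each state. Training such a critic for skill curation is problematic, since the curation reward is delayed, sparse, and noisy (Chapter 2), which makes per-state value estimates hard to learn.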

GRPO eliminates the critic entirely. Instead, it estimates advantages by comparing multiple rollouts of the same task group against each other.

How GRPO works

For each task group $G$, SkillOS samples $N$ independent rollouts from the curator policy. Each rollout produces a different sequence of curation decisions, which leads to a different SkillRepo evolution, which leads to different executor outcomes. This gives $N$ composite reward values $\{r_1, r_2, \dots, r_N\}$.

The advantage for rollout $n$ is simply:

$$A_n = r_n - \frac{1}{N} \sum_{n'=1}^{N} r_{n'}$$

That is: "How much better (or worse) was this rollout compared to the average?" No critic, no value function, just relative comparison within the group.
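In code, the mean-centered advantage described above is a short function. The name and the example rewards are illustrative. Note that the original GRPO formulation also divides by the group's standard deviation; this sketch follows the simpler form given in the text.

```python
def group_relative_advantages(rewards):
    """A_n = r_n - mean(r): each rollout compared with its group average."""
    if not rewards:
        raise ValueError("need at least one rollout")
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Three rollouts of the same task group with made-up composite rewards.
advantages = group_relative_advantages([2.0, 1.0, 0.0])
```

The result is `[1.0, 0.0, -1.0]`. Advantages always sum to zero within a group, so a rollout is only reinforced if it beat its siblings.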

The policy is then updated with a clipped surrogate objective (same as PPO's clipping):

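$$L = \mathbb{E}_n\left[ \min\left(\rho_n A_n,\ \text{clip}(\rho_n, 1-\varepsilon, 1+\varepsilon)\, A_n\right) \right]$$

Where $\rho_n = \pi_S(c_n \mid \chi) / \pi_S^{\text{old}}(c_n \mid \chi)$ is the importance ratio between the current curator policy and the old one that generated the rollouts. The clipping prevents the policy from changing too drastically in one step.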
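A scalar sketch of the clipped objective may help. It works on one log-probability per rollout, which is a simplification (practical implementations usually operate per token), and `eps=0.2` is an assumed value since the text does not give one. The function name is illustrative.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean over rollouts of min(rho * A, clip(rho, 1-eps, 1+eps) * A).

    This is the quantity to maximize; negate it to use as a loss.
    """
    if not advantages:
        raise ValueError("need at least one rollout")
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        rho = math.exp(lp_new - lp_old)              # importance ratio
        clipped = min(max(rho, 1 - eps), 1 + eps)    # clip(rho, 1-eps, 1+eps)
        total += min(rho * adv, clipped * adv)
    return total / len(advantages)

# Rollout 1: ratio 1.5 with positive advantage, so the gain is capped at 1.2.
# Rollout 2: ratio 0.5 with negative advantage, so the clipped value -0.8
# is the smaller of the two and is the one used.
objective = clipped_surrogate(
    logp_new=[math.log(1.5), math.log(0.5)],
    logp_old=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
```

The two terms are 1.2 and -0.8, so the objective is 0.2. Taking the minimum makes the estimate pessimistic: the policy gets no extra credit for moving the ratio beyond the clip range in the favorable direction.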

No KL penalty: Unlike standard GRPO implementations, SkillOS discards the KL divergence term that penalizes deviation from the reference policy. This is deliberate: the authors want to encourage exploration. Early in training, the curator needs to experiment with update and delete operations it has never tried before. A KL penalty would anchor it to the insertion-heavy initial behavior.

Training details

| Hyperparameter | Value |
| --- | --- |
| Base model for $\pi_S$ | Qwen3-8B |
| Executor during training | Qwen3-8B (frozen) |
| Learning rate | $1 \times 10^{-6}$ |
| Batch size | 32 (task groups per batch) |
| Group size N (rollouts per group) | 8 |
| Hardware | 16 × H100 GPUs |
| Training time (ALFWorld) | ~3 days |
| Training time (WebShop) | ~5 days |
| Training time (Reasoning) | ~2.5 days |
| Framework | verl (HybridFlow) |

The training trajectory

The paper provides a fascinating view of how the curator evolves during training (Figure 4). The operation distribution tells the story:

Early training (steps 1-10): Insert dominates at ~95%. The curator knows only one move: "See trajectory, write new skill." This is the naive behavior — pure expansion.

Mid training (steps 10-30): Update grows to ~25%. The curator learns that revising existing skills is more valuable than creating new ones. It starts recognizing when an existing skill almost matches but needs refinement.

Late training (steps 30+): Delete appears at ~5-8%. The curator learns to prune redundant or harmful skills. The SkillRepo becomes more curated, not just larger.

Simulation: GRPO Training: Advantage from Group Comparisons. N rollouts of the same task group produce different rewards, and advantages are computed relative to the group mean. The default is N = 8 rollouts.

Worked example: one training step

Let's trace a single training step. The batch samples group G = {Heat Egg, Heat Mug, Heat Apple, Heat Potato}. SkillRepo starts empty.

Rollout 1: After Task 1, curator inserts "Heating objects workflow" skill. Tasks 2-4 all succeed using this skill. rtask = 1.0. Total r = 1.0 + 0.92 + 0.07 + 0.04 = 2.03.

Rollout 2: After Task 1, curator inserts a very verbose skill (copies entire trajectory). Task 2 succeeds but slowly (executor confused by long skill). Task 3 fails. Task 4 succeeds. rtask = 0.67. Compression reward low (0.2). Total r = 0.67 + 0.85 + 0.05 + 0.01 = 1.58.

Rollout 3: Curator produces invalid JSON for the insert call. No skills are added. Tasks 2-4 run without skills. rtask = 0.33. rfc = 0. Total r = 0.33 + 0 + 0 + 0.05 = 0.38.

Mean reward: (2.03 + 1.58 + 0.38) / 3 = 1.33. Advantages: A1 = +0.70, A2 = +0.25, A3 = -0.95. GRPO reinforces Rollout 1's behavior and suppresses Rollout 3's.
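The totals in this example add weighted contributions, not raw component scores. Written out with the weights $\lambda_f = 1.0$, $\lambda_u = 0.1$, $\lambda_c = 0.05$, the arithmetic looks as follows. Only Rollout 2's compression score of 0.2 is stated directly; the other raw content and compression scores are inferred by dividing each stated contribution by its weight.

```latex
\begin{align*}
r^{(1)} &= 1.0 + \underbrace{1.0 \times 0.92}_{\lambda_f r_{\text{fc}}}
               + \underbrace{0.1 \times 0.7}_{\lambda_u r_{\text{cnt}}}
               + \underbrace{0.05 \times 0.8}_{\lambda_c r_{\text{comp}}} = 2.03 \\
r^{(2)} &= 0.67 + 1.0 \times 0.85 + 0.1 \times 0.5 + 0.05 \times 0.2 = 1.58 \\
r^{(3)} &= 0.33 + 1.0 \times 0 + 0.1 \times 0 + 0.05 \times 1.0 = 0.38 \\
\bar{r} &= \tfrac{1}{3}\,(2.03 + 1.58 + 0.38) = 1.33 \\
A_1 &= 2.03 - 1.33 = +0.70, \quad
A_2 = 1.58 - 1.33 = +0.25, \quad
A_3 = 0.38 - 1.33 = -0.95
\end{align*}
```

Rollout 3's inferred compression score of 1.0 matches an empty SkillRepo, where $|S_i| = 0$ gives $1 - 0/|\chi_i| = 1$. It is the only term that rollout earns beyond its task outcome.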

What is the key advantage of GRPO over PPO for training the skill curator?

Chapter 7: Results

SkillOS is evaluated across three benchmark categories with multiple executor backbones. The results tell a consistent story: trained curation beats both no-memory baselines and heuristic-based memory systems.

ALFWorld: Multi-turn household tasks

ALFWorld is a text-based environment where agents navigate rooms, manipulate objects, and complete household tasks ("Put a heated egg on the counter," "Examine the mug under desklamp"). There are 6 task subtypes: Pick, Look, Clean, Heat, Cool, and Pick2. Results are reported as success rate (SR) and average interaction steps.

With Qwen3-8B as executor:

| Method | Avg SR (%) | Steps |
| --- | --- | --- |
| No Memory | 47.9 | 21.1 |
| ReasoningBank | 55.7 | 20.1 |
| MemP | 49.7 | 21.0 |
| SkillOS-base (no RL) | 53.1 | 20.4 |
| SkillOS-gemini (Gemini curator) | 50.7 | 20.8 |
| SkillOS | 61.2 | 18.9 |

Three things stand out. First, SkillOS beats the strongest baseline (ReasoningBank) by +5.5 absolute points. Second, SkillOS reduces interaction steps from 21.1 to 18.9 — the agent is not just more successful, it is faster. Third, the RL-trained 8B curator outperforms Gemini-2.5-Pro used directly as curator (SkillOS-gemini: 50.7%). A small, targeted model beats a frontier model at this specific skill.

8B beats Gemini-2.5-Pro: SkillOS-gemini uses Gemini-2.5-Pro as the curator — a frontier model with far more parameters and reasoning capability. Yet SkillOS's trained 8B curator scores 61.2% vs. 50.7%. This demonstrates that targeted RL training on the specific curation task outweighs raw model scale. The frontier model writes plausible-looking skills, but they may not match what the executor actually needs.

WebShop: Online shopping tasks

WebShop simulates an online shopping environment. The agent navigates a web interface to find and purchase products matching user specifications. Metrics: score, success rate (SR), and interaction steps.

| Method | Score | SR (%) | Steps |
| --- | --- | --- | --- |
| No Memory | 33.3 | 9.8 | 20.3 |
| ReasoningBank | 35.4 | 11.4 | 20.5 |
| SkillOS-base | 38.6 | 13.6 | 20.1 |
| SkillOS | 40.6 | 16.5 | 19.4 |

SkillOS improves SR from 9.8% (no memory) to 16.5%, a 68% relative improvement. Gains persist with a stronger executor, though they are smaller: with Gemini-2.5-Pro as executor, SkillOS reaches 41.3% SR vs. 38.4% for no memory (+2.9 points).

Reasoning tasks: AIME24, AIME25, GPQA-Diamond

Single-turn reasoning tasks show more modest gains, but SkillOS still improves consistently:

| Method | AIME24 | AIME25 | GPQA | Avg |
| --- | --- | --- | --- | --- |
| No Memory | 76.0 | 71.1 | 61.8 | 69.6 |
| ReasoningBank | 75.4 | 73.2 | 60.3 | 69.6 |
| SkillOS | 80.0 | 76.7 | 64.6 | 73.8 |

The gains are smaller (+4.2 average accuracy) because reasoning tasks benefit from more abstract skill types (decomposition heuristics, verification patterns) that are harder to capture in procedural skills. Still, SkillOS is the only method that consistently improves over no-memory.

Cross-executor and cross-task transfer

A crucial test: does a curator trained with Qwen3-8B executor transfer to different executors? Yes. SkillOS lifts Gemini-2.5-Pro's ALFWorld SR from 66.4% to 80.2% — a +13.8 improvement, even though the curator never saw this executor during training.

Cross-task transfer (Figure 3 in the paper) also works: a curator trained on reasoning tasks improves ALFWorld performance by +13.3 with Qwen3-8B executor. The reasoning-trained curator learns abstract strategies (decomposition, verification, adaptive planning) that transfer to agentic tasks.

Simulation: Results Dashboard. Performance comparison across methods and benchmarks.

When SkillOS's 8B curator (trained with Qwen3-8B executor) is paired with Gemini-2.5-Pro executor, what happens?

Chapter 8: Emergent Skills

The most fascinating finding in the paper is not the performance numbers — it is what happens inside the SkillRepo as training progresses. The curator does not just get better at inserting skills. It develops an entirely new organizational structure that was never explicitly programmed.

New Markdown sections emerge

The skill format suggests three sections: Workflow, When NOT to Use, and Prerequisites. But SkillOS's trained curator creates additional sections that were never specified. Figure 5(a) in the paper tracks these emergent sections across training:

Early training: The curator adds generic sections — "Additional Guidance," "Tips and Recommendations," "Enhancement." These are verbose and add little operational value. They are the model's default verbosity patterns.

Late training: The sections become execution-oriented, for example "Retry Logic."

RL gradually steered the curator from superficial enrichment toward execution-oriented skill refinement. The curator learned — through trial and error — that a "Retry Logic" section makes the executor more robust, while a "Tips" section just adds noise.

Meta-skills emerge

Even more remarkable: the SkillRepo develops skills about skills. Figure 5(b) tracks the evolution of skill categories:

Early SkillRepo: Dominated by narrow, task-specific skills. "How to heat an egg." "How to clean a mug." Each skill covers exactly one task variant.

Late SkillRepo: A diverse mix including meta-strategy skills, in categories such as "state verification" and "failure recovery."

The task-specific skills (e.g., "task-object specific," "task-location specific") shrink from dominating the repo to occupying less than 30%. The curator discovers that abstract, compositional skills are more valuable than narrow, task-specific ones.

Organizational discovery: Nobody told the curator to create meta-skills. Nobody specified categories like "state verification" or "failure recovery." The curator discovered through RL that a repo organized around reusable strategies outperforms one organized around specific tasks. This is emergent structure from reward optimization.

Skill utilization becomes more targeted

Figure 6 in the paper compares skill usage statistics between SkillOS-base and SkillOS:

| Metric | SkillOS-base | SkillOS |
| --- | --- | --- |
| Skill usage rate | 87.9% | 100% |
| Successful skill usage rate | 53.6% | 61.2% |
| Skill coverage | 72.9% | 88.6% |
| Avg skills per example | 2.24 | 1.95 |

SkillOS invokes skills on 100% of evaluation examples (vs. 87.9% for the base) and achieves higher success when doing so. Crucially, it uses fewer skills per example (1.95 vs. 2.24) while achieving better coverage of the repo (88.6% vs. 72.9%). The trained curator produces skills that are more precisely targeted — less noise, more signal.
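The four statistics can be computed from per-example logs. The sketch below uses plausible definitions that are assumptions, not the paper's stated formulas: usage rate is the share of examples that invoke at least one skill, successful usage rate is the success rate among those examples, coverage is the share of repo skills invoked at least once, and the average counts skills over all examples. The `EpisodeLog` type and the toy logs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EpisodeLog:
    skills_used: list[str]  # names of skills the executor invoked
    success: bool           # whether the task was solved


def utilization_stats(logs: list[EpisodeLog], repo: set[str]) -> dict[str, float]:
    """Compute skill-usage statistics under the assumed definitions above."""
    with_skills = [e for e in logs if e.skills_used]
    used = {s for e in logs for s in e.skills_used}
    successes = sum(e.success for e in with_skills)
    return {
        "usage_rate": len(with_skills) / len(logs),
        "successful_usage_rate": successes / len(with_skills) if with_skills else 0.0,
        "coverage": len(used & repo) / len(repo),
        "avg_skills_per_example": sum(len(e.skills_used) for e in logs) / len(logs),
    }


# Toy data for illustration only; these are not results from the paper.
toy_logs = [
    EpisodeLog(["a", "b"], True),
    EpisodeLog(["a"], False),
    EpisodeLog([], True),
    EpisodeLog(["c"], True),
]
stats = utilization_stats(toy_logs, repo={"a", "b", "c", "d"})
print(stats)
```

On the toy logs, three of four examples use a skill (usage rate 0.75), two of those three succeed, three of four repo skills are touched (coverage 0.75), and the average is 1.0 skills per example.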

A concrete example: skill merging

Early in training, the curator might create separate skills for "Heat egg in microwave" and "Heat mug in microwave." These have 80% identical content — both involve finding the microwave, putting the object in, and turning it on. Only the object name differs.

After RL training, the curator learns to create a single "Heat objects using microwave" skill with a conditional: "Works for any heatable object (egg, mug, apple, potato). Verify object is picked up before approaching microwave." This merged skill is more compact, more retrievable (matches more queries), and easier for the executor to follow.
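To see what "mostly identical content" looks like in practice, the sketch below scores token-level overlap between skill texts with Python's `difflib` and lists pairs above a threshold. This is only an illustration: the SkillOS curator is a learned policy, not a similarity heuristic, and the skill texts, the 0.8 threshold, and the function names are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def token_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on matching token runs (difflib's ratio)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def merge_candidates(skills: dict[str, str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Return pairs of skill names whose bodies are at least `threshold` similar."""
    return [
        (x, y)
        for x, y in combinations(sorted(skills), 2)
        if token_similarity(skills[x], skills[y]) >= threshold
    ]


# Hypothetical skill bodies, written for this example only.
skills = {
    "heat_egg": "Find the microwave. Put the egg in the microwave. Turn the microwave on.",
    "heat_mug": "Find the microwave. Put the mug in the microwave. Turn the microwave on.",
    "clean_mug": "Find the sink. Put the mug in the sink. Turn the faucet on. Rinse the mug.",
}

print(merge_candidates(skills))  # [('heat_egg', 'heat_mug')]
```

The two heating skills differ in a single token, so they clear the threshold, while the cleaning skill shares only scaffolding words with them and does not.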

The compression reward (r_comp) nudges this behavior, but it is the task outcome reward (r_task) that truly drives it: a merged skill that covers four task variants produces better outcomes than four fragmented skills that might or might not be retrieved.

Chapter 9: Connections

SkillOS sits at the intersection of several active research areas. Let's map where it fits and what comes next.

The self-evolving agent landscape

| System | Memory Type | Curation Method | Key Difference from SkillOS |
| --- | --- | --- | --- |
| ReAct | None (stateless) | N/A | No memory at all; pure reasoning + acting |
| Reflexion | Text reflections | Prompted LLM | Stores verbal self-critiques, not reusable skills |
| ReasoningBank | Distilled insights | Prompted LLM | No RL training; heuristic curation only |
| MemP | Procedural memory | Heuristic operations | Fixed rules for memory management |
| Anthropic Skills | SKILL.md folders | Manual curation | Human-written skills; no automation |
| ARISE | Skill library | RL (retrieval + use) | Trains retrieval + execution, heuristic management |
| SkillRL / D2Skill | Pre-curated skills | RL (use only) | Trains agents to use skills, not to curate them |
| SkillOS | Markdown skills | RL (curation) | First to train curation end-to-end via long-horizon RL |

SkillOS's unique position: All prior systems either (a) use heuristic curation, (b) train skill usage but not curation, or (c) train on short-horizon feedback. SkillOS is the first to train the curation policy itself with long-horizon, executor-grounded RL feedback.

Limitations and open questions

Retrieval bottleneck. SkillOS uses BM25 for skill retrieval — a lexical matching algorithm that cannot capture semantic similarity. A skill titled "Systematic Container Search" would not match a query about "finding hidden objects." Dense retrieval (embedding-based) could significantly improve skill utilization.
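The lexical mismatch is easy to reproduce with a small, self-contained BM25 scorer. This is a sketch under assumptions: it indexes skill titles only, tokenizes on whitespace, uses the common Lucene-style IDF, and picks `k1 = 1.5` and `b = 0.75` as typical defaults. The text does not say which BM25 variant, parameters, or indexed fields SkillOS uses.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenization (an assumption, kept deliberately simple)."""
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


titles = ["Systematic Container Search", "Heat objects using microwave"]
scores = bm25_scores("finding hidden objects", titles)
print(scores)  # first score is 0.0 (no shared tokens); second is positive
```

The relevant skill scores exactly zero because it shares no token with the query, while an unrelated heating skill scores higher just because it contains the word "objects". An embedding-based retriever compares meanings rather than surface tokens, which is why it is suggested as a remedy.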

Fixed skill format. Skills are always single Markdown files. More complex formats — nested folder structures like Anthropic's full SKILL.md spec, or programmatic skills with executable code — could encode richer knowledge.

Training cost. 3-5 days on 16 H100 GPUs (roughly 1,150 to 1,920 GPU-hours) is substantial. Each training step requires rolling out entire task groups through the executor, which involves multiple LLM inference calls. More efficient training methods (offline RL, distillation from larger curators) could reduce this cost.

Catastrophic forgetting. The paper does not explore what happens when the task distribution shifts. A curator trained on ALFWorld heating tasks might produce inappropriate skills when suddenly faced with navigation tasks. Continual learning of the curator is an open problem.

The bigger picture: what this enables

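SkillOS demonstrates a powerful principle: you can train a small model to be an excellent specialist. An 8B model trained specifically for skill curation outperforms a frontier model doing the same task zero-shot. This suggests a future architecture for AI agents: a multi-agent, modular design in which a capable executor solves tasks while a small, specialized curator maintains the skill library.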

This multi-agent modular design mirrors how human organizations work: senior engineers solve problems, technical writers maintain the knowledge base, and everyone benefits from shared documentation.

[Figure: The Self-Evolving Agent Landscape. Where SkillOS fits among memory-based agent systems. Axes: curation automation (x) and feedback horizon (y).]

"What I cannot create, I do not understand. What I cannot curate, I cannot scale."
— Adapted from Richard Feynman
