SkillOS (Ouyang et al. 2026)

Chapter 0: The One-Off Problem

Imagine you're a new hire at a help desk. Your first week, a customer asks how to reset their password. You figure it out from scratch — reading docs, trying things, eventually solving it. Twenty minutes later, another customer asks the same question. You start from scratch again. No notes, no checklist, no memory of what just worked.

That is how virtually every LLM agent operates today. Each task arrives. The agent reasons, takes actions, maybe succeeds. Then the interaction ends, the context window is discarded, and the next task begins with a completely blank slate.

This is not a resource limitation — it is an architectural choice. Current agents are stateless by design. They have no mechanism to extract lessons from Task 1 and apply them to Task 47. Every problem is solved as if it were the very first problem the agent has ever encountered.

The cost of starting from scratch

This one-off pattern wastes effort in three concrete ways:

Redundant exploration. If an agent discovers that "navigate to desk, then examine mug under desklamp" is the correct strategy for inspection tasks in a household environment, that insight evaporates. The next inspection task triggers the same trial-and-error search. In ALFWorld (a text-based household benchmark), agents without memory take 21.1 interaction steps on average per task. With SkillOS's learned skills, that drops to 18.9 — a 10.4% reduction.

Repeated failures. Worse than re-exploring, stateless agents repeat the same mistakes. If a particular approach fails on Task 5, nothing prevents the agent from trying the identical approach on Task 32. There is no procedural memory to encode "when NOT to do X."

No capability growth. A human help desk agent after 1,000 tickets is dramatically better than after their first. A stateless LLM agent after 1,000 tasks is identical to the one that started. The curve is flat.

The One-Off Agent: No Memory Persists

An agent solves sequential tasks, starting from scratch each time. The blue bar shows current knowledge, which resets to zero between tasks.

The canvas shows the tragedy clearly. Every colored block is exploration effort. Every time the bar resets to zero, that effort is lost. A self-evolving agent would carry forward the insight from each task, building a rising curve of capability instead of a flat line.

What would a self-evolving agent look like?

The ideal is an agent that maintains a growing library of reusable skills — Markdown files containing workflows, constraints, and heuristics extracted from past experience. When a new task arrives, the agent retrieves relevant skills from this library and executes more efficiently. After each task, it updates the library with new lessons.

The paper by Ouyang et al. (2026) proposes SkillOS, an RL training recipe that teaches an 8B-parameter model to perform this skill curation — deciding what to insert, what to update, and what to delete from the skill library. The trained curator outperforms even Gemini-2.5-Pro used directly as a curator, demonstrating that targeted training of a small model can beat raw scale.

The core insight: The bottleneck for self-evolving agents is not skill execution (applying skills to tasks) — it is skill curation (deciding what to extract, when to update, and when to delete). SkillOS isolates this curation responsibility into a trainable module and uses RL to optimize it.

An LLM agent solves 100 sequential tasks without any memory mechanism. On task #100, how does its performance compare to task #1?

Essentially identical: no knowledge accumulates between tasks
Slightly better: the model implicitly learns from the context
Much better: the agent naturally improves with experience

Chapter 1: Skills as Memory

If we want agents to accumulate experience, we need a format for storing it. Not raw trajectories — those are too long and too specific. Not abstract summaries — those lose the actionable detail. We need something in between: structured, reusable, retrievable.

SkillOS follows a design inspired by Anthropic's SKILL.md format: each skill is a single Markdown file stored in an external repository. The file has two parts:

1. YAML frontmatter — specifies the skill name and a natural-language description of when to use it. This is what the retrieval system matches against.

2. Markdown body — contains the executable knowledge: workflows, constraints, prerequisites, and heuristics. The paper suggests three sections as a starting point, but allows the curator to create additional sections as it learns.

Anatomy of a skill file

Here is a real skill that SkillOS's trained curator produced for ALFWorld inspection tasks:

markdown
---
name: Use light source to examine
description: Ensure object is examined
  under proper light source by navigating
  to the correct lamp location first
---

# Workflow
1. Navigate to the light source (desklamp,
   floorlamp) location first
2. Pick up the target object
3. Use the "examine" action with the light
   source, not the object

# When NOT to Use
- If the light source is not in the
  current room
- If the object is already being examined

# Prerequisite Constraints
- Agent must have free hands
- Light source must be turned on

Notice three critical properties of this format:

Retrievable. The YAML frontmatter contains a description that can be matched against incoming tasks using BM25 (a standard text retrieval algorithm). When a new task says "examine the mug under desklamp," BM25 matches it against "Use light source to examine" and retrieves this skill.

Actionable. The workflow section gives step-by-step instructions the executor can follow directly. This is not abstract wisdom ("inspection is important") — it is a concrete recipe ("navigate to lamp first, then pick up, then examine").

Guarded. The "When NOT to Use" section prevents misapplication. This is crucial: a skill that fires on the wrong task actively hurts performance by misleading the executor.
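The retrieval step described above can be sketched with a small from-scratch Okapi BM25 scorer. This is an illustrative sketch, not the paper's retriever: the tokenizer, the `k1` and `b` defaults, and the two extra skill descriptions are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Name + description strings; the second and third entries are hypothetical.
skills = [
    "Use light source to examine: ensure object is examined under proper "
    "light source by navigating to the correct lamp location first",
    "Heat object in microwave: open microwave, place object inside, heat it, then take it out",
    "Clean object in sinkbasin: carry object to sink and rinse it",
]
scores = bm25_scores("examine the mug under desklamp", skills)
best = max(range(len(skills)), key=scores.__getitem__)
print(skills[best])
```

Only the first description shares terms with the query ("examine", "under", "the"), so it is the only one with a non-zero score. Note that "mug" and "desklamp" contribute nothing here because they do not appear in the description, which hints at why the wording of the frontmatter matters for retrieval.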

Why Markdown? LLMs already understand Markdown natively. No special parsing, no embedding conversion, no schema definition. The skill files are injected directly into the executor's prompt context. The model reads them like instructions from a colleague.
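To make the two-part structure concrete, here is a minimal sketch of splitting a skill file into its frontmatter fields and Markdown body. The function name `parse_skill` is hypothetical, and the parser handles only flat `key: value` pairs with indented continuation lines, which is all the example skill uses. A real system would more likely use a YAML library.

```python
def parse_skill(text: str) -> dict:
    """Split a skill file into frontmatter fields and a Markdown body."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("skill file must start with a '---' frontmatter fence")
    end = lines.index("---", 1)  # closing fence of the frontmatter
    meta: dict[str, str] = {}
    key = None
    for line in lines[1:end]:
        if line[:1].isspace() and key is not None:
            # An indented line continues the previous value.
            meta[key] += " " + line.strip()
        else:
            key, _, value = line.partition(":")
            key = key.strip()
            meta[key] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return {"meta": meta, "body": body}


SKILL = """---
name: Use light source to examine
description: Ensure object is examined
  under proper light source by navigating
  to the correct lamp location first
---

# Workflow
1. Navigate to the light source (desklamp,
   floorlamp) location first
"""

skill = parse_skill(SKILL)
print(skill["meta"]["name"])
print(skill["meta"]["description"])
```

The `meta` fields are what a retriever would index. The `body` is what would be pasted into the executor's prompt unchanged.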

Skill File Anatomy

A structured skill file with its two components: the YAML frontmatter and the Markdown body.

The SkillRepo as a whole

The complete skill collection is called the SkillRepo, denoted S_t at time step t. It is simply a set of N_t Markdown files:

S_t = { s₁, s₂, ..., s_{N_t} }

The SkillRepo starts empty (S₀ = {}) and grows as the curator processes task trajectories. Three operations modify it:

Operation	Function Call	Effect
Insert	insert_skill(name, content)	Creates a new .md file in the repo
Update	update_skill(name, content)	Replaces the content of an existing file
Delete	delete_skill(name)	Removes a file from the repo

These are implemented as function calls — the curator generates structured JSON that specifies the operation, the target file, and (for insert/update) the new content. The system executes them against the SkillRepo, exactly like file I/O operations in an operating system. This is where the name "SkillOS" comes from.
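A minimal sketch of how such calls might be applied, treating the SkillRepo as a dictionary from skill name to Markdown content. The JSON schema (`op`, `name`, `content`) and the choice to reject an insert over an existing name are assumptions for illustration. The source does not specify the wire format.

```python
import json


def apply_ops(repo: dict[str, str], calls_json: str) -> dict[str, str]:
    """Apply a JSON list of curation calls to a copy of the repo."""
    new_repo = dict(repo)
    for call in json.loads(calls_json):
        op, name = call["op"], call["name"]
        if op == "insert_skill":
            if name in new_repo:
                raise KeyError(f"skill already exists: {name}")
            new_repo[name] = call["content"]
        elif op == "update_skill":
            if name not in new_repo:
                raise KeyError(f"no such skill: {name}")
            new_repo[name] = call["content"]
        elif op == "delete_skill":
            if name not in new_repo:
                raise KeyError(f"no such skill: {name}")
            del new_repo[name]
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_repo


calls = json.dumps([
    {"op": "insert_skill", "name": "heat-object", "content": "# Workflow\n1. Open microwave"},
    {"op": "insert_skill", "name": "tmp", "content": "# Workflow\n1. Placeholder"},
    {"op": "update_skill", "name": "heat-object", "content": "# Workflow\n1. Go to microwave"},
    {"op": "delete_skill", "name": "tmp"},
])
repo = apply_ops({}, calls)
print(sorted(repo))
```

Returning a new dictionary rather than mutating in place keeps each S_t available for inspection, which is convenient when comparing the repository before and after a curation step.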

The OS analogy: Just as an operating system manages files on disk — creating, modifying, deleting — SkillOS manages skill files in a repository. The curator is the process making system calls. The SkillRepo is the filesystem. The executor is the application that reads from it.

A skill file in SkillOS has two main components. What are they?

Python code + unit tests
YAML frontmatter (name/description) + Markdown body (workflows/constraints)
Embedding vector + raw trajectory

Chapter 2: The Curation Bottleneck

You might think that once we have the skill format and the three operations (insert, update, delete), the problem is solved. Just prompt the LLM: "Given this task trajectory, produce skill operations." Systems like ReasoningBank and MemP do exactly this — they use heuristic rules or prompted LLMs to manage memory.

It does not work well. The fundamental problem is that curation quality has delayed, indirect feedback.

The delayed feedback problem

Consider a concrete scenario. The agent just completed Task 12 in ALFWorld — "Put a heated egg on the counter." The curator observes the trajectory and decides to insert a skill about heating objects in the microwave. Was this a good decision?

We cannot know until Task 37, when another heating task arrives and the executor retrieves and applies this skill. If Task 37 succeeds faster because of the skill, the insert was good. If the skill's instructions were slightly wrong and caused the executor to fail, the insert was bad. Either way, the feedback arrives 25 tasks later and is mixed with dozens of other confounding factors.

This is fundamentally different from the executor's learning signal. The executor gets reward immediately: "Did you complete the current task? Yes/No." The curator's reward is: "Did the skill you wrote 25 tasks ago help a different task that happened to need it?" That signal is delayed, sparse, and noisy.

The credit assignment problem: When the executor succeeds on Task 37, who gets credit? The executor that took the right actions? The retriever that found the right skill? The curator that wrote the skill 25 tasks ago? All three contributed, but the curator's contribution is the hardest to isolate.

Why heuristics fail

Existing approaches use fixed rules for skill curation. Some examples from the literature:

"Always insert after a successful task." This floods the repo with redundant, overlapping skills. If ten similar tasks all succeed, you get ten nearly-identical skill files that confuse retrieval.

"Delete skills that haven't been used in K tasks." This kills skills that are rare but crucial. A skill for "heating objects in microwave" might only trigger once every 20 tasks, but when it fires, it is essential.

"Update a skill if the executor failed while using it." This conflates skill quality with task difficulty. The executor might have failed because the task was genuinely hard, not because the skill was wrong.

The common thread: heuristics make local decisions without downstream performance feedback. They cannot learn that "inserting a concise skill with a 'When NOT to Use' section leads to 12% higher success on future related tasks" because they never see the downstream outcome.

Insert is easy. Update and delete are hard.

The paper makes an important empirical observation: untrained curators (SkillOS-base) overwhelmingly choose insert. Figure 4 in the paper shows that at the start of training, insert accounts for nearly 100% of all operations. The curator just blindly adds new skills after every task.

This makes intuitive sense. Insert is the "safe" operation — you are adding information, not destroying it. Update requires judging which part of an existing skill is wrong and how to fix it. Delete requires judging that a skill is actively harmful or redundant — a harder call than "this seems useful."

RL training shifts this distribution dramatically. By the end of training, update operations account for a growing fraction (from ~0% to ~25%), and delete begins appearing as well. The curator learns that skill quality matters more than skill quantity.

Curation Decision Space

Three curation operations and the failure modes that make each one hard to get right.

The training signal gap: Prior RL work on skills (SkillRL, D2Skill) focuses on teaching agents to use skills, not to curate them. ARISE trains retrieval + execution jointly but uses heuristics for management. SkillOS is the first to train the curation policy end-to-end via RL with long-horizon feedback.

Why is skill curation harder to learn than skill execution?

Because curation requires a larger model
Because curation is computationally expensive
Because curation feedback is delayed and indirect: the quality of a skill is only revealed when a future task uses it

Chapter 3: SkillOS Architecture

SkillOS's key architectural insight is separation of concerns. Instead of training one monolithic model to do everything — solve tasks, extract lessons, manage the skill library — SkillOS splits the system into two independent modules with a shared data structure between them.

The three components

1. Agent Executor (π_L) — a frozen LLM that solves tasks. Given a task description x_t, the current environment observation o_t, and a set of retrieved skills S̃_t, the executor produces actions:

a ~ π_L( · | x_t, o_t, S̃_t )

The executor is frozen throughout training — its weights never change. This is deliberate: we want to test whether better skills improve performance, not whether a better executor does. The executor can be any model: Qwen3-8B, Qwen3-32B, or Gemini-2.5-Pro. SkillOS's trained curator generalizes across all of them.

2. Skill Curator (π_S) — a trainable LLM (Qwen3-8B base) that manages the SkillRepo. After the executor completes a task, the curator observes:

The full execution trajectory ξ_t = (o₁, a₁, ..., o_n, a_n)
Self-judged correctness indicator 1_{ξ_t}
Retrieved subset of existing skills S̃_t

And generates a sequence of curation operations:

c_t = (u₁, ..., u_M) ~ π_S( · | ξ_t, 1_{ξ_t}, S̃_t )

Each u_m is a function call: insert_skill, update_skill, or delete_skill. These are structured JSON outputs that the system executes against the SkillRepo.

3. SkillRepo (S_t) — the external skill repository. A collection of Markdown files that grows, changes, and (sometimes) shrinks as the curator operates. Skills are retrieved via BM25 matching against the task description.

Why freeze the executor? If we trained both the executor and curator jointly, we could never tell whether improvements came from better skills or a better executor. By freezing the executor, any performance gain is directly attributable to skill quality. This is the key insight for learning curation: ground curation in downstream executor performance.

The streaming loop

At deployment, SkillOS processes a stream of tasks D = {x₁, x₂, ..., x_T} sequentially. For each task:

1. Retrieve

BM25 matches task description x_t against SkillRepo S_t to get relevant skills S̃_t

↓

2. Execute

Frozen executor π_L solves the task using retrieved skills, producing trajectory ξ_t

↓

3. Curate

Trained curator π_S observes trajectory + result, generates insert/update/delete operations

↓

4. Update

Operations applied to SkillRepo: S_{t+1} = ApplyOps(S_t, c_t)

↻ next task

This forms a closed loop: the executor's performance depends on the skills the curator produced, and the curator learns from the executor's subsequent performance. The SkillRepo is the shared memory that mediates between them.
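The four steps of the loop can be sketched as a single function that takes the components as callables. Everything below is a toy stand-in: the retriever, executor, and curator are hypothetical stubs chosen so the loop runs end to end, not models of the real components.

```python
Repo = dict[str, str]        # skill name -> Markdown content
Op = tuple[str, str, str]    # (operation, skill name, content)


def run_stream(tasks, retrieve, execute, curate, apply_ops):
    """Process tasks sequentially: retrieve, execute, curate, update."""
    repo: Repo = {}                                   # S_0 is empty
    outcomes: list[bool] = []
    for task in tasks:
        skills = retrieve(task, repo)                 # 1. Retrieve
        trajectory, success = execute(task, skills)   # 2. Execute (frozen)
        ops = curate(trajectory, success, skills)     # 3. Curate
        repo = apply_ops(repo, ops)                   # 4. Update
        outcomes.append(success)
    return repo, outcomes


def retrieve(task: str, repo: Repo) -> list[str]:
    # Toy retriever: return skills whose name appears in the task text.
    return [content for name, content in repo.items() if name in task]


def execute(task: str, skills: list[str]) -> tuple[str, bool]:
    # Toy executor: succeeds only when at least one skill was retrieved.
    return f"trajectory for {task!r}", bool(skills)


def curate(trajectory: str, success: bool, skills: list[str]) -> list[Op]:
    # Toy curator: if no skill was available, insert one named "heat".
    if skills:
        return []
    return [("insert_skill", "heat", "# Workflow\n1. Use the microwave")]


def apply_ops(repo: Repo, ops: list[Op]) -> Repo:
    new_repo = dict(repo)
    for op, name, content in ops:
        if op == "delete_skill":
            new_repo.pop(name, None)
        else:  # insert_skill and update_skill both write the file
            new_repo[name] = content
    return new_repo


repo, outcomes = run_stream(
    ["heat egg", "heat mug", "heat apple"], retrieve, execute, curate, apply_ops
)
print(outcomes)
```

In this toy run the first task fails with an empty repository, the curator inserts a skill, and the two later tasks succeed because they retrieve it. That is the dependency the closed loop relies on.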

SkillOS Streaming Loop

The full SkillOS pipeline processing a stream of tasks, shown step by step, with a no-memory baseline ("Without Curation") for comparison.

Data flow: what goes where

Let's trace the exact data flowing through the system for one task:

Component	Input	Output	Trainable?
BM25 Retriever	Task desc x_t + SkillRepo S_t	Top-k skills S̃_t	No (fixed algorithm)
Executor π_L	x_t + observations + S̃_t	Trajectory ξ_t	No (frozen)
Self-judge	Trajectory ξ_t	Binary correctness 1_ξ	No (LLM-as-judge)
Curator π_S	ξ_t + 1_ξ + S̃_t	Operations c_t	Yes (RL-trained)
ApplyOps	S_t + c_t	S_{t+1}	No (deterministic)

Only one component is trainable: the curator. Everything else is fixed. This tight bottleneck means all learning signal flows through a single policy, making optimization tractable.

In SkillOS, which component is trained via RL?

The agent executor
The skill curator only
Both the executor and the curator jointly

Chapter 4: Grouped Task Streams

We established that curation feedback is delayed and indirect. The curator inserts a skill now, and its value is revealed only when a related future task benefits from it. So how do we construct training data that provides this feedback?

SkillOS's first key design: grouped task streams. Instead of training on random sequences of tasks, SkillOS groups related tasks together and trains on entire groups as single instances.

Step 1: Annotate tasks with skill-relevant tags

For each task x_i in the training set, SkillOS uses Gemini-2.5-Pro to produce a set of tags:

Z_i = { z₁, z₂, ..., z_{|Z_i|} }

Each tag z captures a salient aspect of the task — topic, strategy, common pitfall. For ALFWorld, these are the built-in task type annotations (Pick, Clean, Heat, Cool, etc.). For reasoning tasks like MATH, tags might be "algebra," "Fourier transformation," or "inequality manipulation."

Step 2: Partition into groups

Based on tag similarity, SkillOS partitions the full training set D into M groups:

D = {G₁, G₂, ..., G_M}, G_m = {x_{m,1}, x_{m,2}, ..., x_{m,|G_m|}}

All tasks within a group share non-trivial skill dependencies — they are the kind of tasks where solving one should help solve the others.
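The text does not say how tag similarity is measured or how the partition is built, so the following is only one plausible sketch: greedy grouping with Jaccard similarity against the first task of each group. The similarity measure, the 0.5 threshold, and the example tags are all assumptions.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two tag sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def group_tasks(tags: dict[str, set[str]], threshold: float = 0.5) -> list[list[str]]:
    """Greedily assign each task to the first group whose seed task is similar enough."""
    groups: list[list[str]] = []
    for task, task_tags in tags.items():
        for group in groups:
            seed_tags = tags[group[0]]
            if jaccard(task_tags, seed_tags) >= threshold:
                group.append(task)
                break
        else:
            # No existing group matched, so this task seeds a new one.
            groups.append([task])
    return groups


# Hypothetical tag sets Z_i for four tasks.
tags = {
    "heat egg": {"heat", "microwave"},
    "heat mug": {"heat", "microwave"},
    "clean plate": {"clean", "sinkbasin"},
    "heat apple": {"heat", "microwave", "fridge"},
}
groups = group_tasks(tags)
print(groups)
```

The three heating tasks end up together even though "heat apple" carries an extra tag, while the cleaning task forms its own group.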

Step 3: Train on groups

Each training step samples one group G_m and starts with an empty SkillRepo. The system then iterates through the group's tasks sequentially:

Task 1: Executor solves it (no skills available). Curator processes the trajectory and inserts initial skills.
Task 2: Executor retrieves skills from the now-populated SkillRepo. Curator observes the trajectory and may insert more, update existing skills, or delete unhelpful ones.
...
Task |G|: Executor benefits from all accumulated skills. This task's success/failure is the strongest signal about curation quality.

The first task in each group always uses an empty SkillRepo, so its outcome is independent of curation. The task outcome reward is therefore computed only over tasks 2 through |G|:

r_task = (1 / (|G| - 1)) ∑_{i=2}^{|G|} 1(ξ_i)

This is the core trick: by grouping related tasks, the paper creates a within-group feedback loop where earlier curation decisions are evaluated by later task outcomes. The curator learns to write skills that help on future related tasks, not just skills that describe the current task.
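As a small worked instance of the formula above, the reward is the mean success indicator over every task except the first. The function name is hypothetical.

```python
def task_outcome_reward(successes: list[bool]) -> float:
    """Mean success over tasks 2..|G|; task 1 ran with an empty SkillRepo."""
    if len(successes) < 2:
        raise ValueError("a group needs at least two tasks")
    evaluated = successes[1:]  # drop task 1
    return sum(evaluated) / len(evaluated)


# Four tasks: task 1 is ignored, tasks 2 and 4 succeed, task 3 fails.
r_task = task_outcome_reward([False, True, False, True])
print(round(r_task, 2))
```

Whether task 1 succeeds or fails has no effect on the result, which matches the reasoning that its outcome cannot depend on curation.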

Why grouping matters: Without grouping (random task order), there is no guarantee that a skill extracted from Task A will ever be relevant to any future task in the training sequence. Grouping ensures that each training instance contains multiple tasks that share skill dependencies, providing dense, relevant feedback for curation decisions. Ablations show that removing grouping drops ALFWorld success from 61.2% to 57.3%.

Grouped Task Streams

Related tasks are clustered into groups. Early tasks (darker) generate skills; later tasks (lighter) evaluate them.

Contrast with prior work

Prior RL-based skill methods like ARISE and UMEM train on short task streams — often just 2 adjacent tasks. This limits the density of feedback: the curator only sees whether a skill helped the immediately next task. SkillOS's longer grouped streams (|G| = 4-8 tasks) expose the curator to multi-hop skill evolution, where:

Skill A is inserted after Task 1
Skill A is retrieved and fails on Task 3, so the curator updates it
The updated Skill A' succeeds on Task 5

This multi-step feedback arc (insert, fail, update, succeed) cannot be learned from 2-task windows. Grouped streams provide the trajectory length needed to learn update and delete behaviors.

Why does SkillOS exclude Task 1 from the task outcome reward?

Because Task 1 uses an empty SkillRepo, so its outcome is independent of curation quality
Because Task 1 is always easy
Because the curator doesn't run on Task 1

Chapter 5: Composite Rewards

Grouped task streams provide the structure for learning curation. But we also need the right reward signal. A single "did the downstream task succeed?" reward is too sparse — the curator makes dozens of micro-decisions (which section to write, how verbose to be, whether to include "When NOT to Use") and needs finer-grained feedback.

SkillOS addresses this with a composite reward that combines four signals, each targeting a different failure mode:

r = r_task + λ_f · r_fc + λ_u · r_cnt + λ_c · r_comp

With weights λ_f = 1.0, λ_u = 0.1, λ_c = 0.05. Let's examine each component.

1. Task outcome reward (r_task)

The primary signal. Average success rate over evaluation tasks (tasks 2 through |G|):

r_task = (1 / (|G| - 1)) ∑_{i=2}^{|G|} 1(ξ_i)

What it catches: The overall quality of the curated SkillRepo. If the skills are good, downstream tasks succeed more often.

What it misses: Everything about HOW the curator produced those skills. A curator that writes valid, well-structured skills that happen not to be relevant to the evaluation tasks earns nothing extra from r_task, so the task signal alone cannot distinguish that work from writing nothing at all. We need additional signals to guide learning when task outcomes are uninformative.

2. Function call validity reward (r_fc)

Measures whether the curator produces valid, executable function calls:

r_fc = (1 / |G|) ∑_{i=1}^{|G|} Valid(c_i)

Where Valid(c_i) is the fraction of function calls in curation decision c_i that parse correctly and execute successfully. An insert_skill call that references a malformed filename, or an update_skill call targeting a non-existent file, gets a score of 0.

What it catches: Formatting errors, hallucinated filenames, invalid JSON. Without this signal, the curator might spend many early training steps producing outputs that fail to execute at all.
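A sketch of what Valid(c_i) might compute for one curation decision: the fraction of raw call strings that parse as JSON and could execute against the current set of skill names. The call schema, the rule that inserting an existing name is invalid, and the score of 0.0 for an empty decision are assumptions, since the source does not spell them out.

```python
import json

KNOWN_OPS = {"insert_skill", "update_skill", "delete_skill"}


def validity(raw_calls: list[str], existing: set[str]) -> float:
    """Fraction of raw call strings that parse and could execute."""
    if not raw_calls:
        return 0.0  # assumption: an empty decision earns no validity credit
    names = set(existing)
    valid = 0
    for raw in raw_calls:
        try:
            call = json.loads(raw)
            op, name = call["op"], call["name"]
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # malformed JSON or missing fields
        if op not in KNOWN_OPS:
            continue
        needs_content = op in {"insert_skill", "update_skill"}
        if needs_content and "content" not in call:
            continue
        if op == "insert_skill":
            if name in names:
                continue  # file already exists
            names.add(name)
        elif name not in names:
            continue  # update or delete of a non-existent file
        elif op == "delete_skill":
            names.discard(name)
        valid += 1
    return valid / len(raw_calls)


calls = [
    '{"op": "insert_skill", "name": "heat-object", "content": "# Workflow"}',
    '{"op": "update_skill", "name": "missing-skill", "content": "# Workflow"}',
    '{"op": "insert_skill", "name": "broken"',  # truncated JSON
    '{"op": "delete_skill", "name": "heat-object"}',
]
score = validity(calls, existing=set())
print(score)
```

Two of the four calls are valid here: the first insert and the final delete. The update targets a file that does not exist, and the third string is not parseable JSON.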

3. Content quality reward (r_cnt)

Uses an external judge (Qwen3-32B) to evaluate whether curated skills are semantically meaningful and likely useful:

r_cnt = (1 / |G|) ∑_{i=1}^{|G|} Judge(c_i)

What it catches: Low-quality content. A skill that just copies the raw trajectory verbatim gets a low judge score. A skill that extracts a clean, generalizable workflow gets a high one. This intermediate supervision is critical in a pipelined system where the curator never directly sees downstream task outcomes.

Ablating r_cnt drops ALFWorld success from 61.2% to 58.6% — the largest drop among the auxiliary rewards.

4. Compression reward (r_comp)

Discourages verbatim trajectory copying by rewarding concise repository updates:

r_comp = (1 / |G|) ∑_{i=1}^{|G|} (1 - |S_i| / |χ_i|)

Where |S_i| is the token length of the SkillRepo after applying operations at step i, and |χ_i| is the token length of the curator's input context. If the skills are shorter than the input (good — we compressed), the reward is positive. If the skills are longer than the input (bad — we're storing raw trajectories), the reward is negative.

What it catches: Bloated repositories. An important failure mode is the curator copying entire trajectories into skill files instead of distilling them into concise instructions. The compression reward explicitly penalizes this.
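The sign behavior of the compression reward is easy to check numerically. The sketch below takes token counts as plain integers. How tokens are counted is left open, and the example lengths are made up for illustration.

```python
def compression_reward(repo_tokens: list[int], context_tokens: list[int]) -> float:
    """Average of 1 - |S_i| / |chi_i| over the curation steps of one group."""
    if len(repo_tokens) != len(context_tokens) or not repo_tokens:
        raise ValueError("need one repo length and one context length per step")
    terms = [1 - s / c for s, c in zip(repo_tokens, context_tokens)]
    return sum(terms) / len(terms)


# Concise repo: 200 and 300 tokens against 1000-token curator inputs.
concise = compression_reward([200, 300], [1000, 1000])
# Bloated repo: longer than the curator input at both steps.
bloated = compression_reward([1500, 2400], [1000, 1200])
print(round(concise, 2), round(bloated, 2))
```

The concise repository averages 0.75, while the bloated one averages -0.75, so verbatim copying is actively penalized rather than merely unrewarded.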

Each reward handles a different failure mode: r_task catches skills that fail to help downstream tasks, r_fc catches formatting errors, r_cnt catches low-quality or trivial skills, and r_comp catches bloated repositories. Together, they turn a sparse, delayed signal into dense, multi-faceted supervision.
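Putting the four components together is a weighted sum. The sketch below uses the weights stated in this chapter. The component values passed in are illustrative, not results from the paper.

```python
# Weights as reported in the text.
LAMBDA_F, LAMBDA_U, LAMBDA_C = 1.0, 0.1, 0.05


def composite_reward(r_task: float, r_fc: float, r_cnt: float, r_comp: float) -> float:
    """r = r_task + lambda_f * r_fc + lambda_u * r_cnt + lambda_c * r_comp."""
    return r_task + LAMBDA_F * r_fc + LAMBDA_U * r_cnt + LAMBDA_C * r_comp


# Illustrative values: 0.70 + 0.90 + 0.06 + 0.025
total = composite_reward(r_task=0.70, r_fc=0.90, r_cnt=0.60, r_comp=0.50)
print(round(total, 3))
```

With these weights the content and compression terms together contribute less than 0.1 to the total, so they act as tie-breakers between rollouts with similar task outcomes and call validity.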

Composite Reward Breakdown

The sliders show how each reward component contributes to the total under the paper's weights λ_f = 1.0, λ_u = 0.1, λ_c = 0.05. Default values: r_task = 0.70, r_fc = 0.90, r_cnt = 0.60, r_comp = 0.50.

Weight tuning

The paper sets λ_f = 1.0 (function call validity weighted equally with task outcome), λ_u = 0.1 (content quality is a soft guide, not a hard constraint), and λ_c = 0.05 (compression is a gentle nudge). This weighting makes sense: task outcome is the ground truth, function calls must be valid for anything to work, content quality is informative but subjective, and compression is a nice-to-have.

The ablation study shows that removing which auxiliary reward causes the largest performance drop on ALFWorld?

Content quality reward (r_cnt): drops from 61.2% to 58.6%
Compression reward (r_comp): drops from 61.2% to 60.0%
Function call reward (r_fc)

Chapter 6: GRPO Training

Now we have the training structure (grouped task streams) and the reward signal (composite reward). How do we actually optimize the curator policy? SkillOS uses Group Relative Policy Optimization (GRPO), an RL algorithm originally developed for DeepSeek-Math.

Why GRPO?

Standard policy gradient methods like PPO require a separate critic network — a value function that estimates expected future reward from each state. Training a critic for skill curation is problematic because:

The state space is enormous (entire SkillRepo contents + trajectory)
The reward is delayed across multiple tasks in the group
Critic architecture and training schedule become additional hyperparameters

GRPO eliminates the critic entirely. Instead, it estimates advantages by comparing multiple rollouts of the same task group against each other.

How GRPO works

For each task group G, SkillOS samples N independent rollouts from the curator policy. Each rollout produces a different sequence of curation decisions, which leads to a different SkillRepo evolution, which leads to different executor outcomes. This gives N composite reward values {r₁, r₂, ..., r_N}.

The advantage for rollout n is simply:

A_n = r_n - (1/N) ∑_{n'=1}^{N} r_{n'}

That is: "How much better (or worse) was this rollout compared to the average?" No critic, no value function, just relative comparison within the group.
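The mean-centered advantage takes a few lines to compute. The sketch follows the formula as written in this chapter. The original GRPO formulation additionally divides by the group's standard deviation, which is omitted here to match the text. The three reward values are the rollout totals from this chapter's worked example.

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each rollout: its reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


advantages = group_relative_advantages([2.03, 1.58, 0.38])
print([round(a, 2) for a in advantages])
```

By construction the advantages sum to zero, so within every group some rollouts are reinforced and others are suppressed, regardless of how high or low the absolute rewards are.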

The policy is then updated with a clipped surrogate objective (same as PPO's clipping):

L = E_n[ min(ρ_n A_n, clip(ρ_n, 1-ε, 1+ε) A_n) ]

Where ρ_n = π_S(cⁿ | χ) / π_θold(cⁿ | χ) is the importance ratio between the current and old policy. The clipping prevents the policy from changing too drastically in one step.
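The effect of clipping is easiest to see with plain numbers. This sketch evaluates the objective at the rollout level with scalar importance ratios. Practical implementations typically work per token on log-probabilities. The value ε = 0.2 is an assumed default, since the text does not state the one used.

```python
def clipped_surrogate(ratios: list[float], advantages: list[float], eps: float = 0.2) -> float:
    """Mean over rollouts of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped = max(1 - eps, min(rho, 1 + eps))
        terms.append(min(rho * adv, clipped * adv))
    return sum(terms) / len(terms)


# Rollout 1: good outcome, policy already moved far toward it (rho = 1.5).
# Rollout 2: bad outcome, policy already moved far away from it (rho = 0.5).
objective = clipped_surrogate(ratios=[1.5, 0.5], advantages=[1.0, -1.0])
print(round(objective, 3))
```

In both rollouts the clipped term is selected (1.2 instead of 1.5, and -0.8 instead of -0.5), so the objective stops rewarding further movement in a direction the policy has already moved more than ε.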

No KL penalty: Unlike standard GRPO implementations, SkillOS discards the KL divergence term that penalizes deviation from the reference policy. This is deliberate: the authors want to encourage exploration. Early in training, the curator needs to experiment with update and delete operations it has never tried before. A KL penalty would anchor it to the insertion-heavy initial behavior.

Training details

Hyperparameter	Value
Base model for π_S	Qwen3-8B
Executor during training	Qwen3-8B (frozen)
Learning rate	1 × 10^-6
Batch size	32 (task groups per batch)
Group size N (rollouts per group)	8
Hardware	16 × H100 GPUs
Training time (ALFWorld)	~3 days
Training time (WebShop)	~5 days
Training time (Reasoning)	~2.5 days
Framework	verl (HybridFlow)

The training trajectory

The paper provides a fascinating view of how the curator evolves during training (Figure 4). The operation distribution tells the story:

Early training (steps 1-10): Insert dominates at ~95%. The curator knows only one move: "See trajectory, write new skill." This is the naive behavior — pure expansion.

Mid training (steps 10-30): Update grows to ~25%. The curator learns that revising existing skills is more valuable than creating new ones. It starts recognizing when an existing skill almost matches but needs refinement.

Late training (steps 30+): Delete appears at ~5-8%. The curator learns to prune redundant or harmful skills. The SkillRepo becomes more curated, not just larger.

GRPO Training: Advantage from Group Comparisons

N rollouts of the same task group produce different rewards. Advantages are computed relative to the group mean. The figure defaults to N = 8 rollouts.

Worked example: one training step

Let's trace a single training step. The batch samples group G = {Heat Egg, Heat Mug, Heat Apple, Heat Potato}. SkillRepo starts empty.

Rollout 1: After Task 1, curator inserts "Heating objects workflow" skill. Tasks 2-4 all succeed using this skill. r_task = 1.0. Total r = 1.0 + 0.92 + 0.07 + 0.04 = 2.03.

Rollout 2: After Task 1, curator inserts a very verbose skill (copies entire trajectory). Task 2 succeeds but slowly (executor confused by long skill). Task 3 fails. Task 4 succeeds. r_task = 0.67. Compression reward low (0.2). Total r = 0.67 + 0.85 + 0.05 + 0.01 = 1.58.

Rollout 3: Curator produces invalid JSON for the insert call. No skills are added. Tasks 2-4 run without skills. r_task = 0.33. r_fc = 0. Total r = 0.33 + 0 + 0 + 0.05 = 0.38.

Mean reward: (2.03 + 1.58 + 0.38) / 3 = 1.33. Advantages: A₁ = +0.70, A₂ = +0.25, A₃ = -0.95. GRPO reinforces Rollout 1's behavior and suppresses Rollout 3's.

What is the key advantage of GRPO over PPO for training the skill curator?

GRPO eliminates the need for a separate critic/value network by computing advantages from group comparisons
GRPO trains faster on fewer GPUs
GRPO can handle continuous action spaces

Chapter 7: Results

SkillOS is evaluated across three benchmark categories with multiple executor backbones. The results tell a consistent story: trained curation beats both no-memory baselines and heuristic-based memory systems.

ALFWorld: Multi-turn household tasks

ALFWorld is a text-based environment where agents navigate rooms, manipulate objects, and complete household tasks ("Put a heated egg on the counter," "Examine the mug under desklamp"). There are 6 task subtypes: Pick, Look, Clean, Heat, Cool, and Pick2. Results are reported as success rate (SR) and average interaction steps.

With Qwen3-8B as executor:

Method	Avg SR (%)	Steps
No Memory	47.9	21.1
ReasoningBank	55.7	20.1
MemP	49.7	21.0
SkillOS-base (no RL)	53.1	20.4
SkillOS-gemini (Gemini curator)	50.7	20.8
SkillOS	61.2	18.9

Three things stand out. First, SkillOS beats the strongest baseline (ReasoningBank) by +5.5 absolute points. Second, SkillOS reduces interaction steps from 21.1 to 18.9 — the agent is not just more successful, it is faster. Third, the RL-trained 8B curator outperforms Gemini-2.5-Pro used directly as curator (SkillOS-gemini: 50.7%). A small, targeted model beats a frontier model at this specific skill.

8B beats Gemini-2.5-Pro: SkillOS-gemini uses Gemini-2.5-Pro as the curator — a frontier model with far more parameters and reasoning capability. Yet SkillOS's trained 8B curator scores 61.2% vs. 50.7%. This demonstrates that targeted RL training on the specific curation task outweighs raw model scale. The frontier model writes plausible-looking skills, but they may not match what the executor actually needs.

WebShop: Online shopping tasks

WebShop simulates an online shopping environment. The agent navigates a web interface to find and purchase products matching user specifications. Metrics: score, success rate (SR), and interaction steps.

Method	Score	SR (%)	Steps
No Memory	33.3	9.8	20.3
ReasoningBank	35.4	11.4	20.5
SkillOS-base	38.6	13.6	20.1
SkillOS	40.6	16.5	19.4

SkillOS improves SR from 9.8% (no memory) to 16.5%, a 68% relative improvement. The gains persist with stronger executors, though the margin is smaller: with Gemini-2.5-Pro as executor, SkillOS reaches 41.3% SR vs. 38.4% for no memory (+2.9 points).

Reasoning tasks: AIME24, AIME25, GPQA-Diamond

Single-turn reasoning tasks show more modest gains, but SkillOS still improves consistently:

Method	AIME24	AIME25	GPQA	Avg
No Memory	76.0	71.1	61.8	69.6
ReasoningBank	75.4	73.2	60.3	69.6
SkillOS	80.0	76.7	64.6	73.8

The gains are smaller (+4.2 average accuracy) because reasoning tasks benefit from more abstract skill types (decomposition heuristics, verification patterns) that are harder to capture in procedural skills. Still, SkillOS is the only method that consistently improves over no-memory.

Cross-executor and cross-task transfer

A crucial test: does a curator trained with Qwen3-8B executor transfer to different executors? Yes. SkillOS lifts Gemini-2.5-Pro's ALFWorld SR from 66.4% to 80.2% — a +13.8 improvement, even though the curator never saw this executor during training.

Cross-task transfer (Figure 3 in the paper) also works: a curator trained on reasoning tasks improves ALFWorld performance by +13.3 with Qwen3-8B executor. The reasoning-trained curator learns abstract strategies (decomposition, verification, adaptive planning) that transfer to agentic tasks.

Results Dashboard

Performance comparison across methods and benchmarks.

When SkillOS's 8B curator (trained with a Qwen3-8B executor) is paired with a Gemini-2.5-Pro executor, what happens?

Performance degrades because of curator-executor mismatch
Performance improves: the curator generalizes across different executor backbones
Performance stays the same as the no-memory baseline

Chapter 8: Emergent Skills

The most fascinating finding in the paper is not the performance numbers — it is what happens inside the SkillRepo as training progresses. The curator does not just get better at inserting skills. It develops an entirely new organizational structure that was never explicitly programmed.

New Markdown sections emerge

The skill format suggests three sections: Workflow, When NOT to Use, and Prerequisites. But SkillOS's trained curator creates additional sections that were never specified. Figure 5(a) in the paper tracks these emergent sections across training:

Early training: The curator adds generic sections — "Additional Guidance," "Tips and Recommendations," "Enhancement." These are verbose and add little operational value. They are the model's default verbosity patterns.

Late training: The sections become execution-oriented:

"Failure & Error Handling" — What to do when the standard workflow fails
"Retry Logic" — When and how to retry with different parameters
"Special Considerations" — Edge cases that require different approaches
"Alternative Approaches" — Backup strategies when the primary workflow is blocked

RL gradually steered the curator from superficial enrichment toward execution-oriented skill refinement. The curator learned — through trial and error — that a "Retry Logic" section makes the executor more robust, while a "Tips" section just adds noise.
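One way to track emergent sections is to parse each skill file and list the headings that fall outside the three suggested ones. The sketch below is illustrative only: the skill text is hypothetical, and the assumption that sections are written as level-2 (`##`) Markdown headings is not confirmed by the paper.

```python
import re

# The three sections the skill format suggests, per the text above.
SUGGESTED = {"Workflow", "When NOT to Use", "Prerequisites"}

# Hypothetical skill file; heading level and wording are assumptions.
skill_md = """\
# Heat objects using microwave
## Prerequisites
Object is in inventory.
## Workflow
Go to the microwave, put the object in, turn it on.
## When NOT to Use
The object is not heatable.
## Retry Logic
If the microwave is closed, open it and retry.
"""


def section_titles(markdown: str) -> list[str]:
    """Return the titles of all level-2 ('## ') headings, in document order."""
    return re.findall(r"^##\s+(.+?)\s*$", markdown, flags=re.MULTILINE)


def emergent_sections(markdown: str) -> list[str]:
    """Headings the curator added beyond the suggested three."""
    return [title for title in section_titles(markdown) if title not in SUGGESTED]


print(emergent_sections(skill_md))
```

Counting these headings across repo snapshots at different training steps would give a rough analogue of the trend described above.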

Meta-skills emerge

Even more remarkable: the SkillRepo develops skills about skills. Figure 5(b) tracks the evolution of skill categories:

Early SkillRepo: Dominated by narrow, task-specific skills. "How to heat an egg." "How to clean a mug." Each skill covers exactly one task variant.

Late SkillRepo: A diverse mix including meta-strategy skills:

State verification skills (dominate at >50%) — "Always verify the object is in your inventory before attempting an action"
Systematic search skills (~13%) — "When the target object is not found, search containers systematically"
Failure recovery skills (~9%) — "If an action fails, try the alternative location"
Generic action skills (~8%) — Reusable action patterns that apply across task types

The task-specific skills (e.g., "task-object specific," "task-location specific") shrink from dominating the repo to occupying less than 30%. The curator discovers that abstract, compositional skills are more valuable than narrow, task-specific ones.

Organizational discovery: Nobody told the curator to create meta-skills. Nobody specified categories like "state verification" or "failure recovery." The curator discovered through RL that a repo organized around reusable strategies outperforms one organized around specific tasks. This is emergent structure from reward optimization.

Skill utilization becomes more targeted

Figure 6 in the paper compares skill usage statistics between SkillOS-base and SkillOS:

Metric	SkillOS-base	SkillOS
Skill usage rate	87.9%	100%
Successful skill usage rate	53.6%	61.2%
Skill coverage	72.9%	88.6%
Avg skills per example	2.24	1.95

SkillOS invokes skills on 100% of evaluation examples (vs. 87.9% for the base) and succeeds more often when it does (61.2% vs. 53.6%). Crucially, it uses fewer skills per example (1.95 vs. 2.24) while covering more of the repo (88.6% vs. 72.9%). The trained curator produces skills that are more precisely targeted: less noise, more signal.
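To make the four metrics concrete, here is a sketch of how they might be computed from an evaluation log. The definitions are assumptions made for illustration (the paper's exact definitions may differ), and the `Episode` type and toy data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    skills_used: list[str]  # titles of skills the executor invoked
    success: bool           # whether the task was solved


def utilization(episodes: list[Episode], repo: list[str]) -> dict[str, float]:
    """Assumed definitions:
    usage_rate: share of episodes that invoke at least one skill
    successful_usage_rate: share of skill-using episodes that succeed
    coverage: share of repo skills invoked at least once
    avg_skills_per_example: mean skills invoked per skill-using episode
    """
    used = [e for e in episodes if e.skills_used]
    invoked = {s for e in episodes for s in e.skills_used}
    return {
        "usage_rate": len(used) / len(episodes),
        "successful_usage_rate": sum(e.success for e in used) / len(used) if used else 0.0,
        "coverage": len(invoked & set(repo)) / len(repo),
        "avg_skills_per_example": sum(len(e.skills_used) for e in used) / len(used) if used else 0.0,
    }


# Toy log: four episodes against a four-skill repo.
repo = ["verify_state", "systematic_search", "failure_recovery", "heat_object"]
episodes = [
    Episode(["verify_state", "systematic_search"], True),
    Episode(["verify_state"], False),
    Episode([], True),
    Episode(["failure_recovery"], True),
]
stats = utilization(episodes, repo)
print(stats)
```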

SkillRepo Evolution

How the SkillRepo changes over training steps: skills appear, merge, specialize, and develop meta-structure.

A concrete example: skill merging

Early in training, the curator might create separate skills for "Heat egg in microwave" and "Heat mug in microwave." Roughly 80% of their content is identical: both involve finding the microwave, putting the object in, and turning it on. Only the object name differs.

After RL training, the curator learns to create a single "Heat objects using microwave" skill with a scope note and a precondition: "Works for any heatable object (egg, mug, apple, potato). Verify object is picked up before approaching microwave." This merged skill is more compact, more retrievable (it matches more queries), and easier for the executor to follow.

The compression reward (r_comp) nudges this behavior, but it is the task outcome reward (r_task) that truly drives it: a merged skill that covers four task variants produces better outcomes than four fragmented skills that might or might not be retrieved.
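The overlap between two near-duplicate skills can be measured with a plain sequence diff. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in. It is not how the curator decides to merge (that behavior is learned through RL), and the two skill texts are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical skill bodies that differ only in the object name.
egg_skill = ("Find the egg. Pick up the egg. Go to the microwave. "
             "Put the egg in the microwave. Turn on the microwave.")
mug_skill = egg_skill.replace("egg", "mug")


def tokens(text: str) -> list[str]:
    """Split into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


a, b = tokens(egg_skill), tokens(mug_skill)
matcher = SequenceMatcher(None, a, b)
similarity = matcher.ratio()  # 1.0 means identical token sequences

# Collect the token spans that differ between the two skills.
differences = {
    (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
}

print(f"token-level similarity: {similarity:.2f}")
print("differing slots:", differences)
```

When the only differing slot is the object name, the two skills are natural candidates for a single parameterized skill like the merged one described above.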

What is the most common type of skill in a mature SkillOS SkillRepo?

Task-specific skills for individual objects or locations
State verification skills: abstract strategies that apply across task types
Raw trajectory copies

Chapter 9: Connections

SkillOS sits at the intersection of several active research areas. Let's map where it fits and what comes next.

The self-evolving agent landscape

System	Memory Type	Curation Method	Key Difference from SkillOS
ReAct	None (stateless)	N/A	No memory at all — pure reasoning + acting
Reflexion	Text reflections	Prompted LLM	Stores verbal self-critiques, not reusable skills
ReasoningBank	Distilled insights	Prompted LLM	No RL training — heuristic curation only
MemP	Procedural memory	Heuristic operations	Fixed rules for memory management
Anthropic Skills	SKILL.md folders	Manual curation	Human-written skills — no automation
ARISE	Skill library	RL (retrieval + use)	Trains retrieval + execution, heuristic management
SkillRL / D2Skill	Pre-curated skills	RL (use only)	Trains agents to use skills, not to curate them
SkillOS	Markdown skills	RL (curation)	First to train curation end-to-end via long-horizon RL

SkillOS's unique position: All prior systems either (a) use heuristic curation, (b) train skill usage but not curation, or (c) train on short-horizon feedback. SkillOS is the first to train the curation policy itself with long-horizon, executor-grounded RL feedback.

Limitations and open questions

Retrieval bottleneck. SkillOS uses BM25 for skill retrieval, a lexical matching algorithm that scores only exact term overlap and so cannot capture semantic similarity. A skill titled "Systematic Container Search" shares no terms with a query about "finding hidden objects," so it would not be retrieved unless the skill body happened to contain those words. Dense retrieval (embedding-based) could significantly improve skill utilization.
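The lexical mismatch is easy to reproduce with a small BM25 scorer. The sketch below is a self-contained implementation of a common Okapi BM25 variant. The `k1` and `b` values, the whitespace tokenizer, and indexing titles only are assumptions made for illustration, not details of SkillOS's retriever.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no exact term overlap, no contribution
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


skill_titles = ["Systematic Container Search", "Heat objects using microwave"]
miss = bm25_scores("finding hidden objects", skill_titles)
hit = bm25_scores("container search", skill_titles)
print(miss)
print(hit)
```

In this toy index the query "finding hidden objects" gives the relevant search skill a score of zero, while the unrelated microwave skill scores above zero because it happens to contain the word "objects".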

Fixed skill format. Skills are always single Markdown files. More complex formats — nested folder structures like Anthropic's full SKILL.md spec, or programmatic skills with executable code — could encode richer knowledge.

Training cost. 3-5 days on 16 H100 GPUs is substantial. Each training step requires rolling out entire task groups through the executor, which involves multiple LLM inference calls. More efficient training methods (offline RL, distillation from larger curators) could reduce this cost.

Catastrophic forgetting. The paper does not explore what happens when the task distribution shifts. A curator trained on ALFWorld heating tasks might produce inappropriate skills when suddenly faced with navigation tasks. Continual learning of the curator is an open problem.

The bigger picture: what this enables

SkillOS demonstrates a powerful principle: you can train a small model to be an excellent specialist. An 8B model trained specifically for skill curation outperforms a frontier model doing the same task zero-shot. This suggests a future architecture for AI agents:

Large executor — a frontier model that handles diverse tasks
Small, specialized curator — an RL-trained model that manages the executor's skill library
Shared skill repository — the memory substrate that mediates between them

This multi-agent modular design mirrors how human organizations work: senior engineers solve problems, technical writers maintain the knowledge base, and everyone benefits from shared documentation.
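The three-component design can be sketched as a simple loop. Everything below is hypothetical scaffolding: the `SkillRepo` class, the callable signatures, and the toy executor and curator are stand-ins for illustration, and the word-overlap retrieval is a placeholder rather than BM25.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SkillRepo:
    """Shared memory substrate: skill title -> Markdown body."""
    skills: dict[str, str] = field(default_factory=dict)

    def retrieve(self, task: str) -> list[str]:
        # Placeholder retrieval: skills whose title shares a word with the task.
        words = set(task.lower().split())
        return [body for title, body in self.skills.items()
                if words & set(title.lower().split())]


# (task, retrieved skills) -> trajectory text
Executor = Callable[[str, list[str]], str]
# (task, trajectory, repo) -> None; edits the repo in place
Curator = Callable[[str, str, SkillRepo], None]


def run_stream(tasks: list[str], executor: Executor, curator: Curator,
               repo: SkillRepo) -> SkillRepo:
    """Executor solves each task with retrieved skills; curator then updates the repo."""
    for task in tasks:
        trajectory = executor(task, repo.retrieve(task))
        curator(task, trajectory, repo)
    return repo


def toy_executor(task: str, skills: list[str]) -> str:
    return f"solved '{task}' using {len(skills)} skill(s)"


def toy_curator(task: str, trajectory: str, repo: SkillRepo) -> None:
    repo.skills.setdefault(task, f"## Workflow\n{trajectory}")


final_repo = run_stream(["heat egg", "heat mug"], toy_executor, toy_curator, SkillRepo())
print(final_repo.skills)
```

Keeping the executor and curator behind plain callables reflects the modularity argument: either component can be swapped without touching the other, as long as both read and write the same repository.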

The Self-Evolving Agent Landscape

Where SkillOS fits among memory-based agent systems. Axes: curation automation (x) and feedback horizon (y).

SkillOS: Learning Skill Curation for Self-Evolving Agents