LLM agents solve tasks one at a time, throwing away everything they learned. SkillOS trains a dedicated curator to build, refine, and prune a reusable skill library via RL — turning one-off problem solvers into agents that get better with every interaction.
Imagine you're a new hire at a help desk. Your first week, a customer asks how to reset their password. You figure it out from scratch — reading docs, trying things, eventually solving it. Twenty minutes later, another customer asks the same question. You start from scratch again. No notes, no checklist, no memory of what just worked.
That is how virtually every LLM agent operates today. Each task arrives. The agent reasons, takes actions, maybe succeeds. Then the interaction ends, the context window is discarded, and the next task begins with a completely blank slate.
This is not a resource limitation — it is an architectural choice. Current agents are stateless by design. They have no mechanism to extract lessons from Task 1 and apply them to Task 47. Every problem is solved as if it were the very first problem the agent has ever encountered.
This one-off pattern wastes effort in three concrete ways:
Redundant exploration. If an agent discovers that "navigate to desk, then examine mug under desklamp" is the correct strategy for inspection tasks in a household environment, that insight evaporates. The next inspection task triggers the same trial-and-error search. In ALFWorld (a text-based household benchmark), agents without memory take 21.1 interaction steps on average per task. With SkillOS's learned skills, that drops to 18.9 — a 10.4% reduction.
Repeated failures. Worse than re-exploring, stateless agents repeat the same mistakes. If a particular approach fails on Task 5, nothing prevents the agent from trying the identical approach on Task 32. There is no procedural memory to encode "when NOT to do X."
No capability growth. A human help desk agent after 1,000 tickets is dramatically better than after their first. A stateless LLM agent after 1,000 tasks is identical to the one that started. The curve is flat.
Interactive figure: an agent solves a stream of sequential tasks, starting from scratch each time. A blue bar shows current knowledge and resets to zero between tasks.
The canvas shows the tragedy clearly. Every colored block is exploration effort. Every time the bar resets to zero, that effort is lost. A self-evolving agent would carry forward the insight from each task, building a rising curve of capability instead of a flat line.
The ideal is an agent that maintains a growing library of reusable skills — Markdown files containing workflows, constraints, and heuristics extracted from past experience. When a new task arrives, the agent retrieves relevant skills from this library and executes more efficiently. After each task, it updates the library with new lessons.
The paper by Ouyang et al. (2026) proposes SkillOS, an RL training recipe that teaches an 8B-parameter model to perform this skill curation — deciding what to insert, what to update, and what to delete from the skill library. The trained curator outperforms even Gemini-2.5-Pro used directly as a curator, demonstrating that targeted training of a small model can beat raw scale.
If we want agents to accumulate experience, we need a format for storing it. Not raw trajectories — those are too long and too specific. Not abstract summaries — those lose the actionable detail. We need something in between: structured, reusable, retrievable.
SkillOS follows a design inspired by Anthropic's SKILL.md format: each skill is a single Markdown file stored in an external repository. The file has two parts:
1. YAML frontmatter — specifies the skill name and a natural-language description of when to use it. This is what the retrieval system matches against.
2. Markdown body — contains the executable knowledge: workflows, constraints, prerequisites, and heuristics. The paper suggests three sections as a starting point, but allows the curator to create additional sections as it learns.
Here is a real skill that SkillOS's trained curator produced for ALFWorld inspection tasks:
```markdown
---
name: Use light source to examine
description: Ensure object is examined under proper light source by navigating to the correct lamp location first
---
# Workflow
1. Navigate to the light source (desklamp, floorlamp) location first
2. Pick up the target object
3. Use the "examine" action with the light source, not the object

# When NOT to Use
- If the light source is not in the current room
- If the object is already being examined

# Prerequisite Constraints
- Agent must have free hands
- Light source must be turned on
```
Notice three critical properties of this format:
Retrievable. The YAML frontmatter contains a description that can be matched against incoming tasks using BM25 (a standard text retrieval algorithm). When a new task says "examine the mug under desklamp," BM25 matches it against "Use light source to examine" and retrieves this skill.
Actionable. The workflow section gives step-by-step instructions the executor can follow directly. This is not abstract wisdom ("inspection is important") — it is a concrete recipe ("navigate to lamp first, then pick up, then examine").
Guarded. The "When NOT to Use" section prevents misapplication. This is crucial: a skill that fires on the wrong task actively hurts performance by misleading the executor.
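To make the two-part layout concrete, here is a minimal sketch of splitting a skill file into frontmatter fields and body. The `Skill` dataclass and `parse_skill` helper are hypothetical names, not part of SkillOS, and the parser assumes flat `key: value` frontmatter lines rather than full YAML.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str
    body: str


def parse_skill(text: str) -> Skill:
    """Split a skill file into frontmatter fields and Markdown body.

    Assumption: the file opens with a '---' line, then flat 'key: value'
    lines, then a closing '---' line, then the body.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening frontmatter delimiter")
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        raise ValueError("missing closing frontmatter delimiter") from None
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:  # skip lines that are not 'key: value'
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return Skill(meta.get("name", ""), meta.get("description", ""), body)


EXAMPLE = """---
name: Use light source to examine
description: Ensure object is examined under proper light source
---
# Workflow
1. Navigate to the light source first
"""

skill = parse_skill(EXAMPLE)
print(skill.name)
print(skill.body.splitlines()[0])
```

Keeping the description separate from the body mirrors the split described above: the frontmatter is what retrieval matches against, and the body is what the executor reads.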
Figure: a structured skill file with its two components, the YAML frontmatter and the Markdown body.
The complete skill collection is called the SkillRepo, denoted $S_t$ at time step $t$. It is simply a set of $N_t$ Markdown files:

$$S_t = \{s_1, s_2, \dots, s_{N_t}\}$$

The SkillRepo starts empty ($S_0 = \{\}$) and grows as the curator processes task trajectories. Three operations modify it:
| Operation | Function Call | Effect |
|---|---|---|
| Insert | insert_skill(name, content) | Creates a new .md file in the repo |
| Update | update_skill(name, content) | Replaces the content of an existing file |
| Delete | delete_skill(name) | Removes a file from the repo |
These are implemented as function calls — the curator generates structured JSON that specifies the operation, the target file, and (for insert/update) the new content. The system executes them against the SkillRepo, exactly like file I/O operations in an operating system. This is where the name "SkillOS" comes from.
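The three operations can be sketched as methods on an in-memory repository, with a small dispatcher that executes JSON function calls. This is an illustration under stated assumptions: the `{"op": ..., "args": ...}` schema, the `apply_ops` helper, and the rule that inserting an existing name fails are all hypothetical, since the article does not spell out the exact call format.

```python
import json


class SkillRepo:
    """In-memory stand-in for a folder of Markdown skill files."""

    def __init__(self):
        self.files = {}  # skill name -> Markdown content

    def insert_skill(self, name, content):
        if name in self.files:
            raise KeyError(f"skill already exists: {name}")
        self.files[name] = content

    def update_skill(self, name, content):
        if name not in self.files:
            raise KeyError(f"no such skill: {name}")
        self.files[name] = content

    def delete_skill(self, name):
        if name not in self.files:
            raise KeyError(f"no such skill: {name}")
        del self.files[name]


def apply_ops(repo, raw_json):
    """Execute a JSON list of calls and return one success flag per call."""
    allowed = {"insert_skill", "update_skill", "delete_skill"}
    results = []
    for call in json.loads(raw_json):
        try:
            if call["op"] not in allowed:
                raise ValueError(f"unknown op: {call['op']}")
            getattr(repo, call["op"])(**call["args"])
            results.append(True)
        except (KeyError, TypeError, ValueError):
            results.append(False)  # malformed or inapplicable call
    return results


repo = SkillRepo()
ops = json.dumps([
    {"op": "insert_skill", "args": {"name": "heat-object", "content": "# Workflow"}},
    {"op": "update_skill", "args": {"name": "missing", "content": "x"}},
    {"op": "delete_skill", "args": {"name": "heat-object"}},
])
flags = apply_ops(repo, ops)
print(flags)
```

Returning a flag per call, rather than raising on the first failure, makes it easy to measure what fraction of a curator's calls actually executed.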
You might think that once we have the skill format and the three operations (insert, update, delete), the problem is solved. Just prompt the LLM: "Given this task trajectory, produce skill operations." Systems like ReasoningBank and MemP do exactly this — they use heuristic rules or prompted LLMs to manage memory.
It does not work well. The fundamental problem is that curation quality has delayed, indirect feedback.
Consider a concrete scenario. The agent just completed Task 12 in ALFWorld — "Put a heated egg on the counter." The curator observes the trajectory and decides to insert a skill about heating objects in the microwave. Was this a good decision?
We cannot know until Task 37, when another heating task arrives and the executor retrieves and applies this skill. If Task 37 succeeds faster because of the skill, the insert was good. If the skill's instructions were slightly wrong and caused the executor to fail, the insert was bad. Either way, the feedback arrives 25 tasks later and is mixed with dozens of other confounding factors.
This is fundamentally different from the executor's learning signal. The executor gets reward immediately: "Did you complete the current task? Yes/No." The curator's reward is: "Did the skill you wrote 25 tasks ago help a different task that happened to need it?" That signal is delayed, sparse, and noisy.
Existing approaches use fixed rules for skill curation. Some examples from the literature:
"Always insert after a successful task." This floods the repo with redundant, overlapping skills. If ten similar tasks all succeed, you get ten nearly-identical skill files that confuse retrieval.
"Delete skills that haven't been used in K tasks." This kills skills that are rare but crucial. A skill for "heating objects in microwave" might only trigger once every 20 tasks, but when it fires, it is essential.
"Update a skill if the executor failed while using it." This conflates skill quality with task difficulty. The executor might have failed because the task was genuinely hard, not because the skill was wrong.
The common thread: heuristics make local decisions without downstream performance feedback. They cannot learn that "inserting a concise skill with a 'When NOT to Use' section leads to 12% higher success on future related tasks" because they never see the downstream outcome.
The paper makes an important empirical observation: untrained curators (SkillOS-base) overwhelmingly choose insert. Figure 4 in the paper shows that at the start of training, insert accounts for nearly 100% of all operations. The curator just blindly adds new skills after every task.
This makes intuitive sense. Insert is the "safe" operation — you are adding information, not destroying it. Update requires judging which part of an existing skill is wrong and how to fix it. Delete requires judging that a skill is actively harmful or redundant — a harder call than "this seems useful."
RL training shifts this distribution dramatically. By the end of training, update operations account for a growing fraction (from ~0% to ~25%), and delete begins appearing as well. The curator learns that skill quality matters more than skill quantity.
Interactive figure: the three curation operations and the failure modes that make each one hard.
SkillOS's key architectural insight is separation of concerns. Instead of training one monolithic model to do everything — solve tasks, extract lessons, manage the skill library — SkillOS splits the system into two independent modules with a shared data structure between them.
1. Agent Executor ($\pi_L$): a frozen LLM that solves tasks. Given a task description $x_t$, the current environment observation $o_t$, and a set of retrieved skills $\tilde{S}_t$, the executor produces actions:

$$a_t \sim \pi_L(\cdot \mid x_t, o_t, \tilde{S}_t)$$
The executor is frozen throughout training — its weights never change. This is deliberate: we want to test whether better skills improve performance, not whether a better executor does. The executor can be any model: Qwen3-8B, Qwen3-32B, or Gemini-2.5-Pro. SkillOS's trained curator generalizes across all of them.
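2. Skill Curator ($\pi_S$): a trainable LLM (Qwen3-8B base) that manages the SkillRepo. After the executor completes a task, the curator observes its input context $\chi_t$: the trajectory $\xi_t$, the self-judged correctness signal $\mathbb{1}_{\xi_t}$, and the retrieved skills $\tilde{S}_t$.
It then generates a sequence of curation operations:

$$c_t = (u_1, u_2, \dots, u_M) \sim \pi_S(\cdot \mid \chi_t)$$

Each $u_m$ is a function call: insert_skill, update_skill, or delete_skill. These are structured JSON outputs that the system executes against the SkillRepo.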
3. SkillRepo (St) — the external skill repository. A collection of Markdown files that grows, changes, and (sometimes) shrinks as the curator operates. Skills are retrieved via BM25 matching against the task description.
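For readers unfamiliar with BM25, the retrieval step can be sketched in a few lines. This is a from-scratch Okapi BM25 with common default parameters (`k1=1.5`, `b=0.75`), not the retriever used in the paper. Indexing the skill name together with its description, and the second skill in the example, are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, dropping surrounding punctuation."""
    tokens = (t.strip('.,:;!?"()') for t in text.lower().split())
    return [t for t in tokens if t]


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document in `docs` for `query`."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    avg_len = sum(len(d) for d in tokenized) / n or 1.0
    df = Counter(term for d in tokenized for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # only terms shared with the query contribute
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query, skills, k=1):
    """Return names of the top-k skills; `skills` maps name -> description."""
    names = list(skills)
    docs = [f"{name} {skills[name]}" for name in names]
    ranked = sorted(zip(bm25_scores(query, docs), names), reverse=True)
    return [name for score, name in ranked[:k] if score > 0]


skills = {
    "Use light source to examine": "Ensure object is examined under proper "
    "light source by navigating to the correct lamp location first",
    "Heat objects using microwave": "Place the object in the microwave and turn it on",
}
top = retrieve("examine the mug under desklamp", skills, k=1)
print(top)
```

Note that the match here comes from exact shared words such as "examine" and "under". There is no stemming, so "examined" and "desklamp" contribute nothing.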
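At deployment, SkillOS processes a stream of tasks $D = \{x_1, x_2, \dots, x_T\}$ sequentially. For each task $x_t$:
1. The BM25 retriever selects the top-k skills $\tilde{S}_t$ from the current SkillRepo $S_t$.
2. The executor solves the task with those skills, producing a trajectory $\xi_t$.
3. A self-judge labels the trajectory as correct or incorrect.
4. The curator reads the trajectory, the judgment, and the retrieved skills, and emits curation operations $c_t$.
5. The operations are applied to $S_t$, yielding $S_{t+1}$.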
This forms a closed loop: the executor's performance depends on the skills the curator produced, and the curator learns from the executor's subsequent performance. The SkillRepo is the shared memory that mediates between them.
Interactive figure: the full SkillOS pipeline processing a stream of tasks, with a no-memory baseline ("Without Curation") for comparison.
Let's trace the exact data flowing through the system for one task:
| Component | Input | Output | Trainable? |
|---|---|---|---|
| BM25 Retriever | Task description $x_t$ + SkillRepo $S_t$ | Top-k skills $\tilde{S}_t$ | No (fixed algorithm) |
| Executor $\pi_L$ | $x_t$ + observations + $\tilde{S}_t$ | Trajectory $\xi_t$ | No (frozen) |
| Self-judge | Trajectory $\xi_t$ | Binary correctness $\mathbb{1}_{\xi_t}$ | No (LLM-as-judge) |
| Curator $\pi_S$ | $\xi_t$ + $\mathbb{1}_{\xi_t}$ + $\tilde{S}_t$ | Operations $c_t$ | Yes (RL-trained) |
| ApplyOps | $S_t$ + $c_t$ | $S_{t+1}$ | No (deterministic) |
Only one component is trainable: the curator. Everything else is fixed. This tight bottleneck means all learning signal flows through a single policy, making optimization tractable.
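The data flow in the table can be sketched as a single loop with each component passed in as a plain function. Everything here is a toy stand-in: `run_stream`, the `toy_*` functions, and the flat operation dictionaries are hypothetical, and a real system would call LLMs where these stubs do string matching.

```python
from typing import Callable, Dict, List

Repo = Dict[str, str]  # skill name -> Markdown content


def run_stream(
    tasks: List[str],
    retrieve: Callable[[str, Repo], List[str]],
    execute: Callable[[str, List[str]], str],
    judge: Callable[[str], bool],
    curate: Callable[[str, bool, List[str]], List[dict]],
) -> List[bool]:
    repo: Repo = {}
    outcomes = []
    for task in tasks:
        skills = retrieve(task, repo)              # retrieval step
        trajectory = execute(task, skills)         # frozen executor
        success = judge(trajectory)                # self-judged correctness
        ops = curate(trajectory, success, skills)  # the only trained component
        for op in ops:                             # deterministic ApplyOps
            if op["op"] == "delete_skill":
                repo.pop(op["name"], None)
            else:  # insert_skill or update_skill
                repo[op["name"]] = op["content"]
        outcomes.append(success)
    return outcomes


def toy_retrieve(task, repo):
    return [text for name, text in repo.items() if name in task]


def toy_execute(task, skills):
    return f"{task} | {'solved' if skills else 'failed'}"


def toy_judge(trajectory):
    return trajectory.endswith("solved")


def toy_curate(trajectory, success, skills):
    if skills:
        return []  # a skill already covered this task
    keyword = trajectory.split()[0]
    return [{"op": "insert_skill", "name": keyword, "content": f"How to {keyword}"}]


outcomes = run_stream(
    ["heat egg", "heat mug", "cool apple"],
    toy_retrieve, toy_execute, toy_judge, toy_curate,
)
print(outcomes)
```

In this toy run the second task succeeds only because the curator wrote a "heat" skill after the first one, which is the closed loop the article describes.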
We established that curation feedback is delayed and indirect. The curator inserts a skill now, and its value is revealed only when a related future task benefits from it. So how do we construct training data that provides this feedback?
SkillOS's first key design: grouped task streams. Instead of training on random sequences of tasks, SkillOS groups related tasks together and trains on entire groups as single instances.
For each task $x_i$ in the training set, SkillOS uses Gemini-2.5-Pro to produce a set of tags.
Each tag z captures a salient aspect of the task — topic, strategy, common pitfall. For ALFWorld, these are the built-in task type annotations (Pick, Clean, Heat, Cool, etc.). For reasoning tasks like MATH, tags might be "algebra," "Fourier transformation," or "inequality manipulation."
Based on tag similarity, SkillOS partitions the full training set $D$ into $M$ groups $G_1, G_2, \dots, G_M$.
All tasks within a group share non-trivial skill dependencies — they are the kind of tasks where solving one should help solve the others.
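The article does not specify how tag similarity is turned into groups, so the following is only one plausible sketch: a greedy partition using Jaccard similarity between tag sets. The `group_by_tags` helper, the 0.5 threshold, and the example tags are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two tag sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_tags(task_tags, threshold=0.5):
    """Greedy partition: a task joins the first group whose seed task has
    sufficiently similar tags, otherwise it starts a new group."""
    groups = []  # list of (seed_tags, [task ids])
    for task_id, tags in task_tags.items():
        for seed, members in groups:
            if jaccard(seed, tags) >= threshold:
                members.append(task_id)
                break
        else:
            groups.append((set(tags), [task_id]))
    return [members for _, members in groups]


task_tags = {
    "t1": {"heat", "microwave"},
    "t2": {"heat", "microwave", "egg"},
    "t3": {"clean", "sink"},
    "t4": {"clean", "sink", "mug"},
}
groups = group_by_tags(task_tags)
print(groups)
```

Every task lands in exactly one group, so the result is a partition of the training set, matching the structure described above.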
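Each training step samples one group $G_m$ and starts with an empty SkillRepo. The system then iterates through the group's tasks sequentially, running the same retrieve, execute, judge, and curate loop on each task, so that skills written after earlier tasks are available to later ones.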
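The first task in each group always uses an empty SkillRepo, so its outcome is independent of curation. The task outcome reward is therefore computed only over tasks 2 through $|G|$:

$$r_{\text{task}} = \frac{1}{|G| - 1} \sum_{i=2}^{|G|} \mathbb{1}_{\xi_i}$$

where $\mathbb{1}_{\xi_i}$ is 1 if the trajectory for task $i$ is judged successful and 0 otherwise.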
This is the core trick: by grouping related tasks, the paper creates a within-group feedback loop where earlier curation decisions are evaluated by later task outcomes. The curator learns to write skills that help on future related tasks, not just skills that describe the current task.
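The exclusion of the first task is easy to get wrong, so here is a small sketch of the task outcome reward for one group. The function name is hypothetical.

```python
def task_outcome_reward(successes):
    """Mean success over tasks 2..|G|.

    The first entry is skipped because that task always runs with an
    empty SkillRepo, so it says nothing about curation quality.
    """
    evaluated = successes[1:]
    if not evaluated:
        raise ValueError("a group needs at least two tasks")
    return sum(evaluated) / len(evaluated)


# Four tasks: the first is ignored, two of the remaining three succeed.
reward = task_outcome_reward([False, True, False, True])
print(round(reward, 2))
```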
Interactive figure: related tasks are clustered into groups. Early tasks (darker) generate skills; later tasks (lighter) evaluate them.
Prior RL-based skill methods like ARISE and UMEM train on short task streams, often just 2 adjacent tasks. This limits the density of feedback: the curator only sees whether a skill helped the immediately next task. SkillOS's longer grouped streams ($|G|$ = 4-8 tasks) expose the curator to multi-hop skill evolution, where a skill inserted after one task can fail on a later task, be updated in response, and then succeed on a task after that.
This four-step feedback arc (insert, fail, update, succeed) cannot be learned from 2-task windows. Grouped streams provide the trajectory length needed to learn update and delete behaviors.
Grouped task streams provide the structure for learning curation. But we also need the right reward signal. A single "did the downstream task succeed?" reward is too sparse — the curator makes dozens of micro-decisions (which section to write, how verbose to be, whether to include "When NOT to Use") and needs finer-grained feedback.
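SkillOS addresses this with a composite reward that combines four signals, each targeting a different failure mode:

$$r = r_{\text{task}} + \lambda_f \, r_{\text{fc}} + \lambda_u \, r_{\text{cnt}} + \lambda_c \, r_{\text{comp}}$$

with weights $\lambda_f = 1.0$, $\lambda_u = 0.1$, $\lambda_c = 0.05$. Let's examine each component.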
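Task outcome reward ($r_{\text{task}}$). The primary signal: the average success rate over evaluation tasks (tasks 2 through $|G|$):

$$r_{\text{task}} = \frac{1}{|G| - 1} \sum_{i=2}^{|G|} \mathbb{1}_{\xi_i}$$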
What it catches: The overall quality of the curated SkillRepo. If the skills are good, downstream tasks succeed more often.
What it misses: Everything about HOW the curator produced those skills. A curator that writes valid, well-structured skills that happen not to be relevant to the evaluation tasks gets rtask = 0. We need additional signals to guide learning when task outcomes are uninformative.
Function call reward ($r_{\text{fc}}$). Measures whether the curator produces valid, executable function calls. It averages $\mathrm{Valid}(c_i)$ over the curation decisions in the group, where $\mathrm{Valid}(c_i)$ is the fraction of function calls in curation decision $c_i$ that parse correctly and execute successfully. An insert_skill call that references a malformed filename, or an update_skill call targeting a non-existent file, gets a score of 0.
What it catches: Formatting errors, hallucinated filenames, invalid JSON. Without this signal, the curator might spend many early training steps producing outputs that fail to execute at all.
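A minimal sketch of the validity fraction for a single curation decision follows. The flat `{"op", "name", "content"}` call format is an assumption, as is treating a decision with no calls as fully valid; filename checks are omitted for brevity.

```python
import json


def valid_fraction(raw_calls, existing_names):
    """Fraction of raw JSON calls that parse and could execute.

    Assumption: an empty decision counts as fully valid (1.0).
    """
    names = set(existing_names)
    ok = 0
    for raw in raw_calls:
        try:
            call = json.loads(raw)
            op, name = call["op"], call["name"]
        except (ValueError, KeyError, TypeError):
            continue  # invalid JSON or missing fields
        if op == "insert_skill":
            names.add(name)
        elif op == "update_skill" and name in names:
            pass  # target exists, so the update can run
        elif op == "delete_skill" and name in names:
            names.discard(name)
        else:
            continue  # unknown op, or target file does not exist
        ok += 1
    return ok / len(raw_calls) if raw_calls else 1.0


calls = [
    '{"op": "insert_skill", "name": "heat", "content": "# Workflow"}',
    '{"op": "update_skill", "name": "ghost", "content": "# Workflow"}',
    '{not json',
]
score = valid_fraction(calls, existing_names=[])
print(round(score, 2))
```

Only the first call counts: the second targets a file that does not exist, and the third fails to parse.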
Content quality reward ($r_{\text{cnt}}$). Uses an external judge (Qwen3-32B) to evaluate whether curated skills are semantically meaningful and likely useful.
What it catches: Low-quality content. A skill that just copies the raw trajectory verbatim gets a low judge score. A skill that extracts a clean, generalizable workflow gets a high one. This intermediate supervision is critical in a pipelined system where the curator never directly sees downstream task outcomes.
Ablating rcnt drops ALFWorld success from 61.2% to 58.6% — the largest drop among the auxiliary rewards.
Compression reward ($r_{\text{comp}}$). Discourages verbatim trajectory copying by rewarding concise repository updates.
The reward compares $|S_i|$, the token length of the SkillRepo after applying operations at step $i$, with $|\chi_i|$, the token length of the curator's input context. If the skills are shorter than the input (good: we compressed), the reward is positive. If the skills are longer than the input (bad: we're storing raw trajectories), the reward is negative.
What it catches: Bloated repositories. An important failure mode is the curator copying entire trajectories into skill files instead of distilling them into concise instructions. The compression reward explicitly penalizes this.
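The exact formula is not reproduced in this article, so the sketch below uses one simple form that matches the sign behavior described above: one minus the ratio of repository length to context length. Both that form and the whitespace token count (a stand-in for a real tokenizer) are assumptions.

```python
def compression_reward(repo_text, context_text):
    """Assumed form: 1 - |S| / |chi|, using whitespace tokens.

    Positive when the repo is shorter than the curator's context,
    negative when it is longer.
    """
    repo_len = len(repo_text.split())
    context_len = len(context_text.split())
    if context_len == 0:
        raise ValueError("curator context must not be empty")
    return 1.0 - repo_len / context_len


print(compression_reward("", "a b c d"))             # empty repo
print(compression_reward("a", "a b c d"))            # strong compression
print(compression_reward("a b c d e f", "a b c"))    # repo longer than context
```

Under this assumed form, an empty repository scores 1.0, which is consistent with the 0.05 compression term shown for the rollout that adds no skills in the worked example later in the article.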
Interactive figure: how each reward component contributes to the total. The paper uses $\lambda_f = 1.0$, $\lambda_u = 0.1$, $\lambda_c = 0.05$.
The paper sets λf = 1.0 (function call validity weighted equally with task outcome), λu = 0.1 (content quality is a soft guide, not a hard constraint), and λc = 0.05 (compression is a gentle nudge). This weighting makes sense: task outcome is the ground truth, function calls must be valid for anything to work, content quality is informative but subjective, and compression is a nice-to-have.
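The weighting can be written as a one-line function. The function and argument names are hypothetical; the default weights are the ones quoted above. The example inputs are back-solved from the first rollout in the worked example later in the article (0.07 / 0.1 and 0.04 / 0.05), so they are illustrative rather than reported values.

```python
def composite_reward(r_task, r_fc, r_cnt, r_comp,
                     lam_f=1.0, lam_u=0.1, lam_c=0.05):
    """Weighted sum of the four reward components."""
    return r_task + lam_f * r_fc + lam_u * r_cnt + lam_c * r_comp


total = composite_reward(r_task=1.0, r_fc=0.92, r_cnt=0.7, r_comp=0.8)
print(round(total, 2))  # 2.03
```

Because the content and compression weights are small, even a perfect score on both adds at most 0.15 to the total, so they can break ties between rollouts but rarely outweigh a difference in task success.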
Now we have the training structure (grouped task streams) and the reward signal (composite reward). How do we actually optimize the curator policy? SkillOS uses Group Relative Policy Optimization (GRPO), an RL algorithm originally developed for DeepSeek-Math.
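Standard policy gradient methods like PPO require a separate critic network, a value function that estimates expected future reward from each state. Training a critic for skill curation is problematic because the curator's reward is delayed, sparse, and noisy, so expected future reward is hard to estimate from any intermediate state.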
GRPO eliminates the critic entirely. Instead, it estimates advantages by comparing multiple rollouts of the same task group against each other.
For each task group G, SkillOS samples N independent rollouts from the curator policy. Each rollout produces a different sequence of curation decisions, which leads to a different SkillRepo evolution, which leads to different executor outcomes. This gives N composite reward values {r1, r2, ..., rN}.
The advantage for rollout $n$ is simply its reward minus the group mean:

$$A_n = r_n - \frac{1}{N} \sum_{j=1}^{N} r_j$$
That is: "How much better (or worse) was this rollout compared to the average?" No critic, no value function, just relative comparison within the group.
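The policy is then updated with a clipped surrogate objective (same as PPO's clipping):

$$J(\theta) = \mathbb{E}\left[\frac{1}{N} \sum_{n=1}^{N} \min\big(\rho_n A_n,\ \mathrm{clip}(\rho_n,\, 1 - \epsilon,\, 1 + \epsilon)\, A_n\big)\right]$$

Here $\rho_n = \pi_\theta(c_n \mid \chi) / \pi_{\theta_{\text{old}}}(c_n \mid \chi)$ is the importance ratio between the current curator policy $\pi_\theta$ (the policy $\pi_S$ being trained) and the old policy, and $\epsilon$ is the clipping range. The clipping prevents the policy from changing too drastically in one step.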
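The clipping behavior is easier to see numerically. Below is a pure-Python sketch of the clipped surrogate for a handful of rollouts, taking log-probabilities and advantages as plain lists. The function name and the clipping range `eps=0.2` are assumptions (the article does not state the value used), and a real implementation would operate on per-token tensors.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean over rollouts of min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Ratio of 2 with a positive advantage: the gain is capped at 1 + eps.
capped = clipped_surrogate([math.log(2.0)], [0.0], [1.0])
# Ratio of 2 with a negative advantage: the full penalty is kept.
uncapped = clipped_surrogate([math.log(2.0)], [0.0], [-1.0])
print(round(capped, 2), round(uncapped, 2))
```

The asymmetry is the point: the objective stops rewarding a rollout once its probability has moved far enough, but it never hides the cost of moving toward a bad one.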
| Hyperparameter | Value |
|---|---|
| Base model for πS | Qwen3-8B |
| Executor during training | Qwen3-8B (frozen) |
| Learning rate | $1 \times 10^{-6}$ |
| Batch size | 32 (task groups per batch) |
| Group size N (rollouts per group) | 8 |
| Hardware | 16 × H100 GPUs |
| Training time (ALFWorld) | ~3 days |
| Training time (WebShop) | ~5 days |
| Training time (Reasoning) | ~2.5 days |
| Framework | verl (HybridFlow) |
The paper provides a fascinating view of how the curator evolves during training (Figure 4). The operation distribution tells the story:
Early training (steps 1-10): Insert dominates at ~95%. The curator knows only one move: "See trajectory, write new skill." This is the naive behavior — pure expansion.
Mid training (steps 10-30): Update grows to ~25%. The curator learns that revising existing skills is more valuable than creating new ones. It starts recognizing when an existing skill almost matches but needs refinement.
Late training (steps 30+): Delete appears at ~5-8%. The curator learns to prune redundant or harmful skills. The SkillRepo becomes more curated, not just larger.
Interactive figure: N rollouts of the same task group produce different rewards. Advantages are computed relative to the group mean.
Let's trace a single training step. The batch samples group G = {Heat Egg, Heat Mug, Heat Apple, Heat Potato}. SkillRepo starts empty.
Rollout 1: After Task 1, curator inserts "Heating objects workflow" skill. Tasks 2-4 all succeed using this skill. rtask = 1.0. Total r = 1.0 + 0.92 + 0.07 + 0.04 = 2.03.
Rollout 2: After Task 1, curator inserts a very verbose skill (copies entire trajectory). Task 2 succeeds but slowly (executor confused by long skill). Task 3 fails. Task 4 succeeds. rtask = 0.67. Compression reward low (0.2). Total r = 0.67 + 0.85 + 0.05 + 0.01 = 1.58.
Rollout 3: Curator produces invalid JSON for the insert call. No skills are added. Tasks 2-4 run without skills. rtask = 0.33. rfc = 0. Total r = 0.33 + 0 + 0 + 0.05 = 0.38.
Mean reward: (2.03 + 1.58 + 0.38) / 3 = 1.33. Advantages: A1 = +0.70, A2 = +0.25, A3 = -0.95. GRPO reinforces Rollout 1's behavior and suppresses Rollout 3's.
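The arithmetic in this example can be checked directly. The sketch below sums the four already-weighted terms listed for each rollout, then subtracts the group mean to get the advantages.

```python
# Already-weighted terms as listed above: (task, function call, content, compression)
rollouts = {
    "rollout 1": (1.0, 0.92, 0.07, 0.04),
    "rollout 2": (0.67, 0.85, 0.05, 0.01),
    "rollout 3": (0.33, 0.0, 0.0, 0.05),
}

totals = {name: sum(parts) for name, parts in rollouts.items()}
mean = sum(totals.values()) / len(totals)
advantages = {name: total - mean for name, total in totals.items()}

for name in rollouts:
    print(f"{name}: r = {totals[name]:.2f}, A = {advantages[name]:+.2f}")
print(f"mean = {mean:.2f}")
```

Mean-centered advantages always sum to zero within a group, so reinforcing one rollout necessarily means suppressing another. Many GRPO implementations also divide by the group's standard deviation; the example here uses mean-centering only.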
SkillOS is evaluated across three benchmark categories with multiple executor backbones. The results tell a consistent story: trained curation beats both no-memory baselines and heuristic-based memory systems.
ALFWorld is a text-based environment where agents navigate rooms, manipulate objects, and complete household tasks ("Put a heated egg on the counter," "Examine the mug under desklamp"). There are 6 task subtypes: Pick, Look, Clean, Heat, Cool, and Pick2. Results are reported as success rate (SR) and average interaction steps.
With Qwen3-8B as executor:
| Method | Avg SR (%) | Steps |
|---|---|---|
| No Memory | 47.9 | 21.1 |
| ReasoningBank | 55.7 | 20.1 |
| MemP | 49.7 | 21.0 |
| SkillOS-base (no RL) | 53.1 | 20.4 |
| SkillOS-gemini (Gemini curator) | 50.7 | 20.8 |
| SkillOS | 61.2 | 18.9 |
Three things stand out. First, SkillOS beats the strongest baseline (ReasoningBank) by +5.5 absolute points. Second, SkillOS reduces interaction steps from 21.1 to 18.9 — the agent is not just more successful, it is faster. Third, the RL-trained 8B curator outperforms Gemini-2.5-Pro used directly as curator (SkillOS-gemini: 50.7%). A small, targeted model beats a frontier model at this specific skill.
WebShop simulates an online shopping environment. The agent navigates a web interface to find and purchase products matching user specifications. Metrics: score, success rate (SR), and interaction steps.
| Method | Score | SR (%) | Steps |
|---|---|---|---|
| No Memory | 33.3 | 9.8 | 20.3 |
| ReasoningBank | 35.4 | 11.4 | 20.5 |
| SkillOS-base | 38.6 | 13.6 | 20.1 |
| SkillOS | 40.6 | 16.5 | 19.4 |
SkillOS improves SR from 9.8% (no memory) to 16.5%, a 68% relative improvement. The gains persist with a stronger executor, though they are smaller in absolute terms: with Gemini-2.5-Pro as executor, SkillOS reaches 41.3% SR vs. 38.4% for no memory (+2.9 points).
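Absolute and relative gains are easy to conflate, so here is the arithmetic behind the numbers quoted in this article. The helper name is hypothetical.

```python
def relative_gain(baseline, improved):
    """Change relative to the baseline, as a fraction."""
    return (improved - baseline) / baseline


# WebShop success rate with Qwen3-8B executor: 9.8% -> 16.5%
webshop = relative_gain(9.8, 16.5)
# ALFWorld interaction steps: 21.1 -> 18.9
steps = relative_gain(21.1, 18.9)

print(f"{webshop:.1%}")  # 68.4%
print(f"{steps:.1%}")    # -10.4%
```

A large relative gain on a low baseline (9.8%) corresponds to a modest absolute gain of 6.7 points, which is worth keeping in mind when comparing results across executors.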
Single-turn reasoning tasks show more modest gains, but SkillOS still improves consistently:
| Method | AIME24 | AIME25 | GPQA | Avg |
|---|---|---|---|---|
| No Memory | 76.0 | 71.1 | 61.8 | 69.6 |
| ReasoningBank | 75.4 | 73.2 | 60.3 | 69.6 |
| SkillOS | 80.0 | 76.7 | 64.6 | 73.8 |
The gains are smaller (+4.2 average accuracy) because reasoning tasks benefit from more abstract skill types (decomposition heuristics, verification patterns) that are harder to capture in procedural skills. Still, SkillOS is the only method that consistently improves over no-memory.
A crucial test: does a curator trained with Qwen3-8B executor transfer to different executors? Yes. SkillOS lifts Gemini-2.5-Pro's ALFWorld SR from 66.4% to 80.2% — a +13.8 improvement, even though the curator never saw this executor during training.
Cross-task transfer (Figure 3 in the paper) also works: a curator trained on reasoning tasks improves ALFWorld performance by +13.3 with Qwen3-8B executor. The reasoning-trained curator learns abstract strategies (decomposition, verification, adaptive planning) that transfer to agentic tasks.
Interactive figure: performance comparison across methods and benchmarks.
The most fascinating finding in the paper is not the performance numbers — it is what happens inside the SkillRepo as training progresses. The curator does not just get better at inserting skills. It develops an entirely new organizational structure that was never explicitly programmed.
The skill format suggests three sections: Workflow, When NOT to Use, and Prerequisites. But SkillOS's trained curator creates additional sections that were never specified. Figure 5(a) in the paper tracks these emergent sections across training:
Early training: The curator adds generic sections — "Additional Guidance," "Tips and Recommendations," "Enhancement." These are verbose and add little operational value. They are the model's default verbosity patterns.
Late training: The sections become execution-oriented, for example a "Retry Logic" section.
RL gradually steered the curator from superficial enrichment toward execution-oriented skill refinement. The curator learned — through trial and error — that a "Retry Logic" section makes the executor more robust, while a "Tips" section just adds noise.
Even more remarkable: the SkillRepo develops skills about skills. Figure 5(b) tracks the evolution of skill categories:
Early SkillRepo: Dominated by narrow, task-specific skills. "How to heat an egg." "How to clean a mug." Each skill covers exactly one task variant.
Late SkillRepo: A diverse mix including meta-strategy skills:
The task-specific skills (e.g., "task-object specific," "task-location specific") shrink from dominating the repo to occupying less than 30%. The curator discovers that abstract, compositional skills are more valuable than narrow, task-specific ones.
Figure 6 in the paper compares skill usage statistics between SkillOS-base and SkillOS:
| Metric | SkillOS-base | SkillOS |
|---|---|---|
| Skill usage rate | 87.9% | 100% |
| Successful skill usage rate | 53.6% | 61.2% |
| Skill coverage | 72.9% | 88.6% |
| Avg skills per example | 2.24 | 1.95 |
SkillOS invokes skills on 100% of evaluation examples (vs. 87.9% for the base) and achieves higher success when doing so. Crucially, it uses fewer skills per example (1.95 vs. 2.24) while achieving better coverage of the repo (88.6% vs. 72.9%). The trained curator produces skills that are more precisely targeted — less noise, more signal.
Interactive figure: how the SkillRepo changes over training steps. Skills appear, merge, specialize, and develop meta-structure.
Early in training, the curator might create separate skills for "Heat egg in microwave" and "Heat mug in microwave." These have 80% identical content — both involve finding the microwave, putting the object in, and turning it on. Only the object name differs.
After RL training, the curator learns to create a single "Heat objects using microwave" skill with a conditional: "Works for any heatable object (egg, mug, apple, potato). Verify object is picked up before approaching microwave." This merged skill is more compact, more retrievable (matches more queries), and easier for the executor to follow.
The compression reward (rcomp) nudges this behavior, but it is the task outcome reward (rtask) that truly drives it: a merged skill that covers four task variants produces better outcomes than four fragmented skills that might or might not be retrieved.
SkillOS sits at the intersection of several active research areas. Let's map where it fits and what comes next.
| System | Memory Type | Curation Method | Key Difference from SkillOS |
|---|---|---|---|
| ReAct | None (stateless) | N/A | No memory at all — pure reasoning + acting |
| Reflexion | Text reflections | Prompted LLM | Stores verbal self-critiques, not reusable skills |
| ReasoningBank | Distilled insights | Prompted LLM | No RL training — heuristic curation only |
| MemP | Procedural memory | Heuristic operations | Fixed rules for memory management |
| Anthropic Skills | SKILL.md folders | Manual curation | Human-written skills — no automation |
| ARISE | Skill library | RL (retrieval + use) | Trains retrieval + execution, heuristic management |
| SkillRL / D2Skill | Pre-curated skills | RL (use only) | Trains agents to use skills, not to curate them |
| SkillOS | Markdown skills | RL (curation) | First to train curation end-to-end via long-horizon RL |
Retrieval bottleneck. SkillOS uses BM25 for skill retrieval — a lexical matching algorithm that cannot capture semantic similarity. A skill titled "Systematic Container Search" would not match a query about "finding hidden objects." Dense retrieval (embedding-based) could significantly improve skill utilization.
Fixed skill format. Skills are always single Markdown files. More complex formats — nested folder structures like Anthropic's full SKILL.md spec, or programmatic skills with executable code — could encode richer knowledge.
Training cost. Roughly 2.5 to 5 days on 16 H100 GPUs, depending on the benchmark, is substantial. Each training step requires rolling out entire task groups through the executor, which involves multiple LLM inference calls. More efficient training methods (offline RL, distillation from larger curators) could reduce this cost.
Catastrophic forgetting. The paper does not explore what happens when the task distribution shifts. A curator trained on ALFWorld heating tasks might produce inappropriate skills when suddenly faced with navigation tasks. Continual learning of the curator is an open problem.
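The retrieval bottleneck above is easy to demonstrate. BM25 only scores terms that appear in both the query and the document, so two phrases with no words in common score zero regardless of meaning. The helper below is a hypothetical illustration using the example from the retrieval paragraph.

```python
def shared_terms(a, b):
    """Lowercased whitespace tokens that appear in both strings."""
    return set(a.lower().split()) & set(b.lower().split())


overlap = shared_terms("Systematic Container Search", "finding hidden objects")
print(overlap)  # set()
```

With an empty overlap, every term in the BM25 sum is skipped, which is why an embedding-based retriever is a natural candidate for this limitation.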
SkillOS demonstrates a powerful principle: you can train a small model to be an excellent specialist. An 8B model trained specifically for skill curation outperforms a frontier model doing the same task zero-shot. This suggests a future architecture for AI agents:
This multi-agent modular design mirrors how human organizations work: senior engineers solve problems, technical writers maintain the knowledge base, and everyone benefits from shared documentation.
Where SkillOS fits among memory-based agent systems. Axes: curation automation (x) and feedback horizon (y).
"What I cannot create, I do not understand. What I cannot curate, I cannot scale."
— Adapted from Richard Feynman