Swapping a cloud model for a local 9B model drops accuracy by 25–39 pp. Not because the small model is dumb — because the entire stack was co-designed around the cloud model. OpenJarvis decomposes the stack into typed primitives and searches for a configuration where the 9B model matches or exceeds cloud performance — at 800× lower cost.
You have a personal AI assistant running on your laptop. It helps you write code, search the web, manage files, answer questions. But it runs through the cloud — every query goes to Claude Opus 4.6 or GPT-4.1, costs money, leaks your data, and adds latency. You want to run the whole thing locally on your own hardware.
So you do the obvious thing: you swap the cloud model for a local one. You download Qwen3.5-9B, a capable 9-billion-parameter model that fits in your GPU’s VRAM. You point your existing AI stack at it — same prompts, same tools, same agent loop — just a different model endpoint.
And accuracy craters.
Why so catastrophic? The prompts were written for a specific model. The agent loop assumes a certain reasoning depth. The tool descriptions expect a certain level of instruction-following. The system is a monolith — every piece was co-designed around the cloud model, and none of it adapts when you change the brain.
This is the substitution catastrophe: the gap between cloud and local is not primarily about model capability. It is about architectural coupling. A 9B model can do these tasks — if the surrounding system is reconfigured for it.
OpenJarvis proves this. By decomposing the stack into independently-tunable primitives and searching for the right configuration, a local 9B model matches or exceeds cloud performance on 4 out of 8 benchmarks, at 800× lower cost per query.
Accuracy of different AI stacks when you swap the cloud model (Claude Opus 4.6) for a local model (Qwen3.5-9B). Blue bars = cloud. Orange bars = local drop-in. Green bars = local with OpenJarvis spec.
Existing AI stacks treat the system as a black box. You have a model, some prompts, an agent loop, and some tools. They are all tangled together in code. Changing one piece requires rewriting others. There is no clean interface between components.
OpenJarvis’s insight is: decompose the stack into typed, independently-optimizable primitives. If each piece has a well-defined interface and can be changed without breaking the others, then you can search for the right combination — and that search can be automated.
Think of it like a car. A monolithic stack is a car where the engine, transmission, suspension, and wheels are welded together. Want a different engine? Too bad — you need a new car. OpenJarvis gives you a car with standardized bolt patterns. Swap the engine, keep the transmission. Change the suspension, keep the wheels. Each primitive is a replaceable module with defined inputs and outputs.
The implication is that the gap between cloud and local AI is not an insurmountable capability chasm. It is a configuration problem, and configuration problems can be solved by search.
But search over what space? How do you enumerate the knobs? That requires defining the primitives precisely — which brings us to the five primitives of OpenJarvis.
OpenJarvis decomposes every personal AI system into exactly five primitives. This is not an arbitrary choice — it is the minimal set that captures the full optimization surface. Miss one, and you leave performance on the table. Add more, and you introduce unnecessary complexity.
Intelligence: the model itself. Its architecture, its weights, its generation parameters (temperature, top-p, top-k, repetition penalty), and its quantization format (FP16, INT8, INT4, GGUF). This is what most people think of when they think of AI optimization, but as we’ll see, it accounts for only 16–44% of the edits that actually improve performance.
Engine: the inference runtime. This is how the model runs, not what it is. Options include Ollama, vLLM, llama.cpp, and SGLang. Engine parameters include batch size, KV cache strategy (paged vs. contiguous), speculative decoding, and context length. A model can behave very differently depending on its engine: llama.cpp with Q4_K_M quantization produces different outputs than vLLM with AWQ.
Agents: the reasoning loop. This is how the model decides what to do at each step. Options include ReAct (reason-then-act), CodeAct (generate and execute code as the action), Plan-then-Execute, and simple single-turn prompting. Agent parameters include the system prompt, the maximum number of reasoning steps, whether to use chain-of-thought, and the tool-use policy (parallel vs. sequential tool calls).
Tools & Memory: the external interfaces. What tools does the model have access to? How are they described in the prompt? What retrieval mechanisms are available (vector DB, keyword search, hybrid)? What persistent state does the system maintain about the user? Tool descriptions are surprisingly critical: a vague description can cause a small model to misuse a tool that a large model would use correctly.
Learning: the meta-primitive. Learning is the optimizer that updates the other four primitives from execution traces. It is what makes the system self-improving. Learning includes which cloud model serves as the optimizer, what kind of edits it proposes, the gating criteria for accepting edits, and the training data selection strategy.
Why this particular decomposition? Because it covers the four dimensions of variation the authors observed in real AI stacks:
| Dimension | Primitive | Example Edit |
|---|---|---|
| What the model knows | Intelligence | Switch from INT4 to INT8 quantization |
| How the model runs | Engine | Switch from Ollama to vLLM, enable speculative decoding |
| How the model reasons | Agents | Switch from ReAct to CodeAct, rewrite system prompt |
| What the model accesses | Tools & Memory | Add a code sandbox tool, refine search tool description |
A Spec is a typed configuration object that fully specifies a personal AI system. It is TOML-serializable, shareable, versionable, and — crucially — evaluable. Given a Spec and a benchmark, you can measure accuracy, latency, energy, and cost per query in a single evaluation pass.
Here is what an actual Spec looks like:
```toml
# OpenJarvis Spec — Qwen3.5-9B for SWE-Bench
[intelligence]
model = "Qwen/Qwen3.5-9B"
quantization = "awq-int4"
temperature = 0.6
top_p = 0.95
top_k = 40
max_tokens = 4096
rep_penalty = 1.05

[engine]
runtime = "vllm"
batch_size = 1
kv_cache = "paged"
context_len = 32768
spec_decode = false

[agents]
loop = "codeact"
max_steps = 30
cot = true
tool_policy = "sequential"
system_prompt = """
You are a software engineer. Read the issue.
Write a patch that fixes the bug.
Always test your code before submitting.
Use the bash tool to navigate the repo.
"""

[tools]
enabled = ["bash", "file_editor", "search"]
sandbox = true
timeout_s = 120

[memory]
retrieval = "none"
context_window = "sliding"
```
A Spec is not evaluated on accuracy alone. The OpenJarvis evaluation function measures five dimensions simultaneously:
| Metric | Symbol | What It Measures | Unit |
|---|---|---|---|
| Accuracy | R_acc | Task success rate on the benchmark | % (0–100) |
| Energy | Ê | Joules consumed per query | J/query |
| Latency | L̂ | Wall-clock time per query | seconds |
| Cost | Ĉ | Marginal API/compute cost per query | $/query |
| Power | P | Average watts during inference | W |
This multi-dimensional evaluation is why OpenJarvis finds Pareto-optimal specs: you can trade accuracy for latency, or accuracy for cost, and the search explores these tradeoffs systematically.
Now the payoff. We have a Spec that decomposes the system into primitives. We have a multi-dimensional evaluation function. The question is: how do we find the right Spec — the one where the local 9B model actually works?
The answer is Algorithm 1: LLM-Guided Spec Search. It is a loop where a cloud frontier model (the “proposer”) reads execution traces from the current Spec, identifies failure clusters, proposes coordinated edits across all four runtime primitives, and gates those edits to ensure they help without causing regression.
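The shape of that loop can be sketched in a few lines. This is a minimal illustration, not the paper's Algorithm 1: the `spec_search` function, its callable arguments, and the toy stand-ins are all hypothetical. In the real system the proposer is a cloud model reading execution traces, and evaluation runs a benchmark.

```python
from typing import Callable, Dict, Optional, Tuple

Spec = Dict[str, object]      # simplified stand-in for a full Spec
Scores = Dict[str, float]     # failure-cluster name -> accuracy in [0, 1]
Proposal = Tuple[str, Spec]   # (target cluster, fields to overwrite)


def spec_search(
    spec: Spec,
    evaluate: Callable[[Spec], Scores],
    propose: Callable[[Spec, Scores], Optional[Proposal]],
    gate: Callable[[Scores, Scores, str], bool],
    max_iters: int = 20,
) -> Spec:
    """Propose an edit, evaluate it, keep it only if the gate accepts."""
    scores = evaluate(spec)
    for _ in range(max_iters):
        proposal = propose(spec, scores)
        if proposal is None:          # proposer sees nothing left to fix
            break
        target, edit = proposal
        candidate = {**spec, **edit}  # apply the edit to a copy
        candidate_scores = evaluate(candidate)
        if gate(scores, candidate_scores, target):
            spec, scores = candidate, candidate_scores
    return spec


# Toy stand-ins so the loop runs end to end (all hypothetical).
def toy_evaluate(spec: Spec) -> Scores:
    looping_ok = 0.75 if spec["max_steps"] <= 30 else 0.0
    return {"looping": looping_ok, "other": 0.9}


def toy_propose(spec: Spec, scores: Scores) -> Optional[Proposal]:
    if scores["looping"] < 0.5:
        return "looping", {"max_steps": 30}
    return None


def toy_gate(old: Scores, new: Scores, target: str, eps: float = 0.01) -> bool:
    others_ok = all(new[c] >= old[c] - eps for c in old if c != target)
    return new[target] > old[target] and others_ok


best = spec_search({"max_steps": 100}, toy_evaluate, toy_propose, toy_gate)
print(best)
```

Passing the proposer, evaluator, and gate in as callables keeps the loop itself independent of which model proposes edits or which benchmark scores them.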
Suppose 12 out of 40 training examples fail because the agent enters an infinite reasoning loop. The cloud model reads these 12 traces and proposes:
The gate evaluates: 9 of the 12 looping examples now succeed (target cluster improves). The other 28 examples still pass at the same rate (no regression beyond ε). Edit accepted.
The gate is the quality control mechanism of Spec Search. Without it, the optimizer would oscillate — fixing one failure mode while breaking another, over and over. The gate ensures monotonic improvement: each accepted edit makes the system at least as good as before, with real progress on its target cluster.
Let S be the current Spec and S′ be the proposed Spec (after applying the edit). Let c be the target cluster and c′ be any non-target cluster. Let Gc(S) be the accuracy of Spec S on cluster c. The gate accepts edit e if and only if:
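Written out from the verbal description of the gate, using the symbols just defined, the condition can be stated as:

$$
G_c(S') > G_c(S) \quad \text{and} \quad G_{c'}(S') \ge G_{c'}(S) - \varepsilon \quad \text{for all } c' \neq c
$$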
In words: the target cluster must strictly improve, and every other cluster must not regress by more than ε. The default ε is 1%, meaning the optimizer tolerates at most 1 percentage point of regression on any non-target cluster.
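As a sketch, the gate is a short predicate over per-cluster accuracies. The function name `gate_accepts` and the cluster labels are illustrative. The 0.90 pass rate for the non-target cluster is an assumed value, since the looping example only says those 28 examples pass "at the same rate".

```python
from typing import Dict


def gate_accepts(
    current: Dict[str, float],
    proposed: Dict[str, float],
    target: str,
    eps: float = 0.01,
) -> bool:
    """Accept only if the target cluster strictly improves and no other
    cluster regresses by more than eps (accuracies are in [0, 1])."""
    if proposed[target] <= current[target]:
        return False
    return all(
        proposed[c] >= current[c] - eps
        for c in current
        if c != target
    )


# Looping example: 0 of 12 looping cases passed before, 9 of 12 after.
current = {"looping": 0 / 12, "other": 0.90}
proposed = {"looping": 9 / 12, "other": 0.90}
accepted = gate_accepts(current, proposed, target="looping")

# Same fix, but it costs 5 points elsewhere: rejected with eps = 0.01.
regressed = {"looping": 9 / 12, "other": 0.85}
rejected = not gate_accepts(current, regressed, target="looping")

print(accepted, rejected)
```

Raising `eps` makes the gate more permissive, which trades the monotonic-improvement guarantee for faster acceptance of edits.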
When the edit involves changing the model (Intelligence primitive), the gate needs a richer reward signal than just accuracy. Running a bigger quantization costs more energy and latency. The composite reward balances these:
Where:
Suppose the current Intelligence spec uses INT4 quantization and the optimizer proposes switching to INT8. Here is the gate check for one cluster:
| Metric | INT4 (current) | INT8 (proposed) |
|---|---|---|
| R_acc (accuracy) | 0.70 | 0.82 |
| Ê (energy, normalized) | 0.30 | 0.45 |
| L̂ (latency, normalized) | 0.25 | 0.35 |
| Ĉ (cost, normalized) | 0.20 | 0.20 |
Current reward:
Proposed reward:
R′ = 0.270 > R = 0.235. Target cluster improved. If non-target clusters also pass the ε check, this edit is accepted.
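The two totals can be checked numerically. The sketch below assumes a linear reward: weighted accuracy minus weighted energy, latency, and cost, with weights 0.5, 0.1, 0.1, and 0.3. Both the functional form and the weights are assumptions, chosen only because they reproduce the totals quoted above; the paper's actual definition may differ.

```python
from typing import Dict

# ASSUMED weights: they reproduce R = 0.235 and R' = 0.270 but are
# not taken from the source.
WEIGHTS = {"acc": 0.5, "energy": 0.1, "latency": 0.1, "cost": 0.3}


def composite_reward(m: Dict[str, float], w: Dict[str, float] = WEIGHTS) -> float:
    """Assumed linear form: reward accuracy, penalize normalized
    energy, latency, and cost."""
    return (
        w["acc"] * m["acc"]
        - w["energy"] * m["energy"]
        - w["latency"] * m["latency"]
        - w["cost"] * m["cost"]
    )


# Values from the INT4 vs. INT8 table above.
int4 = {"acc": 0.70, "energy": 0.30, "latency": 0.25, "cost": 0.20}
int8 = {"acc": 0.82, "energy": 0.45, "latency": 0.35, "cost": 0.20}

r_current = composite_reward(int4)
r_proposed = composite_reward(int8)
print(f"R = {r_current:.3f}, R' = {r_proposed:.3f}")  # R = 0.235, R' = 0.270
```

Under these assumed weights, the 12-point accuracy gain from INT8 outweighs its higher energy and latency, which matches the direction of the comparison above.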
The cleanest test of the Spec abstraction is a controlled substitution experiment. Take the same model (Qwen3.5-9B), the same benchmarks, and two different stacks: one designed as a monolith (OpenClaw, Hermes Agent), and one designed with the Spec abstraction (OpenJarvis). How much does each drop when you swap cloud for local?
The paper reports results on 8 benchmarks. Here are the four with the most dramatic differences:
| Benchmark | Cloud (Opus 4.6) | OpenClaw + Qwen | Δ vs. cloud | OpenJarvis + Qwen | Δ vs. cloud |
|---|---|---|---|---|---|
| SWE-Bench | 79.4% | 45.7% | −33.7 pp | 71.8% | −7.6 pp |
| LiveCodeBench | 73.2% | 47.8% | −25.4 pp | 83.0% | +9.8 pp |
| PinchBench | 88.5% | 49.3% | −39.2 pp | 100.0% | +11.5 pp |
| LiveResearchBench | 82.7% | 55.4% | −27.3 pp | 91.0% | +8.3 pp |
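The difference columns are plain subtractions against the cloud score. A quick check in Python, using only the numbers in the table (the variable names are illustrative):

```python
# (cloud, monolithic drop-in, OpenJarvis spec) accuracy in percent.
results = {
    "SWE-Bench": (79.4, 45.7, 71.8),
    "LiveCodeBench": (73.2, 47.8, 83.0),
    "PinchBench": (88.5, 49.3, 100.0),
    "LiveResearchBench": (82.7, 55.4, 91.0),
}

deltas = {}
for name, (cloud, drop_in, openjarvis) in results.items():
    # Negative = below the cloud baseline, positive = above it.
    deltas[name] = (round(drop_in - cloud, 1), round(openjarvis - cloud, 1))
    print(f"{name}: drop-in {deltas[name][0]:+.1f} pp, "
          f"OpenJarvis {deltas[name][1]:+.1f} pp")

# Range of the drop-in losses across these four benchmarks.
worst = min(d[0] for d in deltas.values())
mildest = max(d[0] for d in deltas.values())
print(f"drop-in loss ranges from {mildest} to {worst} pp")
```

On these four benchmarks the drop-in loss spans roughly 25 to 39 pp, while the OpenJarvis spec is within 7.6 pp on SWE-Bench and ahead of cloud on the other three.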
OpenJarvis and the monolithic stacks run the exact same model weights (Qwen3.5-9B). The only difference is everything else: the prompts, the agent loop, the tool descriptions, and the engine configuration.
Side-by-side comparison of monolithic stacks vs. OpenJarvis on the same local model.
Performance is not a single number. A system that gets 95% accuracy but costs $0.50 per query is very different from one that gets 88% accuracy at $0.0006 per query. OpenJarvis explores this tradeoff space explicitly, producing a Pareto frontier — the set of specs where you cannot improve one metric without degrading another.
| System | Avg Accuracy | Cost/Query | Latency | Where It Runs |
|---|---|---|---|---|
| Cloud (Claude Opus 4.6) | ~80% | $0.48 | ~12s | Cloud |
| Cloud (GPT-4.1) | ~75% | $0.32 | ~8s | Cloud |
| OpenJarvis Spec (Qwen 9B) | ~77% | $0.0006 | ~3s | Your laptop |
The latency improvement is equally dramatic: 4× faster. Cloud queries take ~12 seconds (network round-trip + cloud queue + inference). Local queries take ~3 seconds (just inference, no network). For interactive use, this is the difference between “waiting for a response” and “instant.”
The paper reconstructs the accuracy-vs-cost Pareto frontier across all evaluated systems. The key insight: OpenJarvis specs cluster in the bottom-right corner — high accuracy, low cost. Cloud systems cluster in the top-right: high accuracy but extremely high cost. The Pareto frontier shows that once you cross a threshold of optimization effort, the on-device systems dominate the cost axis.
Accuracy-vs-cost plot of all evaluated systems. The dashed line shows the Pareto frontier; dominated systems are worse on both axes than some other system. OpenJarvis specs cluster in the sweet spot: near-cloud accuracy at local-device cost.
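To make "dominated" concrete, here is a minimal sketch that computes the accuracy-vs-cost frontier for the three systems in the table above. The accuracy values are the approximate averages from that table, and the `System`, `dominates`, and `pareto_frontier` names are illustrative.

```python
from typing import List, NamedTuple


class System(NamedTuple):
    name: str
    accuracy: float  # percent, higher is better
    cost: float      # dollars per query, lower is better


def dominates(a: System, b: System) -> bool:
    """a dominates b if it is no worse on both axes and better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(systems: List[System]) -> List[System]:
    """Keep every system that no other system dominates."""
    return [s for s in systems if not any(dominates(o, s) for o in systems)]


systems = [
    System("Cloud (Claude Opus 4.6)", 80.0, 0.48),
    System("Cloud (GPT-4.1)", 75.0, 0.32),
    System("OpenJarvis Spec (Qwen 9B)", 77.0, 0.0006),
]

frontier = pareto_frontier(systems)
print([s.name for s in frontier])

# Cost ratio between the most accurate cloud system and the local spec.
cost_ratio = round(systems[0].cost / systems[2].cost)
print(f"{cost_ratio}x cheaper per query")
```

With these numbers, GPT-4.1 is dominated by the OpenJarvis spec (lower accuracy and higher cost), while Claude Opus 4.6 stays on the frontier because it is still the most accurate option.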
The average accuracy gap between the best cloud system and the best OpenJarvis spec is just 3.2 pp across all 8 benchmarks. On 4 benchmarks, the gap is negative — the local spec wins. This 3.2 pp gap is the price of full data privacy, 800× cost reduction, and 4× latency reduction. For most personal AI use cases, that is a bargain.
The ablation study in Figure 8 of the paper answers two questions: (1) Does the proposer matter (LLM vs. evolutionary vs. random)? (2) Does the move space matter (1 primitive vs. 4 primitives)?
The authors cross three proposer strategies with three move-space sizes:
| Proposer | 1 Primitive | 2 Primitives | 4 Primitives |
|---|---|---|---|
| LLM Proposer | 77.5% | 87.0% | 93.3% |
| Evolutionary | 70.0% | 79.5% | 83.8% |
| Random | 65.5% | 75.0% | 80.5% |
(Numbers are averaged across benchmarks for illustration; actual per-benchmark ranges are: LLM-4 = 83–100%, Evolutionary-4 = 73–94.5%, Random-4 = 69–92.5%.)
The evolutionary proposer mutates edits randomly (e.g., “change temperature from 0.7 to 0.6”) and selects the best. The LLM proposer reads the failure traces and proposes targeted fixes. It can say: “These 8 failures all happen when the agent tries to use the bash tool for file editing — the bash tool description is confusing. I will: (1) clarify the bash tool description, (2) add a file_editor tool, (3) update the system prompt to prefer file_editor for edits.”
This is a coordinated, multi-primitive edit informed by causal diagnosis. The evolutionary proposer would need to stumble upon this combination by chance.
The paper reports the distribution of accepted edits across primitives (Table 10), and the pattern varies dramatically by benchmark type:
| Benchmark Type | Intelligence | Agent | Tools | Engine |
|---|---|---|---|---|
| Coding (SWE-Bench, LiveCode) | 44% | 28% | 18% | 10% |
| Customer Service (Tau2) | 16% | 46% | 24% | 14% |
| Research (LiveResearch) | 22% | 20% | 42% | 16% |
This is why single-primitive optimization fails. If you only tune the model (Intelligence), you forgo the 56–84% of accepted edits that land in the other three runtime primitives. The spec search must touch all four primitives.
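The 56–84% range follows directly from the table. A short check, using the table's percentages as-is (the dictionary layout is illustrative):

```python
# Share of accepted edits per primitive, in percent, from the table above.
accepted_edits = {
    "Coding": {"Intelligence": 44, "Agent": 28, "Tools": 18, "Engine": 10},
    "Customer Service": {"Intelligence": 16, "Agent": 46, "Tools": 24, "Engine": 14},
    "Research": {"Intelligence": 22, "Agent": 20, "Tools": 42, "Engine": 16},
}

non_intelligence = {}
dominant = {}
for bench, shares in accepted_edits.items():
    assert sum(shares.values()) == 100  # each row is a full distribution
    non_intelligence[bench] = 100 - shares["Intelligence"]
    dominant[bench] = max(shares, key=shares.get)
    print(f"{bench}: {non_intelligence[bench]}% outside Intelligence, "
          f"most edits in {dominant[bench]}")
```

The dominant primitive changes with the task: Intelligence for coding, Agent for customer service, Tools for research.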
LLM-guided spec search is 7–11× cheaper than single-primitive baselines at the same accuracy level. The reason: the LLM proposer makes informed edits, so fewer iterations are wasted. Random search at 4 primitives needs ~120 iterations to converge; LLM search needs ~15–25.
The grid shows average accuracy for each combination of proposer strategy and number of editable primitives. Brighter = higher accuracy. Both axes contribute — the best cell (LLM × 4 primitives) is brightest.
| System | What It Optimizes | How OpenJarvis Differs |
|---|---|---|
| GEPA | Prompts via reflective mutation | GEPA optimizes prompts only; OpenJarvis optimizes all 4 runtime primitives jointly |
| DSPy / MIPROv2 | Prompts + few-shot examples | DSPy optimizes within a single primitive (prompts). OpenJarvis adds Engine, Tools, Intelligence to the search space |
| LoRA | Model weights via low-rank adapters | Weight updates touch only one of the 4 runtime primitives (Intelligence). OpenJarvis finds that 56–84% of accepted edits fall in the other primitives (Agent, Tools, Engine) |
| Minions | Local-cloud collaboration at inference | Minions requires cloud at inference time. OpenJarvis uses cloud only during optimization; inference is fully local |
| Archon | Multi-LLM ensemble architecture | Archon ensembles multiple models. OpenJarvis optimizes a single-model stack for on-device deployment |