OpenJarvis — Veanors

Chapter 0: The Problem

You have a personal AI assistant running on your laptop. It helps you write code, search the web, manage files, answer questions. But it runs through the cloud — every query goes to Claude Opus 4.6 or GPT-4.1, costs money, leaks your data, and adds latency. You want to run the whole thing locally on your own hardware.

So you do the obvious thing: you swap the cloud model for a local one. You download Qwen3.5-9B, a capable 9-billion-parameter model that fits in your GPU’s VRAM. You point your existing AI stack at it — same prompts, same tools, same agent loop — just a different model endpoint.

And accuracy craters.

The substitution catastrophe: On OpenClaw (a coding agent stack), Claude Opus 4.6 scores 79.4% on SWE-Bench. Drop in Qwen3.5-9B with zero changes? 45.7%. That is a 33.7 percentage-point collapse. On Hermes Agent (a general assistant stack), the drop is 25–39 pp across eight benchmarks. The system doesn’t just degrade gracefully — it falls off a cliff.

Why so catastrophic? The prompts were written for a specific model. The agent loop assumes a certain reasoning depth. The tool descriptions expect a certain level of instruction-following. The system is a monolith — every piece was co-designed around the cloud model, and none of it adapts when you change the brain.

This is the substitution catastrophe: the gap between cloud and local is not primarily about model capability. It is about architectural coupling. A 9B model can do these tasks — if the surrounding system is reconfigured for it.

OpenJarvis proves this. By decomposing the stack into independently-tunable primitives and searching for the right configuration, a local 9B model matches or exceeds cloud performance on 4 out of 8 benchmarks, at 800× lower cost per query.

The Substitution Catastrophe

Figure: accuracy of different AI stacks when you swap the cloud model (Claude Opus 4.6) for a local model (Qwen3.5-9B). Blue bars = cloud. Orange bars = local drop-in. Green bars = local with OpenJarvis spec.

Why does swapping a cloud model for a local model cause such a large accuracy drop in existing AI stacks?

• Because local models have less training data and cannot learn the same tasks
• Because the entire stack (prompts, agent loop, tool descriptions) was co-designed around the cloud model, and none of it adapts when you change the model
• Because local hardware cannot run inference fast enough for the agent loop to work

Chapter 1: The Key Insight

Existing AI stacks treat the system as a black box. You have a model, some prompts, an agent loop, and some tools. They are all tangled together in code. Changing one piece requires rewriting others. There is no clean interface between components.

OpenJarvis’s insight is: decompose the stack into typed, independently-optimizable primitives. If each piece has a well-defined interface and can be changed without breaking the others, then you can search for the right combination — and that search can be automated.

Monolithic Stack (Before)

Model + prompts + agent loop + tools are entangled in code. Changing the model breaks everything. Only option: accept the accuracy drop or stay in the cloud.

↓ decompose ↓

OpenJarvis Stack (After)

Five typed primitives: Intelligence, Engine, Agents, Tools & Memory, Learning. Each has a schema. Each can be changed independently. The optimizer (the Learning primitive) can search across the other four.

Think of it like a car. A monolithic stack is a car where the engine, transmission, suspension, and wheels are welded together. Want a different engine? Too bad — you need a new car. OpenJarvis gives you a car with standardized bolt patterns. Swap the engine, keep the transmission. Change the suspension, keep the wheels. Each primitive is a replaceable module with defined inputs and outputs.

The drop is architectural, not capability-bound. When the authors gave Qwen3.5-9B the right prompts, the right agent loop, the right tool descriptions, and the right inference parameters — all tuned jointly — it matched or exceeded the cloud model on half the benchmarks. The small model was never the bottleneck. The configuration was.

This is profound. It means the gap between cloud and local AI is not an insurmountable capability chasm. It is a configuration problem — and configuration problems can be solved by search.

But search over what space? How do you enumerate the knobs? That requires defining the primitives precisely — which brings us to the five primitives of OpenJarvis.

What is the core realization that makes OpenJarvis possible?

• That local models need to be fine-tuned on the same data as cloud models
• That agent loops should always use ReAct-style reasoning
• That the cloud-local accuracy gap is a configuration problem, not a capability problem, and can be solved by decomposing the stack into independently-optimizable primitives

Chapter 2: The Five Primitives

OpenJarvis decomposes every personal AI system into exactly five primitives. This is not an arbitrary choice — it is the minimal set that captures the full optimization surface. Miss one, and you leave performance on the table. Add more, and you introduce unnecessary complexity.

1. Intelligence

The model itself. Its architecture, its weights, its generation parameters (temperature, top-p, top-k, repetition penalty), and its quantization format (FP16, INT8, INT4, GGUF). This is what most people think of when they think of AI optimization — but as we’ll see, it accounts for only 16–44% of the edits that actually improve performance.

2. Engine

The inference runtime. This is how the model runs, not what it is. Options include Ollama, vLLM, llama.cpp, and SGLang. Engine parameters include batch size, KV cache strategy (paged vs. contiguous), speculative decoding, and context length. A model can behave very differently depending on its engine — llama.cpp with Q4_K_M quantization produces different outputs than vLLM with AWQ.

3. Agents

The reasoning loop. This is how the model decides what to do at each step. Options include ReAct (reason-then-act), CodeAct (generate and execute code as the action), Plan-then-Execute, and simple single-turn prompting. Agent parameters include the system prompt, the maximum number of reasoning steps, whether to use chain-of-thought, and the tool-use policy (parallel vs. sequential tool calls).

4. Tools & Memory

The external interfaces. What tools does the model have access to? How are they described in the prompt? What retrieval mechanisms are available (vector DB, keyword search, hybrid)? What persistent state does the system maintain about the user? Tool descriptions are surprisingly critical — a vague description can cause a small model to misuse a tool that a large model would use correctly.

5. Learning

The meta-primitive. Learning is the optimizer that updates the other four primitives from execution traces. It is what makes the system self-improving. Learning includes: which cloud model serves as the optimizer, what kind of edits it proposes, the gating criteria for accepting edits, and the training data selection strategy.

Learning is special: it is the only primitive that does not run at inference time. Once optimization is done, the Learning primitive’s job is finished. The resulting spec (Intelligence + Engine + Agents + Tools/Memory) runs entirely on-device. Learning exists only during the optimization phase — it is the compiler, not the runtime.

The Five Primitives

Figure: the five primitives with their editable fields and example values. The Learning primitive (bottom) optimizes the other four.

Why this particular decomposition? Because it covers the four dimensions of variation the authors observed in real AI stacks:

Dimension	Primitive	Example Edit
What the model knows	Intelligence	Switch from INT4 to INT8 quantization
How the model runs	Engine	Switch from Ollama to vLLM, enable speculative decoding
How the model reasons	Agents	Switch from ReAct to CodeAct, rewrite system prompt
What the model accesses	Tools & Memory	Add a code sandbox tool, refine search tool description

Why is the Learning primitive NOT included in the final on-device spec?

• Because it runs only during optimization: it is the compiler that produces the runtime spec, not part of the runtime itself
• Because it requires too much memory to run on-device
• Because learning is always done in the cloud for security reasons

Chapter 3: The Spec Abstraction

A Spec is a typed configuration object that fully specifies a personal AI system. It is TOML-serializable, shareable, versionable, and — crucially — evaluable. Given a Spec and a benchmark, you can measure accuracy, latency, energy, and cost per query in a single evaluation pass.

Here is what an actual Spec looks like:

```toml
# OpenJarvis Spec: Qwen3.5-9B for SWE-Bench

[intelligence]
model        = "Qwen/Qwen3.5-9B"
quantization = "awq-int4"
temperature  = 0.6
top_p        = 0.95
top_k        = 40
max_tokens   = 4096
rep_penalty  = 1.05

[engine]
runtime     = "vllm"
batch_size  = 1
kv_cache    = "paged"
context_len = 32768
spec_decode = false

[agents]
loop        = "codeact"
max_steps   = 30
cot         = true
tool_policy = "sequential"
system_prompt = """
You are a software engineer. Read the issue.
Write a patch that fixes the bug.
Always test your code before submitting.
Use the bash tool to navigate the repo.
"""

[tools]
enabled     = ["bash", "file_editor", "search"]
sandbox     = true
timeout_s   = 120

[memory]
retrieval      = "none"
context_window = "sliding"
```

Think of the Spec as a recipe. Sharing a Spec is like sharing a recipe for a dish. The ingredients (model weights) are the same for everyone who downloads Qwen3.5-9B. The recipe (the Spec) tells you how to combine them — what temperature to cook at, what order to add ingredients, what tools to use. Two different recipes for the same ingredients can produce wildly different dishes.
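One way to read "typed" is that every field has a declared type and an allowed range, so an invalid edit fails when the Spec is constructed instead of halfway through a benchmark run. The sketch below assumes Python dataclasses; the class names, the set of allowed loop names, and the range checks are illustrative assumptions, not the OpenJarvis schema.

```python
from dataclasses import dataclass, field, asdict

# Hypothetical identifiers for the agent loops named in the text.
ALLOWED_LOOPS = {"react", "codeact", "plan_then_execute", "single_turn"}


@dataclass(frozen=True)
class Intelligence:
    model: str
    quantization: str = "awq-int4"
    temperature: float = 0.6
    top_p: float = 0.95

    def __post_init__(self):
        # Assumed ranges: reject values outside common sampling bounds.
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError(f"temperature out of range: {self.temperature}")
        if not 0.0 < self.top_p <= 1.0:
            raise ValueError(f"top_p out of range: {self.top_p}")


@dataclass(frozen=True)
class Agents:
    loop: str = "codeact"
    max_steps: int = 30

    def __post_init__(self):
        if self.loop not in ALLOWED_LOOPS:
            raise ValueError(f"unknown agent loop: {self.loop}")
        if self.max_steps < 1:
            raise ValueError("max_steps must be positive")


@dataclass(frozen=True)
class Spec:
    intelligence: Intelligence
    agents: Agents = field(default_factory=Agents)


spec = Spec(intelligence=Intelligence(model="Qwen/Qwen3.5-9B"))
spec_dict = asdict(spec)  # nested dict, ready to serialize
print(spec_dict["agents"])
```

Only two of the sections are modeled here to keep the sketch short; the remaining sections would follow the same pattern.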

Joint Evaluation

A Spec is not evaluated on accuracy alone. The OpenJarvis evaluation function measures five dimensions simultaneously:

Metric	Symbol	What It Measures	Unit
Accuracy	R_acc	Task success rate on the benchmark	% (0–100)
Energy	Ê	Joules consumed per query	J/query
Latency	L̂	Wall-clock time per query	seconds
Cost	Ĉ	Marginal API/compute cost per query	$/query
Power	P	Average watts during inference	W

This multi-dimensional evaluation is why OpenJarvis finds Pareto-optimal specs: you can trade accuracy for latency, or accuracy for cost, and the search explores these tradeoffs systematically.

Why TOML? Because Specs need to be diffable, mergeable, and human-readable. A Spec is version-controlled alongside your code. When the optimizer proposes an edit, it produces a diff: “change agents.loop from ‘react’ to ‘codeact’ and intelligence.temperature from 0.7 to 0.6.” You can review the diff, accept it, or reject it — just like a code review.
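The diff described above can be sketched as a comparison of two nested dictionaries, reporting each changed field by its dotted path. The function names `flatten` and `diff_specs` are hypothetical helpers written for this illustration, and the two specs are trimmed to the fields mentioned in the text.

```python
def flatten(spec, prefix=""):
    """Flatten a nested dict into {'section.key': value} pairs."""
    flat = {}
    for key, value in spec.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def diff_specs(old, new):
    """Return {path: (old_value, new_value)} for every field that differs."""
    a, b = flatten(old), flatten(new)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(a.keys() | b.keys())
        if a.get(path) != b.get(path)
    }


current = {
    "intelligence": {"temperature": 0.7},
    "agents": {"loop": "react", "max_steps": 30},
}
proposed = {
    "intelligence": {"temperature": 0.6},
    "agents": {"loop": "codeact", "max_steps": 30},
}

changes = diff_specs(current, proposed)
for path, (before, after) in changes.items():
    print(f"{path}: {before!r} -> {after!r}")
```

Unchanged fields such as `agents.max_steps` do not appear in the result, which keeps a proposed edit small enough to review by eye.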

What makes the Spec abstraction more useful than just a config file?

• It is written in TOML instead of JSON or YAML
• It requires less disk space than other formats
• It is typed, evaluable across multiple dimensions (accuracy, latency, cost, energy), diffable, and optimizable, meaning the optimizer can search the space of Specs systematically

Chapter 4: LLM-Guided Spec Search

Now the payoff. We have a Spec that decomposes the system into primitives. We have a multi-dimensional evaluation function. The question is: how do we find the right Spec — the one where the local 9B model actually works?

The answer is Algorithm 1: LLM-Guided Spec Search. It is a loop where a cloud frontier model (the “proposer”) reads execution traces from the current Spec, identifies failure clusters, proposes coordinated edits across all four runtime primitives, and gates those edits to ensure they help without causing regression.

The Loop

Step 1: Evaluate

Run current Spec S on training examples. Collect execution traces (reasoning, tool calls, errors, outputs) and scores for each example.

↓

Step 2: Cluster Failures

Group failing traces by failure mode. Example clusters: “tool misuse errors”, “reasoning loops”, “wrong output format”, “context overflow”. Each cluster c_i is a coherent group of failures sharing a root cause.

↓

Step 3: Propose Edits

The cloud LLM reads the traces in each cluster and proposes a coordinated edit targeting that cluster. The edit can span ANY combination of the 4 runtime primitives.

↓

Step 4: Gate

Accept the edit ONLY if: (a) accuracy on the target cluster improves, AND (b) accuracy on all other clusters does not regress by more than ε=1%. This prevents fixing one thing from breaking another.

↻ repeat until stagnation (k=5 rounds with no improvement) or budget exhausted
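The four steps above can be sketched as a small control loop. This is a toy under stated assumptions, not the paper's Algorithm 1: the evaluator, proposer, and gate are passed in as plain functions, the "spec" is a one-field dict, the toy accuracies are invented, and always targeting the weakest cluster is a simplification chosen for the sketch.

```python
def spec_search(spec, evaluate, propose, accept, patience=5, budget=50):
    """Evaluate -> pick a failure cluster -> propose -> gate, repeated.

    evaluate(spec)           -> {cluster_name: accuracy in [0, 1]}
    propose(spec, cluster)   -> candidate spec targeting that cluster
    accept(old, new, target) -> True if the gate passes
    """
    scores = evaluate(spec)
    stale = 0  # consecutive rounds without an accepted edit
    for _ in range(budget):
        if stale >= patience:
            break
        target = min(scores, key=scores.get)  # assumption: weakest cluster first
        candidate = propose(spec, target)
        new_scores = evaluate(candidate)
        if accept(scores, new_scores, target):
            spec, scores, stale = candidate, new_scores, 0
        else:
            stale += 1
    return spec, scores


# Toy stand-ins: accuracy on the looping cluster depends only on max_steps.
def toy_evaluate(spec):
    loops = 0.9 if spec["max_steps"] <= 30 else 0.4
    return {"reasoning_loops": loops, "tool_misuse": 0.8}


def toy_propose(spec, cluster):
    if cluster == "reasoning_loops":
        return {**spec, "max_steps": 30}
    return dict(spec)  # no idea for this cluster: propose no change


def toy_accept(old, new, target, eps=0.01):
    improved = new[target] > old[target]
    stable = all(new[c] >= old[c] - eps for c in old if c != target)
    return improved and stable


best, scores = spec_search({"max_steps": 50}, toy_evaluate, toy_propose, toy_accept)
print(best, scores)
```

In this run the first proposal is accepted, the next five proposals change nothing, and the loop stops on stagnation, mirroring the k=5 rule.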

Why coordinated edits matter: A failure like “the agent calls tools in the wrong order” might need changes in THREE primitives simultaneously: (1) Agent prompt rewrite to enforce sequential tool calls, (2) Engine context length increase so the model sees more history, (3) Tool description clarification so the model knows what each tool expects. Single-primitive search would try each alone and fail. Multi-primitive search finds the combination.

Concrete Example

Suppose 12 out of 40 training examples fail because the agent enters an infinite reasoning loop. The cloud model reads these 12 traces and proposes:

Agent edit: Add to system prompt: “If you have repeated the same action 3 times, stop and submit your current best answer.”
Intelligence edit: Lower temperature from 0.8 to 0.6 (reduces the randomness that triggers loops).
Agent edit: Reduce max_steps from 50 to 30 (a hard cutoff on loops; max_steps belongs to the Agents primitive in the Spec).

The gate evaluates: 9 of the 12 looping examples now succeed (target cluster improves). The other 28 examples still pass at the same rate (no regression beyond ε). Edit accepted.

LLM-Guided Spec Search

Figure: the optimization loop, step by step. Failure clusters are identified, edits are proposed across primitives, and the gate accepts or rejects each change. The regression threshold ε (default 1.0%) affects convergence.

Why does the gate require that non-target clusters do not regress by more than ε?

• To prevent an edit that fixes one failure mode from breaking examples that were previously working, ensuring monotonic overall improvement
• To reduce the total number of API calls to the cloud model
• To ensure the optimizer converges to the global optimum

Chapter 5: The Gate & Reward

The gate is the quality control mechanism of Spec Search. Without it, the optimizer would oscillate, fixing one failure mode while breaking another, over and over. The gate ensures near-monotonic improvement: each accepted edit makes real progress on its target cluster and leaves every other cluster at least as good as before, up to the tolerance ε.

The Gate Formula

Let S be the current Spec and S′ be the proposed Spec (after applying the edit). Let c be the target cluster and c′ be any non-target cluster. Let G_c(S) be the accuracy of Spec S on cluster c. The gate accepts edit e if and only if:

G_c(S′) > G_c(S) AND G_c′(S′) ≥ G_c′(S) − ε ∀ c′ ≠ c

In words: the target cluster must strictly improve, and every other cluster must not regress by more than ε. The default ε is 1%, meaning the optimizer tolerates at most 1 percentage point of regression on any non-target cluster.

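Why not ε = 0? Because of noise. If you have 40 training examples and 10 of them fall in cluster c′, one flipped example changes that cluster’s accuracy by 10 percentage points. Requiring zero regression would reject nearly every edit due to random variance. ε = 1 pp provides slack for noise while preventing real degradation. Note the granularity, though: on a 10-example cluster a single flip already exceeds ε, so the slack only takes effect once a cluster has roughly 100 examples or more, where one flip moves its accuracy by at most 1 pp.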
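The gate condition can be written as a short predicate over per-cluster accuracies. The sketch below is an illustration, not the paper's code: the function name `gate` and the cluster labels are hypothetical, and the 24-of-28 pass rate for the non-target examples is an assumed figure (the looping-agent example only says that rate is unchanged).

```python
def gate(before, after, target, eps=0.01):
    """Accept iff the target cluster strictly improves and no other
    cluster drops by more than eps. Accuracies are fractions in [0, 1]."""
    if after[target] <= before[target]:
        return False
    return all(after[c] >= before[c] - eps for c in before if c != target)


# Looping-agent example: 12 failing traces in the target cluster, 28 others.
before = {"reasoning_loops": 0 / 12, "other": 24 / 28}
after_ok = {"reasoning_loops": 9 / 12, "other": 24 / 28}
after_one_flip = {"reasoning_loops": 9 / 12, "other": 23 / 28}

print(gate(before, after_ok, "reasoning_loops"))        # True
print(gate(before, after_one_flip, "reasoning_loops"))  # False
```

The second call is rejected because one flipped example in a 28-example cluster is a drop of about 3.6 pp, well above ε = 1 pp. Cluster size therefore determines how much real slack a given ε provides.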

Composite Reward for Intelligence Edits

When the edit involves changing the model (Intelligence primitive), the gate needs a richer reward signal than just accuracy. Running a higher-precision quantization costs more energy and latency. The composite reward balances these:

R(q, y) = α · R_acc(q, y) − β · Ê(q, y) − γ · L̂(q, y) − δ · Ĉ(q, y)

Where:

R_acc = accuracy (did the query succeed?): 0 or 1 per query; averaged over a cluster it becomes a success rate in [0,1]
Ê = normalized energy (joules per query, scaled to [0,1])
L̂ = normalized latency (seconds per query, scaled to [0,1])
Ĉ = normalized cost (dollars per query, scaled to [0,1])
(α, β, γ, δ) = (0.5, 0.1, 0.1, 0.3) — accuracy matters most, cost second, energy and latency tied third

Worked Example

Suppose the current Intelligence spec uses INT4 quantization and the optimizer proposes switching to INT8. Here is the gate check for one cluster:

Metric	INT4 (current)	INT8 (proposed)
R_acc (accuracy)	0.70	0.82
Ê (energy, normalized)	0.30	0.45
L̂ (latency, normalized)	0.25	0.35
Ĉ (cost, normalized)	0.20	0.20

Current reward:

R = 0.5(0.70) − 0.1(0.30) − 0.1(0.25) − 0.3(0.20) = 0.35 − 0.03 − 0.025 − 0.06 = 0.235

Proposed reward:

R′ = 0.5(0.82) − 0.1(0.45) − 0.1(0.35) − 0.3(0.20) = 0.41 − 0.045 − 0.035 − 0.06 = 0.270

R′ = 0.270 > R = 0.235. Target cluster improved. If non-target clusters also pass the ε check, this edit is accepted.

Notice the tradeoff baked into the weights. α = 0.5 means accuracy is dominant but not absolute. If INT8 gave only a 2 pp accuracy improvement but doubled energy and latency, the composite reward would decrease (from 0.235 to 0.190) and the edit would be rejected. The reward function encodes the user’s preference: “I want a system that works well, is cheap to run, and neither drains my battery nor keeps me waiting, roughly in that order.”
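The worked example and the counter-case can be reproduced in a few lines. This is a direct transcription of the reward formula with the stated weights; the function and variable names are chosen for this sketch, and the counter-case inputs (accuracy 0.72, energy 0.60, latency 0.50) are derived from the "2 pp gain, doubled energy and latency" scenario.

```python
WEIGHTS = {"alpha": 0.5, "beta": 0.1, "gamma": 0.1, "delta": 0.3}


def composite_reward(acc, energy, latency, cost, w=WEIGHTS):
    """R = alpha*acc - beta*energy - gamma*latency - delta*cost.
    Energy, latency, and cost are assumed already normalized to [0, 1]."""
    return (w["alpha"] * acc
            - w["beta"] * energy
            - w["gamma"] * latency
            - w["delta"] * cost)


r_int4 = composite_reward(acc=0.70, energy=0.30, latency=0.25, cost=0.20)
r_int8 = composite_reward(acc=0.82, energy=0.45, latency=0.35, cost=0.20)
# Counter-case: +2 pp accuracy, but energy and latency doubled.
r_weak = composite_reward(acc=0.72, energy=0.60, latency=0.50, cost=0.20)

print(round(r_int4, 3), round(r_int8, 3), round(r_weak, 3))
print("INT8 accepted on reward:", r_int8 > r_int4)
print("weak edit accepted on reward:", r_weak > r_int4)
```

Changing the weights is the lever for expressing a different preference: raising `beta`, for instance, would make the optimizer more reluctant to trade battery life for accuracy.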

In the composite reward R = α·R_acc − β·Ê − γ·L̂ − δ·Ĉ with (α,β,γ,δ) = (0.5,0.1,0.1,0.3), what does the 0.3 weight on cost Ĉ imply?

• Cost is the most important factor in the reward
• Cost is the second most important factor after accuracy: a high-cost edit needs a substantial accuracy gain to be accepted
• Cost is penalized less than energy and latency

Chapter 6: The Portability Experiment

The cleanest test of the Spec abstraction is a controlled substitution experiment. Take the same model (Qwen3.5-9B), the same benchmarks, and two different stacks: one designed as a monolith (OpenClaw, Hermes Agent), and one designed with the Spec abstraction (OpenJarvis). How much does each drop when you swap cloud for local?

Table 1 Results

The paper reports results on 8 benchmarks. Here are the four with the most dramatic differences:

Benchmark	Cloud (Opus 4.6)	OpenClaw + Qwen	Δ vs. cloud	OpenJarvis + Qwen	Δ vs. cloud
SWE-Bench	79.4%	45.7%	−33.7 pp	71.8%	−7.6 pp
LiveCodeBench	73.2%	47.8%	−25.4 pp	83.0%	+9.8 pp
PinchBench	88.5%	49.3%	−39.2 pp	100.0%	+11.5 pp
LiveResearchBench	82.7%	55.4%	−27.3 pp	91.0%	+8.3 pp

Read these numbers carefully. On LiveCodeBench, PinchBench, and LiveResearchBench, the local 9B model with an OpenJarvis spec exceeds the cloud model. A far smaller model, running on your laptop, outperforms Claude Opus 4.6 on these benchmarks because the surrounding system was configured for it. This is the strongest evidence that the substitution catastrophe is architectural, not capability-bound.
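The deltas in the table can be recomputed from the three accuracy columns, which is a quick way to check the figures and to see how much of the drop the spec recovers. The snippet only restates the four rows shown above; the variable names are chosen for this sketch.

```python
# (cloud, monolithic stack + Qwen, OpenJarvis + Qwen), accuracy in %
results = {
    "SWE-Bench": (79.4, 45.7, 71.8),
    "LiveCodeBench": (73.2, 47.8, 83.0),
    "PinchBench": (88.5, 49.3, 100.0),
    "LiveResearchBench": (82.7, 55.4, 91.0),
}

deltas = {}
for name, (cloud, monolith, openjarvis) in results.items():
    deltas[name] = {
        "monolith_vs_cloud": round(monolith - cloud, 1),
        "openjarvis_vs_cloud": round(openjarvis - cloud, 1),
        "recovered_pp": round(openjarvis - monolith, 1),
    }

beats_cloud = [n for n, d in deltas.items() if d["openjarvis_vs_cloud"] > 0]

for name, d in deltas.items():
    print(name, d)
print("local spec beats cloud on:", beats_cloud)
```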

What Changed?

The monolithic stack and the OpenJarvis stack use the exact same model weights (Qwen3.5-9B). The only difference is everything else:

OpenClaw + Qwen: Same prompts as OpenClaw-with-Opus. Same agent loop. Same tool descriptions. Same engine settings. Zero adaptation.
OpenJarvis + Qwen: Spec-searched prompts rewritten for the 9B model’s strengths. Agent loop switched to CodeAct (better for Qwen). Tool descriptions simplified and made more explicit. Generation and engine parameters tuned (temperature, top-k, context length). All changes found automatically by the Spec Search algorithm.

Portability Comparison

Figure: side-by-side comparison of monolithic stacks vs. OpenJarvis on the same local model, per benchmark.

On LiveCodeBench, PinchBench, and LiveResearchBench, the OpenJarvis Qwen3.5-9B spec outperforms Claude Opus 4.6. How is this possible?

• Qwen3.5-9B is actually a better model than Claude Opus 4.6
• The spec search optimized the entire surrounding system (prompts, agent loop, tools, engine) specifically for these benchmarks with this model, finding configurations the cloud stack did not explore
• The benchmarks are easier for smaller models due to shorter context windows

Chapter 7: The Pareto Frontier

Performance is not a single number. A system that gets 95% accuracy but costs $0.50 per query is very different from one that gets 88% accuracy at $0.0006 per query. OpenJarvis explores this tradeoff space explicitly, producing a Pareto frontier — the set of specs where you cannot improve one metric without degrading another.

The Numbers

System	Avg Accuracy	Cost/Query	Latency	Where It Runs
Cloud (Claude Opus 4.6)	~80%	$0.48	~12s	Cloud
Cloud (GPT-4.1)	~75%	$0.32	~8s	Cloud
OpenJarvis Spec (Qwen 9B)	~77%	$0.0006	~3s	Your laptop

800× cost reduction. The cloud system costs $0.48 per query. The OpenJarvis on-device spec costs $0.0006 (electricity only). That is a factor of 800. For a user making 100 queries per day, that is $48/day vs. $0.06/day. Over a year: $17,520 vs. $22.

The latency improvement is equally dramatic: 4× faster. Cloud queries take ~12 seconds (network round-trip + cloud queue + inference). Local queries take ~3 seconds (just inference, no network). For interactive use, this is the difference between “waiting for a response” and “instant.”
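The cost and latency ratios follow from simple arithmetic on the table values. The snippet below redoes that arithmetic, assuming 100 queries per day and a 365-day year as in the text.

```python
cloud_cost, local_cost = 0.48, 0.0006      # dollars per query
cloud_latency, local_latency = 12.0, 3.0   # seconds per query (approximate)
queries_per_day = 100
days_per_year = 365

cost_ratio = cloud_cost / local_cost
latency_ratio = cloud_latency / local_latency
cloud_per_year = cloud_cost * queries_per_day * days_per_year
local_per_year = local_cost * queries_per_day * days_per_year

print(f"cost ratio:    {cost_ratio:.0f}x")
print(f"latency ratio: {latency_ratio:.0f}x")
print(f"cloud per year: ${cloud_per_year:,.0f}")
print(f"local per year: ${local_per_year:,.2f}")
```

These are marginal per-query figures only; the one-time cost of running the spec search is not included in either column.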

The Pareto Structure

The paper reconstructs the accuracy-vs-cost Pareto frontier across all evaluated systems. The key insight: OpenJarvis specs cluster in the bottom-right corner — high accuracy, low cost. Cloud systems cluster in the top-right: high accuracy but extremely high cost. The Pareto frontier shows that once you cross a threshold of optimization effort, the on-device systems dominate the cost axis.
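Pareto dominance on two axes is easy to state in code: one system dominates another if it is at least as accurate and at least as cheap, and strictly better on one of the two. The sketch below applies that definition to the three rows of the table above, using the approximate average accuracies; the function name `pareto_frontier` is a hypothetical helper, and the paper's own frontier covers more systems than these three.

```python
def pareto_frontier(systems):
    """Return the names of systems that no other system dominates.

    systems: {name: (accuracy, cost_per_query)}
    Higher accuracy is better; lower cost is better.
    """
    def dominates(a, b):
        at_least_as_good = a[0] >= b[0] and a[1] <= b[1]
        strictly_better = a[0] > b[0] or a[1] < b[1]
        return at_least_as_good and strictly_better

    return [
        name for name, point in systems.items()
        if not any(dominates(other, point)
                   for other_name, other in systems.items()
                   if other_name != name)
    ]


systems = {
    "Cloud (Claude Opus 4.6)": (80, 0.48),
    "Cloud (GPT-4.1)": (75, 0.32),
    "OpenJarvis Spec (Qwen 9B)": (77, 0.0006),
}

frontier = pareto_frontier(systems)
print(frontier)
```

With these three points, GPT-4.1 is dominated by the OpenJarvis spec (lower accuracy at higher cost), while Opus stays on the frontier because it is still the most accurate.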

Accuracy vs. Cost Pareto Frontier

Figure: accuracy vs. cost for all evaluated systems. The dashed line shows the Pareto frontier. Systems with lower accuracy and higher cost than a frontier point are dominated (worse on both axes). OpenJarvis specs cluster in the sweet spot: near-cloud accuracy at local-device cost.

The average accuracy gap between the best cloud system and the best OpenJarvis spec is just 3.2 pp across all 8 benchmarks. On 4 benchmarks, the gap is negative — the local spec wins. This 3.2 pp gap is the price of full data privacy, 800× cost reduction, and 4× latency reduction. For most personal AI use cases, that is a bargain.

What does the 800× cost reduction represent?

• The cost of training the local model vs. the cloud model
• The one-time cost of running the spec search algorithm
• The per-query marginal cost: ~$0.48 for cloud API calls vs. ~$0.0006 for on-device electricity, because inference runs entirely locally after optimization

Chapter 8: Ablations & What the Optimizer Learns

The ablation study in Figure 8 of the paper answers two questions: (1) Does the proposer matter (LLM vs. evolutionary vs. random)? (2) Does the move space matter (1, 2, or 4 editable primitives)?

The 3×3 Grid

The authors cross three proposer strategies with three move-space sizes:

Proposer	1 Primitive	2 Primitives	4 Primitives
LLM Proposer	77.5%	87.0%	93.3%
Evolutionary	70.0%	79.5%	83.8%
Random	65.5%	75.0%	80.5%

(Numbers are averaged across benchmarks for illustration; actual per-benchmark ranges are: LLM-4 = 83–100%, Evolutionary-4 = 73–94.5%, Random-4 = 69–92.5%.)

Both axes matter, but move space has a larger effect. Moving from 1 to 4 primitives adds 5.5–16.5 pp regardless of proposer (13.8–15.8 pp in the averaged grid above). Moving from the random to the LLM proposer adds ~10–14 pp regardless of move space (12.0–12.8 pp in the averaged grid). The two effects are roughly additive, which means they capture different sources of improvement.
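The two marginal effects can be read off the averaged grid directly. The snippet below computes them from the nine illustrative cells, so its ranges describe the averaged table rather than the per-benchmark ranges reported in the paper; the variable names are chosen for this sketch.

```python
# Average accuracy (%) by proposer and number of editable primitives.
grid = {
    "LLM":          {1: 77.5, 2: 87.0, 4: 93.3},
    "Evolutionary": {1: 70.0, 2: 79.5, 4: 83.8},
    "Random":       {1: 65.5, 2: 75.0, 4: 80.5},
}

# Gain from widening the move space (1 -> 4 primitives), per proposer.
move_space_gain = {
    proposer: round(row[4] - row[1], 1) for proposer, row in grid.items()
}

# Gain from a better proposer (Random -> LLM), per move-space size.
proposer_gain = {
    k: round(grid["LLM"][k] - grid["Random"][k], 1) for k in (1, 2, 4)
}

print("1 -> 4 primitives:", move_space_gain)
print("Random -> LLM:    ", proposer_gain)
```

The near-constant gain along each axis is what "roughly additive" means here: the benefit of a wider move space barely depends on which proposer is used, and vice versa.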

Why the LLM Proposer Wins

The evolutionary proposer mutates edits randomly (e.g., “change temperature from 0.7 to 0.6”) and selects the best. The LLM proposer reads the failure traces and proposes targeted fixes. It can say: “These 8 failures all happen when the agent tries to use the bash tool for file editing — the bash tool description is confusing. I will: (1) clarify the bash tool description, (2) add a file_editor tool, (3) update the system prompt to prefer file_editor for edits.”

This is a coordinated, multi-primitive edit informed by causal diagnosis. The evolutionary proposer would need to stumble upon this combination by chance.

What Primitives Get Edited?

The paper reports the distribution of accepted edits across primitives (Table 10), and the pattern varies dramatically by benchmark type:

Benchmark Type	Intelligence	Agent	Tools	Engine
Coding (SWE-Bench, LiveCode)	44%	28%	18%	10%
Customer Service (Tau2)	16%	46%	24%	14%
Research (LiveResearch)	22%	20%	42%	16%

Different tasks need different knobs. Coding tasks benefit most from Intelligence edits (quantization, generation params) because code requires precise token-by-token generation. Customer service benefits most from Agent edits (prompt rewriting, reasoning loop) because conversation flow matters more than raw generation quality. Research tasks benefit most from Tool edits (retrieval config, search tool descriptions) because finding the right information is the bottleneck.

This is why single-primitive optimization fails. If you only tune the model (Intelligence), you miss the 56–84% of accepted edits that land in the other three primitives. The spec search must touch all four primitives.
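The 56–84% figure is the share of accepted edits that fall outside the Intelligence primitive, and it can be recomputed from the table. The sketch below does that and also reports the dominant primitive per task type; it measures share of edits, not share of accuracy gain, which the table does not report.

```python
# Share of accepted edits (%) by primitive, per benchmark type.
edit_share = {
    "Coding":           {"Intelligence": 44, "Agent": 28, "Tools": 18, "Engine": 10},
    "Customer Service": {"Intelligence": 16, "Agent": 46, "Tools": 24, "Engine": 14},
    "Research":         {"Intelligence": 22, "Agent": 20, "Tools": 42, "Engine": 16},
}

summary = {}
for task, shares in edit_share.items():
    assert sum(shares.values()) == 100  # each row is a full distribution
    summary[task] = {
        "dominant": max(shares, key=shares.get),
        "outside_intelligence": 100 - shares["Intelligence"],
    }

for task, row in summary.items():
    print(task, row)
```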

Optimization Cost

LLM-guided spec search is 7–11× cheaper than single-primitive baselines at the same accuracy level. The reason: the LLM proposer makes informed edits, so fewer iterations are wasted. Random search at 4 primitives needs ~120 iterations to converge; LLM search needs ~15–25.

Ablation Grid: Proposer × Move Space

The grid shows average accuracy for each combination of proposer strategy and number of editable primitives. Brighter = higher accuracy. Both axes contribute — the best cell (LLM × 4 primitives) is brightest.

Table 10 shows that coding benchmarks have 44% Intelligence edits while customer service has 46% Agent edits. What does this tell us about single-primitive optimization?

• Single-primitive optimization misses the majority of useful edits for any given task type: each task type needs a different mix of primitives, and only multi-primitive search captures this
• Single-primitive optimization is fine if you pick the right primitive for each task
• Intelligence edits are always the most important because the model is the core of the system

Chapter 9: Connections

Related Work

System	What It Optimizes	How OpenJarvis Differs
GEPA	Prompts via reflective mutation	GEPA optimizes prompts only; OpenJarvis optimizes all 4 runtime primitives jointly
DSPy / MIPROv2	Prompts + few-shot examples	DSPy optimizes within a single primitive (prompts). OpenJarvis adds Engine, Tools, Intelligence to the search space
LoRA	Model weights via low-rank adapters	Weight updates fall within just one of the 4 runtime primitives (Intelligence). OpenJarvis finds that edits outside Intelligence (Agent, Tools, Engine) make up 56–84% of accepted edits
Minions	Local-cloud collaboration at inference	Minions requires cloud at inference time. OpenJarvis uses cloud only during optimization; inference is fully local
Archon	Multi-LLM ensemble architecture	Archon ensembles multiple models. OpenJarvis optimizes a single-model stack for on-device deployment

Limitations

Optimization cost: Spec search requires cloud API calls during training. The 7–11× efficiency gain over baselines helps, but the upfront cost is nonzero (typically $50–200 per benchmark).
Benchmark specificity: A Spec optimized for SWE-Bench may not transfer well to customer service tasks. Cross-benchmark generalization is an open problem.
Hardware coupling: The Engine primitive depends on available hardware. A Spec optimized for an RTX 4090 may not be optimal for an M4 Max MacBook. Re-optimization is needed per device class.
Dynamic tasks: Spec search optimizes for a static benchmark distribution. If the user’s task distribution shifts over time, the Spec may need periodic re-optimization.

Cheat Sheet

OpenJarvis in 30 seconds:
• Problem: Cloud → local model swap drops 25–39 pp because stacks are monolithically coupled to their cloud model
• Solution: Decompose into 5 typed primitives (Intelligence, Engine, Agents, Tools/Memory, Learning). Search across all 4 runtime primitives jointly.
• Method: LLM-guided spec search — cloud model reads failure traces, proposes multi-primitive edits, gate ensures monotonic improvement
• Result: Local 9B model matches/exceeds cloud on 4/8 benchmarks. 800× lower cost, 4× lower latency. 3.2 pp average gap.
• Key ablation: Both proposer (LLM vs random) and move space (1 vs 4 primitives) matter. Different task types need different primitive mixes.

What to Read Next

GEPA — Reflective prompt evolution, the closest technique to OpenJarvis’s LLM proposer
LoRA — Parameter-efficient fine-tuning, the weight-update component of the Intelligence primitive
ReAct — The reasoning-then-acting agent loop, one of the Agent primitive options
Archon — Multi-LLM architecture search, a related approach to system-level optimization
DPO — Direct Preference Optimization, an alternative to RL for model alignment

What is the fundamental difference between OpenJarvis and systems like DSPy or GEPA?

• OpenJarvis optimizes across all 4 runtime primitives (Intelligence, Engine, Agents, Tools) jointly, while DSPy/GEPA optimize within a single primitive (prompts only)
• OpenJarvis uses a larger language model for optimization
• OpenJarvis is open-source while DSPy/GEPA are proprietary

Personal AI, On Personal Devices