Saad-Falcon, Narayan, Manihani et al. — Stanford + Lambda Labs, 2026

Personal AI, On Personal Devices

Swapping a cloud model for a local 9B model drops accuracy by 25–39 pp. Not because the small model is dumb — because the entire stack was co-designed around the cloud model. OpenJarvis decomposes the stack into typed primitives and searches for a configuration where the 9B model matches or exceeds cloud performance — at 800× lower cost.

Prerequisites: LLM prompting basics + Intuition for agents/tool-use. That’s it.
10 chapters · 6 simulations · 0 assumed knowledge

Chapter 0: The Problem

You have a personal AI assistant running on your laptop. It helps you write code, search the web, manage files, answer questions. But it runs through the cloud — every query goes to Claude Opus 4.6 or GPT-4.1, costs money, leaks your data, and adds latency. You want to run the whole thing locally on your own hardware.

So you do the obvious thing: you swap the cloud model for a local one. You download Qwen3.5-9B, a capable 9-billion-parameter model that fits in your GPU’s VRAM. You point your existing AI stack at it — same prompts, same tools, same agent loop — just a different model endpoint.

And accuracy craters.

The substitution catastrophe: On OpenClaw (a coding agent stack), Claude Opus 4.6 scores 79.4% on SWE-Bench. Drop in Qwen3.5-9B with zero changes? 45.7%. That is a 33.7 percentage-point collapse. On Hermes Agent (a general assistant stack), the drop is 25–39 pp across eight benchmarks. The system doesn’t just degrade gracefully — it falls off a cliff.

Why so catastrophic? The prompts were written for a specific model. The agent loop assumes a certain reasoning depth. The tool descriptions expect a certain level of instruction-following. The system is a monolith — every piece was co-designed around the cloud model, and none of it adapts when you change the brain.

This is the substitution catastrophe: the gap between cloud and local is not primarily about model capability. It is about architectural coupling. A 9B model can do these tasks — if the surrounding system is reconfigured for it.

OpenJarvis proves this. By decomposing the stack into independently-tunable primitives and searching for the right configuration, a local 9B model matches or exceeds cloud performance on 4 out of 8 benchmarks, at 800× lower cost per query.

[Figure: The Substitution Catastrophe. Accuracy of different AI stacks when you swap the cloud model (Claude Opus 4.6) for a local model (Qwen3.5-9B). Blue bars = cloud. Orange bars = local drop-in. Green bars = local with OpenJarvis spec.]

Why does swapping a cloud model for a local model cause such a large accuracy drop in existing AI stacks?

Chapter 1: The Key Insight

Existing AI stacks treat the system as a black box. You have a model, some prompts, an agent loop, and some tools. They are all tangled together in code. Changing one piece requires rewriting others. There is no clean interface between components.

OpenJarvis’s insight is: decompose the stack into typed, independently-optimizable primitives. If each piece has a well-defined interface and can be changed without breaking the others, then you can search for the right combination — and that search can be automated.

Monolithic Stack (Before)
Model + prompts + agent loop + tools are entangled in code. Changing the model breaks everything. Only option: accept the accuracy drop or stay in the cloud.
↓ decompose ↓
OpenJarvis Stack (After)
Five typed primitives: Intelligence, Engine, Agents, Tools & Memory, Learning. Each has a schema. Each can be changed independently. The optimizer can search across all of them.

Think of it like a car. A monolithic stack is a car where the engine, transmission, suspension, and wheels are welded together. Want a different engine? Too bad — you need a new car. OpenJarvis gives you a car with standardized bolt patterns. Swap the engine, keep the transmission. Change the suspension, keep the wheels. Each primitive is a replaceable module with defined inputs and outputs.

The drop is architectural, not capability-bound. When the authors gave Qwen3.5-9B the right prompts, the right agent loop, the right tool descriptions, and the right inference parameters — all tuned jointly — it matched or exceeded the cloud model on half the benchmarks. The small model was never the bottleneck. The configuration was.

This is profound. It means the gap between cloud and local AI is not an insurmountable capability chasm. It is a configuration problem — and configuration problems can be solved by search.

But search over what space? How do you enumerate the knobs? That requires defining the primitives precisely — which brings us to the five primitives of OpenJarvis.

What is the core realization that makes OpenJarvis possible?

Chapter 2: The Five Primitives

OpenJarvis decomposes every personal AI system into exactly five primitives. This is not an arbitrary choice — it is the minimal set that captures the full optimization surface. Miss one, and you leave performance on the table. Add more, and you introduce unnecessary complexity.

1. Intelligence

The model itself. Its architecture, its weights, its generation parameters (temperature, top-p, top-k, repetition penalty), and its quantization format (FP16, INT8, INT4, GGUF). This is what most people think of when they think of AI optimization — but as we’ll see, it accounts for only 16–44% of the edits that actually improve performance.

2. Engine

The inference runtime. This is how the model runs, not what it is. Options include Ollama, vLLM, llama.cpp, and SGLang. Engine parameters include batch size, KV cache strategy (paged vs. contiguous), speculative decoding, and context length. A model can behave very differently depending on its engine — llama.cpp with Q4_K_M quantization produces different outputs than vLLM with AWQ.

3. Agents

The reasoning loop. This is how the model decides what to do at each step. Options include ReAct (reason-then-act), CodeAct (generate and execute code as the action), Plan-then-Execute, and simple single-turn prompting. Agent parameters include the system prompt, the maximum number of reasoning steps, whether to use chain-of-thought, and the tool-use policy (parallel vs. sequential tool calls).

4. Tools & Memory

The external interfaces. What tools does the model have access to? How are they described in the prompt? What retrieval mechanisms are available (vector DB, keyword search, hybrid)? What persistent state does the system maintain about the user? Tool descriptions are surprisingly critical — a vague description can cause a small model to misuse a tool that a large model would use correctly.

5. Learning

The meta-primitive. Learning is the optimizer that updates the other four primitives from execution traces. It is what makes the system self-improving. Learning includes: which cloud model serves as the optimizer, what kind of edits it proposes, the gating criteria for accepting edits, and the training data selection strategy.

Learning is special: it is the only primitive that does not run at inference time. Once optimization is done, the Learning primitive’s job is finished. The resulting spec (Intelligence + Engine + Agents + Tools/Memory) runs entirely on-device. Learning exists only during the optimization phase — it is the compiler, not the runtime.
[Figure: The Five Primitives. Each primitive is shown with its editable fields and example values. The Learning primitive (bottom) optimizes the other four.]

Why this particular decomposition? Because it covers the four dimensions of variation the authors observed in real AI stacks:

| Dimension | Primitive | Example Edit |
|---|---|---|
| What the model knows | Intelligence | Switch from INT4 to INT8 quantization |
| How the model runs | Engine | Switch from Ollama to vLLM, enable speculative decoding |
| How the model reasons | Agents | Switch from ReAct to CodeAct, rewrite system prompt |
| What the model accesses | Tools & Memory | Add a code sandbox tool, refine search tool description |

Why is the Learning primitive NOT included in the final on-device spec?

Chapter 3: The Spec Abstraction

A Spec is a typed configuration object that fully specifies a personal AI system. It is TOML-serializable, shareable, versionable, and — crucially — evaluable. Given a Spec and a benchmark, you can measure accuracy, latency, energy, and cost per query in a single evaluation pass.

Here is what an actual Spec looks like:

```toml
# OpenJarvis Spec: Qwen3.5-9B for SWE-Bench

[intelligence]
model        = "Qwen/Qwen3.5-9B"
quantization = "awq-int4"
temperature  = 0.6
top_p        = 0.95
top_k        = 40
max_tokens   = 4096
rep_penalty  = 1.05

[engine]
runtime     = "vllm"
batch_size  = 1
kv_cache    = "paged"
context_len = 32768
spec_decode = false

[agents]
loop        = "codeact"
max_steps   = 30
cot         = true
tool_policy = "sequential"
system_prompt = """
You are a software engineer. Read the issue.
Write a patch that fixes the bug.
Always test your code before submitting.
Use the bash tool to navigate the repo.
"""

[tools]
enabled     = ["bash", "file_editor", "search"]
sandbox     = true
timeout_s   = 120

[memory]
retrieval      = "none"
context_window = "sliding"
```
Think of the Spec as a recipe. Sharing a Spec is like sharing a recipe for a dish. The ingredients (model weights) are the same for everyone who downloads Qwen3.5-9B. The recipe (the Spec) tells you how to combine them — what temperature to cook at, what order to add ingredients, what tools to use. Two different recipes for the same ingredients can produce wildly different dishes.
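To make "typed" concrete, here is a minimal Python sketch that mirrors a few sections of the example Spec as dataclasses. The class names, the `validate` helper, the allowed loop names, and the 0 to 2 temperature range are illustrative assumptions, not the OpenJarvis API; only the field names and default values are taken from the example Spec.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Intelligence:
    model: str = "Qwen/Qwen3.5-9B"
    quantization: str = "awq-int4"
    temperature: float = 0.6
    top_p: float = 0.95


@dataclass
class Engine:
    runtime: str = "vllm"
    kv_cache: str = "paged"
    context_len: int = 32768


@dataclass
class Agents:
    loop: str = "codeact"
    max_steps: int = 30
    tool_policy: str = "sequential"


@dataclass
class Tools:
    enabled: list = field(default_factory=lambda: ["bash", "file_editor", "search"])
    timeout_s: int = 120


@dataclass
class Spec:
    intelligence: Intelligence = field(default_factory=Intelligence)
    engine: Engine = field(default_factory=Engine)
    agents: Agents = field(default_factory=Agents)
    tools: Tools = field(default_factory=Tools)


# Assumed set of loop names, chosen for this sketch only.
KNOWN_LOOPS = {"react", "codeact", "plan_execute", "single_turn"}


def validate(spec):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if not 0.0 <= spec.intelligence.temperature <= 2.0:  # assumed valid range
        problems.append("intelligence.temperature out of range")
    if spec.agents.loop not in KNOWN_LOOPS:
        problems.append("agents.loop is not a known loop type")
    if spec.agents.max_steps < 1:
        problems.append("agents.max_steps must be positive")
    return problems


spec = Spec()
print(validate(spec))
print(asdict(spec)["engine"])
```

Because every field has a declared type and a known place in the structure, an edit can be checked before anything is evaluated, which is the property the search loop relies on.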

Joint Evaluation

A Spec is not evaluated on accuracy alone. The OpenJarvis evaluation function measures five dimensions simultaneously:

| Metric | Symbol | What It Measures | Unit |
|---|---|---|---|
| Accuracy | R_acc | Task success rate on the benchmark | % (0–100) |
| Energy | Ê | Joules consumed per query | J/query |
| Latency | L̂ | Wall-clock time per query | seconds |
| Cost | Ĉ | Marginal API/compute cost per query | $/query |
| Power | P | Average watts during inference | W |

This multi-dimensional evaluation is why OpenJarvis finds Pareto-optimal specs: you can trade accuracy for latency, or accuracy for cost, and the search explores these tradeoffs systematically.

Why TOML? Because Specs need to be diffable, mergeable, and human-readable. A Spec is version-controlled alongside your code. When the optimizer proposes an edit, it produces a diff: “change agents.loop from ‘react’ to ‘codeact’ and intelligence.temperature from 0.7 to 0.6.” You can review the diff, accept it, or reject it — just like a code review.
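The diff described above can be sketched as a comparison of two nested dictionaries, which is what a parsed TOML file gives you. The helper names `flatten` and `diff_spec` are hypothetical, written for this illustration; the two example values come from the edit quoted in the paragraph.

```python
def flatten(spec, prefix=""):
    """Flatten a nested dict into {"section.key": value} pairs."""
    flat = {}
    for key, value in spec.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=path + "."))
        else:
            flat[path] = value
    return flat


def diff_spec(old, new):
    """Return {path: (old_value, new_value)} for every field that changed."""
    a, b = flatten(old), flatten(new)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(set(a) | set(b))
        if a.get(path) != b.get(path)
    }


current = {
    "agents": {"loop": "react", "max_steps": 30},
    "intelligence": {"temperature": 0.7},
}
proposed = {
    "agents": {"loop": "codeact", "max_steps": 30},
    "intelligence": {"temperature": 0.6},
}

for path, (before, after) in diff_spec(current, proposed).items():
    print(f"{path}: {before!r} -> {after!r}")
```

Unchanged fields such as `agents.max_steps` do not appear in the result, so a reviewer sees only what the optimizer actually touched.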
What makes the Spec abstraction more useful than just a config file?

Chapter 4: LLM-Guided Spec Search

Now the payoff. We have a Spec that decomposes the system into primitives. We have a multi-dimensional evaluation function. The question is: how do we find the right Spec — the one where the local 9B model actually works?

The answer is Algorithm 1: LLM-Guided Spec Search. It is a loop where a cloud frontier model (the “proposer”) reads execution traces from the current Spec, identifies failure clusters, proposes coordinated edits across all four runtime primitives, and gates those edits to ensure they help without causing regression.

The Loop

Step 1: Evaluate
Run current Spec S on training examples. Collect execution traces (reasoning, tool calls, errors, outputs) and scores for each example.
Step 2: Cluster Failures
Group failing traces by failure mode. Example clusters: “tool misuse errors”, “reasoning loops”, “wrong output format”, “context overflow”. Each cluster c_i is a coherent group of failures sharing a root cause.
Step 3: Propose Edits
The cloud LLM reads the traces in each cluster and proposes a coordinated edit targeting that cluster. The edit can span ANY combination of the 4 runtime primitives.
Step 4: Gate
Accept the edit ONLY if: (a) accuracy on the target cluster improves, AND (b) accuracy on all other clusters does not regress by more than ε=1%. This prevents fixing one thing from breaking another.
↻ repeat until stagnation (k=5 rounds with no improvement) or budget exhausted
Why coordinated edits matter: A failure like “the agent calls tools in the wrong order” might need changes in THREE primitives simultaneously: (1) Agent prompt rewrite to enforce sequential tool calls, (2) Engine context length increase so the model sees more history, (3) Tool description clarification so the model knows what each tool expects. Single-primitive search would try each alone and fail. Multi-primitive search finds the combination.
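The four steps can be sketched as a short Python loop. This is a skeleton under stated assumptions, not the paper's Algorithm 1: `evaluate` and `propose` are injected callables (in the source they are a benchmark run and a cloud LLM), clustering is simplified to "pick the lowest-scoring cluster", and the toy stand-ins at the bottom exist only so the sketch runs.

```python
def passes_gate(before, after, target, eps=0.01):
    """Step 4: target cluster strictly improves; no other cluster drops by more than eps."""
    if after[target] <= before[target]:
        return False
    return all(after[c] >= before[c] - eps for c in before if c != target)


def spec_search(spec, evaluate, propose, eps=0.01, patience=5, budget=50):
    """Evaluate, choose a failure cluster, propose an edit, gate it, repeat."""
    scores = evaluate(spec)                        # Step 1: per-cluster accuracy
    stale = 0
    for _ in range(budget):
        if stale >= patience:                      # stagnation: k rounds without progress
            break
        target = min(scores, key=scores.get)       # Step 2 (simplified): worst cluster
        candidate = propose(spec, target)          # Step 3: coordinated edit
        new_scores = evaluate(candidate)
        if passes_gate(scores, new_scores, target, eps):
            spec, scores, stale = candidate, new_scores, 0
        else:
            stale += 1
    return spec, scores


# Toy stand-ins (hypothetical) so the loop can be run end to end.
def toy_evaluate(spec):
    return {
        "reasoning_loops": 0.9 if spec["max_steps"] <= 20 else 0.4,
        "tool_misuse": 0.8 if spec["clear_tool_docs"] else 0.5,
    }


def toy_propose(spec, target):
    if target == "reasoning_loops":
        return {**spec, "max_steps": 20}
    return {**spec, "clear_tool_docs": True}


start = {"max_steps": 30, "clear_tool_docs": False}
final_spec, final_scores = spec_search(start, toy_evaluate, toy_propose)
print(final_spec, final_scores)
```

In the toy run, two edits are accepted (one per cluster) and the loop then stops after five consecutive proposals that fail the gate.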

Concrete Example

Suppose 12 out of 40 training examples fail because the agent enters an infinite reasoning loop. The cloud model reads these 12 traces and proposes a coordinated edit targeting the looping cluster.

The gate evaluates: 9 of the 12 looping examples now succeed (target cluster improves). The other 28 examples still pass at the same rate (no regression beyond ε). Edit accepted.

[Interactive figure: LLM-Guided Spec Search. Steps through the optimization loop: failure clusters are identified, edits are proposed across primitives, and the gate accepts or rejects changes. An ε control (default 1.0%) shows how the regression threshold affects convergence.]

Why does the gate require that non-target clusters do not regress by more than ε?

Chapter 5: The Gate & Reward

The gate is the quality control mechanism of Spec Search. Without it, the optimizer would oscillate — fixing one failure mode while breaking another, over and over. The gate ensures monotonic improvement: each accepted edit makes the system at least as good as before, with real progress on its target cluster.

The Gate Formula

Let S be the current Spec and S′ be the proposed Spec (after applying the edit e). Let c be the target cluster and c′ be any non-target cluster. Let G_c(S) be the accuracy of Spec S on cluster c. The gate accepts edit e if and only if:

$$G_c(S') > G_c(S) \quad \text{and} \quad G_{c'}(S') \ge G_{c'}(S) - \varepsilon \quad \forall\, c' \ne c$$

In words: the target cluster must strictly improve, and every other cluster must not regress by more than ε. The default ε is 1%, meaning the optimizer tolerates at most 1 percentage point of regression on any non-target cluster.

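Why not ε = 0? Because of noise. If you have 40 training examples and 10 of them fall in cluster c′, one flipped example changes that cluster's accuracy by 10 percentage points. Requiring zero regression would reject nearly every edit due to random variance. ε = 1% provides some slack for noise while preventing real degradation, although a single flip only fits inside that slack once a cluster holds about 100 examples; on a cluster of 10, one flip still trips the gate.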
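A quick check of the granularity argument: the sketch below flips exactly one example in clusters of different sizes and asks whether the resulting drop stays within ε. The cluster sizes and the helper names are chosen for illustration; only ε = 1% comes from the text.

```python
def accuracy(passed, total):
    return passed / total


def within_tolerance(before, after, eps):
    """Non-target half of the gate: allow a drop of at most eps.

    The 1e-12 term only absorbs floating-point rounding.
    """
    return after >= before - eps - 1e-12


eps = 0.01
for size in (10, 40, 100, 200):
    before = accuracy(size, size)        # every example in the cluster passes
    after = accuracy(size - 1, size)     # exactly one example flips to a failure
    drop = round(before - after, 4)
    print(f"cluster of {size:3d}: drop {drop:.4f}, within eps: {within_tolerance(before, after, eps)}")
```

With ε = 0 every one of these single flips would count as a regression, which is the case the text argues against; with ε = 1% the small clusters remain sensitive to a single flip.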

Composite Reward for Intelligence Edits

When the edit involves changing the model (Intelligence primitive), the gate needs a richer reward signal than just accuracy. A higher-precision quantization costs more energy and latency. The composite reward balances these:

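$$R(q, y) = \alpha \cdot R_{\text{acc}}(q, y) - \beta \cdot \hat{E}(q, y) - \gamma \cdot \hat{L}(q, y) - \delta \cdot \hat{C}(q, y)$$

Where R_acc is task accuracy, Ê is energy, L̂ is latency, and Ĉ is cost per query, and α, β, γ, δ are the weights on each term. The worked example below uses normalized values for Ê, L̂, and Ĉ with (α, β, γ, δ) = (0.5, 0.1, 0.1, 0.3).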

Worked Example

Suppose the current Intelligence spec uses INT4 quantization and the optimizer proposes switching to INT8. Here is the gate check for one cluster:

| Metric | INT4 (current) | INT8 (proposed) |
|---|---|---|
| R_acc (accuracy) | 0.70 | 0.82 |
| Ê (energy, normalized) | 0.30 | 0.45 |
| L̂ (latency, normalized) | 0.25 | 0.35 |
| Ĉ (cost, normalized) | 0.20 | 0.20 |

Current reward:

R = 0.5(0.70) − 0.1(0.30) − 0.1(0.25) − 0.3(0.20) = 0.35 − 0.03 − 0.025 − 0.06 = 0.235

Proposed reward:

R′ = 0.5(0.82) − 0.1(0.45) − 0.1(0.35) − 0.3(0.20) = 0.41 − 0.045 − 0.035 − 0.06 = 0.270

R′ = 0.270 > R = 0.235. Target cluster improved. If non-target clusters also pass the ε check, this edit is accepted.

Notice the tradeoff baked into the weights. α = 0.5 means accuracy is dominant but not absolute. If INT8 gave only a 2-point accuracy improvement but doubled energy and latency, the composite reward would decrease and the edit would be rejected. The reward function encodes the user’s preference: “I want a system that works well and is cheap to run, and I also care, with less weight, about battery drain and response time.”
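The worked example and the hypothetical weak-improvement case can be reproduced in a few lines of Python. The function name and dictionary layout are illustrative; the weights and metric values are the ones used in the text, and the third case assumes "2" means two percentage points of accuracy (0.70 to 0.72) with energy and latency doubled from the INT4 row.

```python
WEIGHTS = {"alpha": 0.5, "beta": 0.1, "gamma": 0.1, "delta": 0.3}


def composite_reward(acc, energy, latency, cost, w=WEIGHTS):
    """R = alpha*acc - beta*energy - gamma*latency - delta*cost (normalized inputs)."""
    return (
        w["alpha"] * acc
        - w["beta"] * energy
        - w["gamma"] * latency
        - w["delta"] * cost
    )


r_int4 = composite_reward(0.70, 0.30, 0.25, 0.20)   # current spec
r_int8 = composite_reward(0.82, 0.45, 0.35, 0.20)   # proposed spec from the table
r_weak = composite_reward(0.72, 0.60, 0.50, 0.20)   # assumed: +2 points, energy/latency doubled

print(f"INT4 {r_int4:.3f}  INT8 {r_int8:.3f}  weak INT8 {r_weak:.3f}")
print("accept table proposal:", r_int8 > r_int4)
print("accept weak proposal:", r_weak > r_int4)
```

Changing the entries of `WEIGHTS` is the simplest way to see how a different preference, for example a battery-constrained device with a larger β, would flip the decision.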
In the composite reward R = α·Racc − β·Ê − γ·L̂ − δ·Ĉ with (α,β,γ,δ) = (0.5,0.1,0.1,0.3), what does the 0.3 weight on cost Ĉ imply?

Chapter 6: The Portability Experiment

The cleanest test of the Spec abstraction is a controlled substitution experiment. Take the same model (Qwen3.5-9B), the same benchmarks, and two different stacks: one designed as a monolith (OpenClaw, Hermes Agent), and one designed with the Spec abstraction (OpenJarvis). How much does each drop when you swap cloud for local?

Table 1 Results

The paper reports results on 8 benchmarks. Here are the four with the most dramatic differences:

| Benchmark | Cloud (Opus 4.6) | OpenClaw + Qwen | Δ vs. cloud | OpenJarvis + Qwen | Δ vs. cloud |
|---|---|---|---|---|---|
| SWE-Bench | 79.4% | 45.7% | −33.7 pp | 71.8% | −7.6 pp |
| LiveCodeBench | 73.2% | 47.8% | −25.4 pp | 83.0% | +9.8 pp |
| PinchBench | 88.5% | 49.3% | −39.2 pp | 100.0% | +11.5 pp |
| LiveResearchBench | 82.7% | 55.4% | −27.3 pp | 91.0% | +8.3 pp |

Read these numbers carefully. On LiveCodeBench, PinchBench, and LiveResearchBench, the local 9B model with an OpenJarvis spec exceeds the cloud model. A model 50× smaller, running on your laptop, outperforms Claude Opus 4.6 because the surrounding system was configured for it. This is the strongest evidence that the substitution catastrophe is architectural, not capability-bound.
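The deltas in the table are simple differences from the cloud score, and recomputing them is a useful sanity check. The sketch below uses only the four rows shown; the variable names are illustrative.

```python
# benchmark: (cloud, monolithic stack + Qwen, OpenJarvis + Qwen), all in percent
results = {
    "SWE-Bench": (79.4, 45.7, 71.8),
    "LiveCodeBench": (73.2, 47.8, 83.0),
    "PinchBench": (88.5, 49.3, 100.0),
    "LiveResearchBench": (82.7, 55.4, 91.0),
}

for name, (cloud, mono, openjarvis) in results.items():
    print(f"{name:18s} monolithic {mono - cloud:+.1f} pp   OpenJarvis {openjarvis - cloud:+.1f} pp")

# Benchmarks where the local spec beats the cloud model outright.
local_wins = [name for name, (cloud, _, oj) in results.items() if oj > cloud]
print(local_wins)
```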

What Changed?

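The OpenJarvis spec and the monolithic stacks share the exact same model weights (Qwen3.5-9B). The only difference is everything else: the prompts, the agent loop, the tool descriptions, and the inference parameters, all tuned jointly.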

[Figure: Portability Comparison. Side-by-side comparison of monolithic stacks vs. OpenJarvis on the same local model, per benchmark.]

On LiveCodeBench, PinchBench, and LiveResearchBench, the OpenJarvis Qwen3.5-9B spec outperforms Claude Opus 4.6. How is this possible?

Chapter 7: The Pareto Frontier

Performance is not a single number. A system that gets 95% accuracy but costs $0.50 per query is very different from one that gets 88% accuracy at $0.0006 per query. OpenJarvis explores this tradeoff space explicitly, producing a Pareto frontier — the set of specs where you cannot improve one metric without degrading another.

The Numbers

| System | Avg Accuracy | Cost/Query | Latency | Where It Runs |
|---|---|---|---|---|
| Cloud (Claude Opus 4.6) | ~80% | $0.48 | ~12s | Cloud |
| Cloud (GPT-4.1) | ~75% | $0.32 | ~8s | Cloud |
| OpenJarvis Spec (Qwen 9B) | ~77% | $0.0006 | ~3s | Your laptop |

800× cost reduction. The cloud system costs $0.48 per query. The OpenJarvis on-device spec costs $0.0006 (electricity only). That is a factor of 800. For a user making 100 queries per day, that is $48/day vs. $0.06/day. Over a year: $17,520 vs. $22.
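The arithmetic behind these figures is short enough to reproduce directly. The per-query costs and the 100 queries per day come from the text; the variable names are illustrative.

```python
cloud_cost = 0.48       # dollars per query, Claude Opus 4.6 row
local_cost = 0.0006     # dollars per query, OpenJarvis spec row
queries_per_day = 100

ratio = cloud_cost / local_cost
yearly_cloud = cloud_cost * queries_per_day * 365
yearly_local = local_cost * queries_per_day * 365

print(f"cost ratio: {ratio:.0f}x")
print(f"per year: ${yearly_cloud:,.0f} cloud vs ${yearly_local:,.2f} local")
```

The yearly local figure comes out just under $22, which the text rounds up.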

The latency improvement is equally dramatic: 4× faster. Cloud queries take ~12 seconds (network round-trip + cloud queue + inference). Local queries take ~3 seconds (just inference, no network). For interactive use, this is the difference between “waiting for a response” and “instant.”

The Pareto Structure

The paper reconstructs the accuracy-vs-cost Pareto frontier across all evaluated systems. The key insight: OpenJarvis specs cluster in the high-accuracy, low-cost region, while cloud systems cluster in the high-accuracy, very-high-cost region. The Pareto frontier shows that once you cross a threshold of optimization effort, the on-device systems dominate on the cost axis.
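For readers who want the definition in executable form, here is a small Python sketch of a two-metric Pareto filter (maximize accuracy, minimize cost). It is applied to the three summary rows from "The Numbers", whose accuracies are approximate in the source, so treat the result as illustrative rather than as the paper's frontier. The function name is hypothetical.

```python
def pareto_frontier(systems):
    """Return names of systems that no other system dominates.

    A system is dominated if another has accuracy >= and cost <=,
    with at least one of the two strictly better.
    """
    frontier = []
    for name, acc, cost in systems:
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for other, a, c in systems
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# (name, average accuracy in percent, cost in dollars per query)
systems = [
    ("Cloud (Claude Opus 4.6)", 80.0, 0.48),
    ("Cloud (GPT-4.1)", 75.0, 0.32),
    ("OpenJarvis Spec (Qwen 9B)", 77.0, 0.0006),
]

print(pareto_frontier(systems))
```

On these three rows the GPT-4.1 configuration drops out because the OpenJarvis spec is both more accurate and cheaper, while the Opus configuration stays on the frontier only through its accuracy lead.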

[Figure: Accuracy vs. Cost Pareto Frontier. The dashed line shows the Pareto frontier. Systems that are both less accurate and more expensive than a frontier point are dominated. OpenJarvis specs cluster in the sweet spot: near-cloud accuracy at local-device cost.]

The average accuracy gap between the best cloud system and the best OpenJarvis spec is just 3.2 pp across all 8 benchmarks. On 4 benchmarks, the gap is negative — the local spec wins. This 3.2 pp gap is the price of full data privacy, 800× cost reduction, and 4× latency reduction. For most personal AI use cases, that is a bargain.

What does the 800× cost reduction represent?

Chapter 8: Ablations & What the Optimizer Learns

The ablation study in Figure 8 of the paper answers two questions: (1) Does the proposer matter (LLM vs. evolutionary vs. random)? (2) Does the move space matter (1 primitive vs. 4 primitives)?

The 3×3 Grid

The authors cross three proposer strategies with three move-space sizes:

| Proposer | 1 Primitive | 2 Primitives | 4 Primitives |
|---|---|---|---|
| LLM Proposer | 77.5% | 87.0% | 93.3% |
| Evolutionary | 70.0% | 79.5% | 83.8% |
| Random | 65.5% | 75.0% | 80.5% |

(Numbers are averaged across benchmarks for illustration; actual per-benchmark ranges are: LLM-4 = 83–100%, Evolutionary-4 = 73–94.5%, Random-4 = 69–92.5%.)

Both axes matter, but move space has a somewhat larger effect. Moving from 1 to 4 primitives adds 5.5–16.5 pp regardless of proposer (13.8–15.8 pp in the averaged grid above). Moving from random to LLM proposer adds ~10–14 pp regardless of move space (12.0–12.8 pp in the averaged grid). The two effects are roughly additive, which means they capture different sources of improvement.
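The two effects and the additivity claim can be checked against the averaged grid with a few lines of Python. The grid values are the illustrative averages from the table, so the gains computed here describe that grid only, not the per-benchmark ranges. Variable names are illustrative.

```python
# Averaged accuracy (percent) by proposer and number of editable primitives.
grid = {
    "LLM": {1: 77.5, 2: 87.0, 4: 93.3},
    "Evolutionary": {1: 70.0, 2: 79.5, 4: 83.8},
    "Random": {1: 65.5, 2: 75.0, 4: 80.5},
}

# Gain from widening the move space (1 -> 4 primitives), per proposer.
move_space_gain = {p: round(row[4] - row[1], 1) for p, row in grid.items()}

# Gain from a smarter proposer (Random -> LLM), per move-space size.
proposer_gain = {k: round(grid["LLM"][k] - grid["Random"][k], 1) for k in (1, 2, 4)}

# If the effects were perfectly additive, Random-1 plus both gains would equal LLM-4.
additive_guess = round(grid["Random"][1] + move_space_gain["Random"] + proposer_gain[1], 1)

print(move_space_gain)
print(proposer_gain)
print(additive_guess, "vs actual", grid["LLM"][4])
```

The additive estimate lands within one point of the actual LLM × 4 cell, which is what "roughly additive" means here.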

Why the LLM Proposer Wins

The evolutionary proposer mutates edits randomly (e.g., “change temperature from 0.7 to 0.6”) and selects the best. The LLM proposer reads the failure traces and proposes targeted fixes. It can say: “These 8 failures all happen when the agent tries to use the bash tool for file editing — the bash tool description is confusing. I will: (1) clarify the bash tool description, (2) add a file_editor tool, (3) update the system prompt to prefer file_editor for edits.”

This is a coordinated, multi-primitive edit informed by causal diagnosis. The evolutionary proposer would need to stumble upon this combination by chance.

What Primitives Get Edited?

The paper reports the distribution of accepted edits across primitives (Table 10), and the pattern varies dramatically by benchmark type:

| Benchmark Type | Intelligence | Agent | Tools | Engine |
|---|---|---|---|---|
| Coding (SWE-Bench, LiveCode) | 44% | 28% | 18% | 10% |
| Customer Service (Tau2) | 16% | 46% | 24% | 14% |
| Research (LiveResearch) | 22% | 20% | 42% | 16% |

Different tasks need different knobs. Coding tasks benefit most from Intelligence edits (quantization, generation params) because code requires precise token-by-token generation. Customer service benefits most from Agent edits (prompt rewriting, reasoning loop) because conversation flow matters more than raw generation quality. Research tasks benefit most from Tool edits (retrieval config, search tool descriptions) because finding the right information is the bottleneck.

This is why single-primitive optimization fails. If you only tune the model (Intelligence), you reach only the 16–44% of accepted edits that land there and forgo the other 56–84%. The spec search must touch all four primitives.
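The 56–84% figure follows directly from the edit distribution: it is the share of accepted edits that fall outside Intelligence. A short Python check, using the table's percentages and illustrative variable names:

```python
# Share of accepted edits per primitive (percent), by benchmark type.
edits = {
    "Coding": {"Intelligence": 44, "Agent": 28, "Tools": 18, "Engine": 10},
    "Customer Service": {"Intelligence": 16, "Agent": 46, "Tools": 24, "Engine": 14},
    "Research": {"Intelligence": 22, "Agent": 20, "Tools": 42, "Engine": 16},
}

dominant = {}
outside_intelligence = {}
for task, shares in edits.items():
    assert sum(shares.values()) == 100           # each row is a full distribution
    dominant[task] = max(shares, key=shares.get)  # most-edited primitive
    outside_intelligence[task] = 100 - shares["Intelligence"]

print(dominant)
print(outside_intelligence)
```

Each task type has a different dominant primitive, and the non-Intelligence share ranges from 56% (coding) to 84% (customer service).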

Optimization Cost

LLM-guided spec search is 7–11× cheaper than single-primitive baselines at the same accuracy level. The reason: the LLM proposer makes informed edits, so fewer iterations are wasted. Random search at 4 primitives needs ~120 iterations to converge; LLM search needs ~15–25.

[Figure: Ablation Grid: Proposer × Move Space. The grid shows average accuracy for each combination of proposer strategy and number of editable primitives. Brighter = higher accuracy. Both axes contribute; the best cell (LLM × 4 primitives) is brightest.]

Table 10 shows that coding benchmarks have 44% Intelligence edits while customer service has 46% Agent edits. What does this tell us about single-primitive optimization?

Chapter 9: Connections

Related Work

| System | What It Optimizes | How OpenJarvis Differs |
|---|---|---|
| GEPA | Prompts via reflective mutation | GEPA optimizes prompts only; OpenJarvis optimizes all 4 runtime primitives jointly |
| DSPy / MIPROv2 | Prompts + few-shot examples | DSPy optimizes within a single primitive (prompts). OpenJarvis adds Engine, Tools, Intelligence to the search space |
| LoRA | Model weights via low-rank adapters | Weight updates touch only one of the 4 runtime primitives (Intelligence). OpenJarvis finds that 56–84% of accepted edits fall outside Intelligence |
| Minions | Local-cloud collaboration at inference | Minions requires cloud at inference time. OpenJarvis uses cloud only during optimization; inference is fully local |
| Archon | Multi-LLM ensemble architecture | Archon ensembles multiple models. OpenJarvis optimizes a single-model stack for on-device deployment |

Limitations

Cheat Sheet

OpenJarvis in 30 seconds:
Problem: Cloud → local model swap drops 25–39 pp because stacks are monolithically coupled to their cloud model
Solution: Decompose into 5 typed primitives (Intelligence, Engine, Agents, Tools/Memory, Learning). Search across all 4 runtime primitives jointly.
Method: LLM-guided spec search — cloud model reads failure traces, proposes multi-primitive edits, gate ensures monotonic improvement
Result: Local 9B model matches/exceeds cloud on 4/8 benchmarks. 800× lower cost, 4× lower latency. 3.2 pp average gap.
Key ablation: Both proposer (LLM vs random) and move space (1 vs 4 primitives) matter. Different task types need different primitive mixes.

What to Read Next

What is the fundamental difference between OpenJarvis and systems like DSPy or GEPA?