JIT Objectives — Veanors

Chapter 0: The Problem

You're writing the introduction to your research paper. You've been staring at it for an hour. You ask an LLM for help. It responds with:

"Replace technical phrases like 'hill-climb' with simpler alternatives"
"Some sentences could be broken up for readability"
"Consider moving implementation specifics to a later section"

This is useful advice. It's also utterly generic. A seasoned researcher in your field would have given you something far more specific: "Your core argument about why objectives should be induced rather than specified is buried in paragraph three. Lead with it."

Why does the LLM produce milquetoast feedback? Because it doesn't know what you specifically need right now. It has no idea whether you're struggling with the logical flow of an HCI paper, tightening a quantitative evaluation, or reframing a contribution for a different audience. So it defaults to advice that works for everyone and inspires no one.

The fundamental mismatch: LLM training objectives are defined far in advance and must work for all users. Post-training (e.g., RLHF) optimizes against many simultaneous objectives (reasoning, safety, helpfulness), which pushes the model toward generic, committee-approved outputs. Even at interaction time, users struggle to articulate what they want ("Is it worth asking the model to critique my draft, or should I just ask for paper recommendations, or something else?"). The result: everyone gets the same bland output.

This isn't just an annoyance. Research shows that generic LLM outputs promote monocultures — steering users toward homogeneous, convergent thinking even when individual outputs appear creative. At a population level, everyone's writing starts to sound the same.

Why do LLMs produce generic outputs when asked for help with writing?

Training objectives must cover all possible users, so they converge to safe, generic outputs, and users struggle to specify what they actually need
LLMs don't have enough parameters to produce specific outputs
The training data doesn't contain specific writing advice

Chapter 1: The Key Insight

The core idea is deceptively simple. Instead of asking users to specify objectives (tedious, and they often don't know what they want), observe their behavior and infer the objective. Then optimize aggressively for that one inferred goal.

Think about what a skilled human collaborator does. If you hand a colleague your paper draft, they don't ask you to write a detailed specification of what kind of feedback you want. They look at what you're working on, infer what you're struggling with, and give advice calibrated to that specific struggle.

The calculus analogy: A user's objective over long stretches of time is complex and curved: their writing philosophy, their career goals, their aesthetic preferences. That's hard to model. But as in calculus, even a complex curve is well approximated by a straight line over a short enough interval. JIT objectives capture these instantaneous goals: not "make me a better writer," but "clarify the research contribution in this abstract for HCI reviewers."

This reframes the entire interaction paradigm. Compare the two flows:

Traditional

User writes prompt → LLM guesses what they want → Generic output

JIT Objectives

System observes user → Induces specific objective → Optimizes for THAT goal → Specialized output

The objective becomes a first-class interactive object: visible (you can see what the AI thinks you want), modifiable (you can correct it), and equipped to steer any number of downstream AI systems simultaneously.

When the authors applied this approach to their own paper-writing process, instead of generic syntax editing, the system produced an objective of "explain the system clearly in this CHI paper introduction." That single objective unlocked outputs that reworked the paragraph's logical flow to match related CHI papers, simulated feedback from likely CHI reviewers, and identified where the narrative veered away from describing the system.

What makes JIT objectives different from asking the user to write a better prompt?

JIT objectives are inferred automatically from passive observation of behavior, not manually specified, capturing goals users may not even know how to articulate
JIT objectives use a different LLM model
JIT objectives require users to fill out a detailed form

Chapter 2: JIT Objectives Architecture

The architecture has three stages. Each is simple on its own — the power comes from chaining them together with a shared objective.

Stage 1: Observe

Passively capture the user's context. This could be a browser screenshot, text from a document, cursor position, recent edits, or file attachments. The key word is passively — the user doesn't have to do anything special.

Stage 2: Induce

A vision-language model takes the observed context and infers candidate objectives. Each objective is a JSON object with three fields:

name: "Strengthen the narrative argument"
description: "Develop a compelling narrative that emphasizes how JIT objectives improve LLM systems by centering user needs with minimal developer effort."
weight: 9 (estimated importance on a 1-10 scale)
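To make the three-field objective concrete, here is a minimal Python sketch of how it might be represented and serialized. The class name `Objective`, the range check on `weight`, and the helper methods are assumptions for illustration; the source only specifies the three fields and the 1-10 scale.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Objective:
    """One induced objective with the three fields named in the text."""

    name: str
    description: str
    weight: int  # estimated importance on a 1-10 scale

    def __post_init__(self):
        # Assumption of this sketch: reject weights outside the stated scale.
        if not 1 <= self.weight <= 10:
            raise ValueError(f"weight must be in 1..10, got {self.weight}")

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "Objective":
        data = json.loads(raw)
        return cls(data["name"], data["description"], int(data["weight"]))


objective = Objective(
    name="Strengthen the narrative argument",
    description=(
        "Develop a compelling narrative that emphasizes how JIT objectives "
        "improve LLM systems by centering user needs with minimal developer effort."
    ),
    weight=9,
)
round_tripped = Objective.from_json(objective.to_json())
```

Keeping the objective as plain JSON is what lets the same object be shown to the user and passed to any downstream prompt.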

Stage 3: Optimize

The induced objective is applied to downstream systems via two operators:

gen_objective: Prepended to any generator prompt, steering its output toward the user's goal (the actor)
eval_objective: Prepended to any evaluator prompt, shaping how outputs are scored and ranked (the critic)

The generate-then-rank pattern: This architecture maps directly to actor-critic systems in RL. The generator proposes candidates shaped by the objective. The evaluator scores those candidates against the same objective. The best candidate wins. Increasing the number of candidates (best-of-N) scales quality at inference time.
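The two operators can be sketched as plain string functions that prepend the objective JSON to an existing prompt. This is only an illustration of the prepend mechanism: the header wording and the `_prepend` helper are hypothetical, and the paper's actual prompt templates are not reproduced here.

```python
import json


def _prepend(objective: dict, header: str, prompt: str) -> str:
    # Hypothetical wrapper text; only "prepend the objective JSON" comes from the text.
    return f"{header}\n{json.dumps(objective, indent=2)}\n\n{prompt}"


def gen_objective(objective: dict, generator_prompt: str) -> str:
    """Steer a generator (the actor) toward the user's objective."""
    return _prepend(objective, "Generate output that serves this objective:", generator_prompt)


def eval_objective(objective: dict, evaluator_prompt: str) -> str:
    """Steer an evaluator (the critic) to score against the same objective."""
    return _prepend(objective, "Score candidates against this objective:", evaluator_prompt)


objective = {
    "name": "Strengthen the narrative argument",
    "description": "Develop a compelling narrative that centers user needs.",
    "weight": 9,
}
actor_prompt = gen_objective(objective, "Give feedback on this draft.")
critic_prompt = eval_objective(objective, "Rate each feedback candidate from 1 to 10.")
```

Because both prompts embed the identical JSON, the generator and the evaluator share one definition of what counts as good output.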

What are the three stages of the JIT objectives architecture?

Observe user context, induce candidate objectives from observations, optimize generation and evaluation using the induced objective
Prompt, generate, evaluate
Train, fine-tune, deploy

Chapter 3: Observation and Context

The quality of induced objectives depends entirely on the quality of the observations. Garbage in, garbage out. So what does the system actually observe?

Input modalities

The Poppins system (the paper's concrete instantiation) accepts three types of input:

Screenshots: The user's full browser window or desktop — captures visual layout, cursor position, what's visible, what tools are open
Text content: Extracted text from the current webpage — the actual words being written or read
File attachments: Images, PDFs, or other files the user uploads for targeted help

A single screenshot carries surprisingly rich information. A vision-language model can see that you're in Overleaf editing the System section, that you have comments from co-authors visible in the margin, that your cursor is positioned in the third paragraph, and that your references panel shows HCI papers. From this, it can infer: "User is iterating on the System section by integrating feedback from collaborators."
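One way to picture the three input types is as a single observation record passed to the induction step. The sketch below is an assumption for illustration only: the `Observation` class, its field names, and the `modalities` helper are hypothetical and do not describe Poppins's internal data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Observation:
    """Passively captured context; every field is optional."""

    screenshot_png: Optional[bytes] = None  # browser window or desktop capture
    page_text: Optional[str] = None  # text extracted from the current page
    attachments: List[str] = field(default_factory=list)  # paths of uploaded files

    def modalities(self) -> List[str]:
        """List which input types are present in this observation."""
        present = []
        if self.screenshot_png:
            present.append("screenshot")
        if self.page_text:
            present.append("text")
        if self.attachments:
            present.append("attachments")
        return present


obs = Observation(page_text="We introduce JIT objectives...", attachments=["draft.pdf"])
available = obs.modalities()
```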

Context windows

Users can tune the temporal scope of objective induction. Some want micro-objectives for the next minute ("Highlight and delete other usages of an outdated system name"). Others want macro-objectives spanning weeks ("Improve the clarity of my academic writing"). The default targets a sweet spot: objectives for the current work session.

Why passive observation beats explicit prompting: When users manually prompt an LLM, they typically underspecify. "Give feedback on this draft" omits everything the system needs to know — what stage the draft is at, who the audience is, what specific aspect the user is struggling with. A screenshot captures all of this context implicitly. The user doesn't have to think about what to communicate because the system can see it.

Why does passive observation (like screenshots) produce better objectives than explicit user prompts?

Screenshots implicitly capture context (task stage, audience, tools, content) that users typically omit when writing manual prompts
Screenshots have higher resolution than text
Users always lie in their prompts

Chapter 4: Objective Induction

Objective induction is where the magic happens. The system takes raw observations and produces structured, actionable objectives. Let's trace through exactly how.

The induction process

A vision-language model receives the user's context (screenshot + text) and follows a chain-of-thought process:

Task domain: What field is the user working in? (e.g., academic writing, data analysis, design)
Stage of completion: Is this a rough draft, a polished revision, or a final check?
Potential audience: Who will read/use this? (e.g., CHI reviewers, a thesis committee, a client)
Ideal final output: What would success look like?
Anticipated reaction to assistance: What kind of help would the user welcome versus find annoying?

From this reasoning, the model produces multiple candidate objectives, each with a name, description, and importance weight.
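The induction step can be sketched as a prompt builder plus a parser for the model's reply. This is a minimal sketch under stated assumptions: the prompt wording, the function names, and the expectation that the model returns a bare JSON list are all hypothetical, and the model call itself is left out.

```python
import json

# The five chain-of-thought factors listed in the text.
REASONING_STEPS = [
    "Task domain",
    "Stage of completion",
    "Potential audience",
    "Ideal final output",
    "Anticipated reaction to assistance",
]


def build_induction_prompt(page_text: str) -> str:
    """Assemble a hypothetical induction prompt from the observed text."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(REASONING_STEPS, 1))
    return (
        "Reason through the following about the user's context, in order:\n"
        f"{steps}\n"
        "Then return a JSON list of candidate objectives, each with "
        '"name", "description", and "weight" (1-10).\n\n'
        f"Context:\n{page_text}"
    )


def parse_candidates(model_reply: str) -> list:
    """Keep only candidates that carry all three required fields."""
    required = {"name", "description", "weight"}
    return [c for c in json.loads(model_reply) if required.issubset(c)]


prompt = build_induction_prompt("Abstract: We introduce JIT objectives...")
# A canned reply stands in for the vision-language model's output.
canned_reply = json.dumps([
    {"name": "Clarify the abstract's research contribution", "description": "...", "weight": 9},
    {"name": "Incomplete candidate"},
])
candidates = parse_candidates(canned_reply)
```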

Example: A researcher editing a paper abstract

The system observes a researcher editing an abstract in Overleaf with co-author comments visible. It might produce:

Objective 1 (weight: 9)

"Clarify the abstract's research contribution" — Ensure the abstract clearly communicates the novel contribution and distinguishes it from prior work, making the claim legible to CHI reviewers.

Objective 2 (weight: 7)

"Clarify the technical architecture" — Refine the description of the JIT objectives architecture, ensuring components and their relationships are clearly defined.

Objective 3 (weight: 5)

"Strengthen quantitative evidence" — Ensure the evaluation numbers and study design are presented compellingly.

The highest-weighted objective becomes the active one by default, but users can select, modify, or create alternatives.
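The default-selection rule is simple enough to write down directly. In this sketch the function name `active_objective` and the `user_choice` index are hypothetical; the candidate names and weights are the ones from the example above.

```python
def active_objective(candidates, user_choice=None):
    """Return the user's pick if there is one, else the highest-weighted candidate."""
    if user_choice is not None:
        return candidates[user_choice]
    return max(candidates, key=lambda c: c["weight"])


candidates = [
    {"name": "Clarify the abstract's research contribution", "weight": 9},
    {"name": "Clarify the technical architecture", "weight": 7},
    {"name": "Strengthen quantitative evidence", "weight": 5},
]
default = active_objective(candidates)
overridden = active_objective(candidates, user_choice=2)
```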

Why multiple candidates matter: The system doesn't pretend to know exactly what you want. It generates a ranked list of plausible objectives and lets you confirm, modify, or override. This is fundamentally different from a system that silently assumes a single objective. Making the objective visible turns the AI's reasoning into a collaborative negotiation rather than a black-box guess.

What chain-of-thought factors does the system consider when inducing objectives?

Task domain, stage of completion, potential audience, ideal final output, and anticipated reaction to assistance
Word count, grammar errors, and reading level
Model size, temperature, and top-p sampling

Chapter 5: Downstream Specialization

Once an objective is induced, it powers two types of specialization. Both use the same lightweight mechanism: prepending the objective JSON to existing prompts.

Expertise generation (Poppins-experts)

Given the objective "Strengthen the narrative argument," the system generates expert perspectives. Not generic "writing expert" personas — deeply specialized ones:

A Technical Writing Specialist with background in system architecture documentation, referencing specific style guides and academic conventions
A Human-AI Interaction Researcher with publications on cognitive load in AI interfaces, who notices that the paper "could be clearer about the specific metrics used to assess objective accuracy"
A Systems Architecture Expert who observes that "internal component interfaces are well-defined but the paper could better specify how external systems would integrate"

Each expert comes with detailed background material retrieved via LLM search — specific publications, talks, projects, methodologies, and key ideas. This isn't surface-level persona prompting. The objective shapes what kind of expertise is relevant.

Tool generation (Poppins-tools)

Even more ambitiously, the system can generate entirely new interactive software tools tailored to the objective. From "Create clear visual representations of the AI system," Poppins generated:

A Component Relationship Diagram Builder — drag-and-drop UI for visualizing system architecture with different connection types (data flow, feedback, dependency)
An Architecture Template Gallery — curated layout options to apply to a diagram
A Component Style Synchronizer — tool for unifying colors and formatting across a diagram

Each participant gets UNIQUE tools: In the paper's user study, no two participants received the same generated tool. A scholarship essay writer got a "Cultural Perspective Highlighter." A researcher working on microcontrollers got a "Neural Architecture Search Explorer." A bioengineering student got a "Technical Protocol Generator." A fiction writer got a "Character Emotion Tracker." The objective is what makes each tool distinct.

How does a JIT objective transform expert generation from generic to specialized?

The objective determines WHAT KIND of expertise is relevant, producing experts with specific publications, methodologies, and perspectives aligned to the user's in-the-moment goal
It uses a different LLM with more parameters
It searches a database of real experts

Chapter 6: Evaluation Against Objectives

Generation is only half the story. The other half is evaluation — and this is where JIT objectives create the most dramatic improvements.

The problem with generic evaluation

Consider a standard LLM-as-a-judge setup. You generate 10 feedback candidates and ask the judge to pick the best one. Without a JIT objective, the judge evaluates on generic criteria — "overall quality," "intellectual rigor," "helpfulness." The result? Most candidates score similarly. The judge can't differentiate because it doesn't know what specifically matters.

Objective-aligned evaluation

Add the JIT objective "Strengthen the narrative argument" to the judge's prompt. Now it can distinguish between feedback that merely mentions the importance of narratives and feedback that provides concrete strategies for incorporating narrative structure. The scores spread out. The best candidate becomes clearly distinguishable.

The eval_objective operator: Just as gen_objective steers generation, eval_objective steers evaluation. Same JSON specification, same prepend-to-prompt mechanism, completely different effect. In generation, the objective shapes what's produced. In evaluation, the objective shapes what's selected. Together, they create a generate-then-rank pipeline where both the actor and the critic share a unified understanding of what "good" means.

Best-of-N scaling

The generate-then-rank architecture enables test-time compute scaling. Generate N candidates with gen_objective, then select the best with eval_objective. As N increases, quality improves — the evaluator has more candidates to choose from, and the objective ensures it picks the right one.

The paper tested N = 1, 10, and 100. Quality improved consistently with N, confirming that JIT evaluators are strong enough to identify the best candidate from a large pool.
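The best-of-N selection loop can be sketched in a few lines. The `generate` and `score` functions below are toy stand-ins (a random number plays the role of candidate quality), not LLM calls, so this illustrates the control flow only and says nothing about the paper's measured results. All names are assumptions.

```python
import random
from typing import Callable


def best_of_n(generate: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Draw n candidates and return the one the evaluator scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)


def toy_best_score(n: int, seed: int = 0) -> float:
    """Run best-of-n with toy stand-ins and return the winning score."""
    rng = random.Random(seed)

    def generate() -> str:
        # Placeholder for a generator call whose prompt carries the objective.
        return f"candidate quality={rng.random():.3f}"

    def score(candidate: str) -> float:
        # Placeholder for a judge call whose prompt carries the same objective.
        return float(candidate.split("=")[1])

    return score(best_of_n(generate, score, n))


# With a fixed seed, the first N draws are a prefix of the larger pools,
# so the best score can only stay the same or rise as N grows.
scores = {n: toy_best_score(n) for n in (1, 10, 100)}
```

In a real pipeline the gain from larger N depends on how well the evaluator's scores track what the user actually wants, which is the role the shared objective plays.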

Why does adding a JIT objective to an LLM evaluator produce better-differentiated scores?

The objective gives the evaluator specific criteria to distinguish candidates, e.g., filtering out feedback that mentions narratives without providing concrete strategies for narrative structure
It uses a more powerful model for evaluation
It increases the temperature parameter

Chapter 7: Results

The paper runs three evaluations, each more complex than the last: isolated objectives, objective-optimized outputs, and full tool generation in the wild.

Study 1: Accuracy and utility (N=14)

14 participants submitted browser traces from their daily work over three days, yielding 70 unique contexts. Results:

Accuracy: M = 2.04 on a -3 to +3 scale (75% rated "Accurate" or higher)
Usefulness: M = 2.18 (75% rated "Useful" or higher)
Win rates vs. baseline LLM: Expertise: 85.7%, Tool designs: 73.2%, Feedback: 71.4%

Study 2: Generalizability (N=205)

205 online participants submitted 410 workspace screenshots. The system performed live objective induction on each. Results:

Accuracy: M = 1.92 (76.6% rated "Accurate" or higher)
Usefulness: M = 2.06 (79.8% rated "Useful" or higher)
Win rates vs. baseline: 66-71% across expertise, tools, and feedback

Study 3: In-person use sessions (N=17)

17 participants used Poppins on their own writing tasks for one hour each. They compared Poppins-experts and Poppins-tools against a standard LLM chat baseline. Results:

Overall quality: Poppins significantly higher than baseline (p < .05)
Uniqueness: Every participant received different generated tools — a "Cultural Perspective Highlighter" for one, a "Technical Protocol Generator" for another
Preference split: 11 participants preferred Poppins-experts (textual feedback with objective steering), 6 preferred Poppins-tools (fully generated interactive UIs)

The baseline is not a straw man: Both JIT and baseline conditions used the same model (Claude Sonnet 3.7), the same user screenshot, and the same prompt. The ONLY difference was whether an induced objective was prepended. That single addition, a few sentences of JSON, produced 66-86% win rates. The objective is the mechanism that unlocks specialization.

What was the ONLY difference between the JIT condition and the baseline in the paper's experiments?

The JIT condition prepended an induced objective JSON to the prompt; the model, screenshot, and base prompt were otherwise the same
The JIT condition used a larger model
The JIT condition had access to user history over many sessions

Chapter 8: Interactive Objectives

The deepest contribution of the paper isn't the architecture or the win rates. It's the idea that objectives should be first-class interactive objects in the UI. What does that mean?

Visible

The user can see what the AI thinks they want. Instead of a black box that silently generates output, the system surfaces its inferred objective: "Strengthen the narrative argument (weight: 9)." This transparency is itself valuable — it prompts the user to reflect on their own goals.

Modifiable

The user can edit any part of the objective. Don't agree with "Strengthen the narrative argument"? Change it to "Tighten the quantitative evaluation section." The description and weight are editable too. Users can also select from alternative candidates, add entirely new objectives, or delete ones they don't want.

Steerable

A single objective controls multiple downstream systems simultaneously. Change the objective once, and the expertise generator, the tool builder, and the evaluation criteria all update together. This is far more efficient than manually adjusting prompts for each system independently.

The UI affordances: Poppins provides four actions on objectives: Select (pick a different candidate), Edit (rewrite any field), Add (manually author a new objective), Delete (remove unwanted objectives). These same actions apply to generated experts and tool designs — every intermediate generation is modifiable, not just the final output.
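The four actions, together with the idea that one objective steers several downstream systems, can be sketched as a small store that notifies subscribers on every change. The `ObjectiveBoard` class and its method names are hypothetical and are not taken from the Poppins implementation.

```python
class ObjectiveBoard:
    """Holds candidate objectives and notifies downstream systems on change."""

    def __init__(self, candidates):
        self.candidates = list(candidates)
        self.active = 0
        self._listeners = []

    def subscribe(self, callback):
        """Register a downstream system (e.g., an expert or tool generator)."""
        self._listeners.append(callback)

    def current(self):
        return self.candidates[self.active] if self.candidates else None

    def _notify(self):
        for callback in self._listeners:
            callback(self.current())

    def select(self, index):
        if not 0 <= index < len(self.candidates):
            raise IndexError("no such candidate")
        self.active = index
        self._notify()

    def edit(self, index, **fields):
        self.candidates[index] = {**self.candidates[index], **fields}
        self._notify()

    def add(self, objective):
        self.candidates.append(objective)
        self.active = len(self.candidates) - 1  # a newly authored objective becomes active
        self._notify()

    def delete(self, index):
        del self.candidates[index]
        if index < self.active:
            self.active -= 1  # keep pointing at the same objective
        self.active = max(0, min(self.active, len(self.candidates) - 1))
        self._notify()


board = ObjectiveBoard([
    {"name": "Strengthen the narrative argument", "weight": 9},
    {"name": "Clarify the technical architecture", "weight": 7},
])
expert_updates, tool_updates = [], []
board.subscribe(expert_updates.append)
board.subscribe(tool_updates.append)
board.edit(0, name="Tighten the quantitative evaluation section")
```

A single edit reaches every subscriber, which is the sense in which changing the objective once updates all downstream systems together.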

This design resolves a classic tension in adaptive interfaces. Traditional adaptive UIs suffer from unpredictability — buttons move, menus change, and users feel out of control. JIT objectives sidestep this by making the adaptation criterion itself visible and editable. The UI can change dramatically (generating an entirely new tool), but the user understands why and can steer the direction.

The "I would never have thought of this" effect

Several participants were struck by tools they never would have requested but found deeply useful. P19 on the Technical Protocol Generator: "This is something that I never would have thought about, and now I find it super helpful." This is the payoff of inference over explicit specification — the system can propose objectives and tools that expand the user's imagination of what AI assistance can look like.

What three properties make JIT objectives "first-class interactive objects"?

Visible (user sees the inferred objective), modifiable (user can edit it), and steerable (one objective controls multiple downstream systems)
Fast, accurate, and cheap
Trained, deployed, and monitored

Chapter 9: Connections

What JIT Objectives build on

Adaptive interfaces (Gajos et al., 2010): The long tradition of UIs that adjust based on user context. JIT objectives extend this with LLM generativity — instead of selecting from a finite set of pre-built adaptations, the system generates entirely new tools and interfaces.

User modeling (Fischer, 2001; Horvitz et al., 1998): Estimating user goals, effort, or capabilities to personalize systems. JIT objectives inherit this pipeline (observe → infer → adapt) but apply it to steer LLM generation rather than traditional UI selection.

AI chains (Wu et al., 2022): Chaining LLM calls where each step's output becomes the next step's input. JIT objectives add a shared objective that aligns all steps toward the same goal.

Prompt engineering / in-context learning: The practice of carefully crafting prompts. JIT objectives automate the hardest part — figuring out what to ask for in the first place.

What JIT Objectives enable

Generative user interfaces: Instead of pre-built UI components, systems can generate entirely novel interfaces shaped by user-specific objectives. Poppins demonstrates this is already feasible.

Test-time compute scaling with user alignment: Best-of-N sampling works far better when the evaluator has a clear objective. JIT objectives provide that objective automatically.

Anti-monoculture AI: By producing different objectives for different users in different contexts, JIT objectives break the homogenizing tendency of generic LLM outputs.

The bigger picture: JIT objectives point toward a future where AI systems don't just respond to what users say, but understand what users need. The objective is a simple mechanism — a few sentences of JSON — but it bridges the gulf between generic AI and personalized AI. The key insight: you don't need to retrain the model to specialize it. You just need to tell it what to optimize for, in the moment, for this specific user.

Cheat sheet

Core idea: Observe user → Infer objective → Optimize generation + evaluation for THAT objective
Mechanism: Objective JSON (name, description, weight) prepended to prompts via gen_objective / eval_objective operators
Key results: 66-86% win rates over baseline LLM; unique tools per user; significantly higher quality ratings (p < .05)
Innovation: Objectives as first-class interactive objects: visible, modifiable, steerable
System: Poppins, a browser extension + web app. Claude Sonnet 3.7 + o3-mini + GPT-4.1-mini

How do JIT objectives help counteract the "monoculture" problem of generic LLM outputs?

By producing DIFFERENT objectives for different users in different contexts, they break the homogenizing tendency: each user gets specialized tools and outputs unique to their situation
By using a diverse set of training data
By increasing the temperature parameter
