Infer the user's goal from passive observation, then optimize everything downstream for that single objective. The paper reports 66-86% win rates over generic LLMs.
You're writing the introduction to your research paper. You've been staring at it for an hour. You ask an LLM for help. It responds with:
This is useful advice. It's also utterly generic. A seasoned researcher in your field would have given you something far more specific: "Your core argument about why objectives should be induced rather than specified is buried in paragraph three. Lead with it."
Why does the LLM produce milquetoast feedback? Because it doesn't know what you specifically need right now. It has no idea whether you're struggling with the logical flow of an HCI paper, tightening a quantitative evaluation, or reframing a contribution for a different audience. So it defaults to advice that works for everyone and inspires no one.
This isn't just an annoyance. Research shows that generic LLM outputs promote monocultures — steering users toward homogeneous, convergent thinking even when individual outputs appear creative. At a population level, everyone's writing starts to sound the same.
The core idea is deceptively simple. Instead of asking users to specify objectives (tedious, and they often don't know what they want), observe their behavior and infer the objective. Then optimize aggressively for that one inferred goal.
Think about what a skilled human collaborator does. If you hand a colleague your paper draft, they don't ask you to write a detailed specification of what kind of feedback you want. They look at what you're working on, infer what you're struggling with, and give advice calibrated to that specific struggle.
This reframes the entire interaction paradigm. Instead of:
The objective becomes a first-class interactive object: visible (you can see what the AI thinks you want), modifiable (you can correct it), and equipped to steer any number of downstream AI systems simultaneously.
When the authors applied this approach to their own paper-writing process, instead of generic syntax editing, the system produced an objective of "explain the system clearly in this CHI paper introduction." That single objective unlocked outputs that reworked the paragraph's logical flow to match related CHI papers, simulated feedback from likely CHI reviewers, and identified where the narrative veered away from describing the system.
The architecture has three stages. Each is simple on its own — the power comes from chaining them together with a shared objective.
Passively capture the user's context. This could be a browser screenshot, text from a document, cursor position, recent edits, or file attachments. The key word is passively — the user doesn't have to do anything special.
A vision-language model takes the observed context and infers candidate objectives. Each objective is a JSON object with three fields:
The induced objective is applied to downstream systems via two operators:
The quality of induced objectives depends entirely on the quality of the observations. Garbage in, garbage out. So what does the system actually observe?
The Poppins system (the paper's concrete instantiation) accepts three types of input:
A single screenshot carries surprisingly rich information. A vision-language model can see that you're in Overleaf editing the System section, that you have comments from co-authors visible in the margin, that your cursor is positioned in the third paragraph, and that your references panel shows HCI papers. From this, it can infer: "User is iterating on the System section by integrating feedback from collaborators."
Users can tune the temporal scope of objective induction. Some want micro-objectives for the next minute ("Highlight and delete other usages of an outdated system name"). Others want macro-objectives spanning weeks ("Improve the clarity of my academic writing"). The default targets a sweet spot: objectives for the current work session.
Objective induction is the core step: the system takes raw observations and produces structured, actionable objectives. Let's trace through how.
A vision-language model receives the user's context (screenshot + text) and follows a chain-of-thought process:
From this reasoning, the model produces multiple candidate objectives, each with a name, description, and importance weight.
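As a rough illustration of that structure, here is a minimal Python sketch of an objective record and a parser for a model's JSON reply. The key names (`name`, `description`, `weight`), the integer weight scale, the `parse_objectives` helper, the descriptions, and the second objective's weight are all assumptions made for this example, not the paper's actual schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """One candidate objective. Field names are assumed for illustration."""
    name: str
    description: str
    weight: int  # importance weight; the scale is an assumption


def parse_objectives(raw: str) -> list[Objective]:
    """Parse a JSON array of objective objects, as a model might return it."""
    items = json.loads(raw)
    return [
        Objective(
            name=item["name"],
            description=item["description"],
            weight=int(item["weight"]),
        )
        for item in items
    ]


# Hypothetical model reply; descriptions and the second weight are made up.
raw_reply = """
[
  {"name": "Strengthen the narrative argument",
   "description": "Make the core claim of the paper easy to follow.",
   "weight": 9},
  {"name": "Tighten the quantitative evaluation section",
   "description": "Make the reported results easier to verify.",
   "weight": 6}
]
"""

objectives = parse_objectives(raw_reply)
```

Keeping the record this small is deliberate: a flat three-field object is easy to show in a UI and easy to paste into a prompt.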
The system observes a researcher editing an abstract in Overleaf with co-author comments visible. It might produce:
The highest-weighted objective becomes the active one by default, but users can select, modify, or create alternatives.
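The default-selection rule can be sketched in a few lines of Python. This is only an illustration: `pick_active` is a hypothetical helper, candidates are plain dicts with an assumed `weight` key, and the second candidate's weight is invented.

```python
from typing import Optional


def pick_active(candidates: list[dict], user_choice: Optional[dict] = None) -> dict:
    """Return the user's chosen objective if there is one,
    otherwise the candidate with the highest weight."""
    if user_choice is not None:
        return user_choice
    if not candidates:
        raise ValueError("no candidate objectives to choose from")
    return max(candidates, key=lambda o: o["weight"])


candidates = [
    {"name": "Tighten the quantitative evaluation section", "weight": 6},
    {"name": "Strengthen the narrative argument", "weight": 9},
]

default = pick_active(candidates)
overridden = pick_active(candidates, user_choice=candidates[0])
```

Here `default` is the weight-9 objective, while `overridden` is whatever the user picked, regardless of weight.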
Once an objective is induced, it powers two types of specialization. Both use the same lightweight mechanism: prepending the objective JSON to existing prompts.
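A minimal sketch of that prepending step, assuming a plain-text prompt and a dict-shaped objective. The `with_objective` helper, the "Objective (JSON):" header wording, and the example base prompt are hypothetical; the paper's exact prompt layout is not shown here.

```python
import json


def with_objective(objective: dict, base_prompt: str) -> str:
    """Prepend the objective, serialized as JSON, to an existing prompt."""
    header = "Objective (JSON):\n" + json.dumps(objective, indent=2)
    return header + "\n\n" + base_prompt


objective = {
    "name": "Strengthen the narrative argument",
    "description": "Make the core claim of the paper easy to follow.",  # made up
    "weight": 9,
}
base_prompt = "Give feedback on the following draft paragraph."  # hypothetical

specialized_prompt = with_objective(objective, base_prompt)
```

Because the base prompt is left untouched, the same helper can wrap a generation prompt or a judging prompt without either needing to know about objectives.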
Given the objective "Strengthen the narrative argument," the system generates expert perspectives. Not generic "writing expert" personas — deeply specialized ones:
Each expert comes with detailed background material retrieved via LLM search — specific publications, talks, projects, methodologies, and key ideas. This isn't surface-level persona prompting. The objective shapes what kind of expertise is relevant.
Even more ambitiously, the system can generate entirely new interactive software tools tailored to the objective. From "Create clear visual representations of the AI system," Poppins generated:
Generation is only half the story. The other half is evaluation, and this is where just-in-time (JIT) objectives create the most dramatic improvements.
Consider a standard LLM-as-a-judge setup. You generate 10 feedback candidates and ask the judge to pick the best one. Without a JIT objective, the judge evaluates on generic criteria — "overall quality," "intellectual rigor," "helpfulness." The result? Most candidates score similarly. The judge can't differentiate because it doesn't know what specifically matters.
Add the JIT objective "Strengthen the narrative argument" to the judge's prompt. Now it can distinguish between feedback that merely mentions the importance of narratives and feedback that provides concrete strategies for incorporating narrative structure. The scores spread out. The best candidate becomes clearly distinguishable.
The generate-then-rank architecture enables test-time compute scaling. Generate N candidates with gen_objective, then select the best with eval_objective. As N increases, quality improves: the evaluator has more candidates to choose from, and the objective gives it a specific criterion for choosing among them.
The paper tested N = 1, 10, and 100. Quality improved consistently with N, which suggests that JIT evaluators can pick out the best candidate even from a large pool.
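The generate-then-rank loop can be sketched as follows. The names `gen_objective` and `eval_objective` come from the text above, but their bodies here are assumptions (simple JSON prepending). `toy_generate` and `toy_score` are deterministic stand-ins for real LLM calls, included only so the sketch runs; the drafts, rubric, and keyword-overlap scoring are invented for illustration.

```python
import json
from typing import Callable


def gen_objective(objective: dict, prompt: str) -> str:
    """Specialize a generation prompt by prepending the objective JSON."""
    return json.dumps(objective) + "\n\n" + prompt


def eval_objective(objective: dict, rubric: str) -> str:
    """Specialize a judging prompt the same way."""
    return json.dumps(objective) + "\n\n" + rubric


def best_of_n(objective: dict, task: str, rubric: str, n: int,
              generate: Callable[[str, int], str],
              score: Callable[[str, str], float]) -> str:
    """Generate n candidates, score each against the objective, keep the best."""
    gen_prompt = gen_objective(objective, task)
    judge_prompt = eval_objective(objective, rubric)
    candidates = [generate(gen_prompt, i) for i in range(n)]
    return max(candidates, key=lambda c: score(judge_prompt, c))


DRAFTS = [
    "Consider improving clarity.",
    "Narrative matters in papers.",
    "Move the core claim to paragraph one so the narrative builds from it.",
]


def toy_generate(prompt: str, seed: int) -> str:
    # Stand-in for an LLM sampling call: cycles through fixed drafts.
    return DRAFTS[seed % len(DRAFTS)]


def toy_score(judge_prompt: str, candidate: str) -> float:
    # Stand-in for an LLM judge: counts longer candidate words
    # that also appear in the objective-bearing judge prompt.
    words = {w.strip(".,").lower() for w in candidate.split()}
    text = judge_prompt.lower()
    return float(sum(1 for w in words if len(w) > 4 and w in text))


objective = {
    "name": "Strengthen the narrative argument",
    "description": "Make the core claim of the paper easy to follow.",
    "weight": 9,
}
task = "Give feedback on this introduction."
rubric = "Rate the feedback from 1 to 10."

pick_1 = best_of_n(objective, task, rubric, 1, toy_generate, toy_score)
pick_10 = best_of_n(objective, task, rubric, 10, toy_generate, toy_score)
```

With a single sample the loop has to accept the generic first draft; with ten samples the toy judge can prefer the draft that speaks to the objective. A real judge would replace the keyword overlap with a model call.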
The paper runs three evaluations, each more complex than the last: isolated objectives, objective-optimized outputs, and full tool generation in the wild.
14 participants submitted browser traces from their daily work over three days, yielding 70 unique contexts. Results:
205 online participants submitted 410 workspace screenshots. The system performed live objective induction on each. Results:
17 participants used Poppins on their own writing tasks for one hour each. They compared Poppins-experts and Poppins-tools against a standard LLM chat baseline. Results:
The deepest contribution of the paper isn't the architecture or the win rates. It's the idea that objectives should be first-class interactive objects in the UI. What does that mean?
The user can see what the AI thinks they want. Instead of a black box that silently generates output, the system surfaces its inferred objective: "Strengthen the narrative argument (weight: 9)." This transparency is itself valuable — it prompts the user to reflect on their own goals.
The user can edit any part of the objective. Don't agree with "Strengthen the narrative argument"? Change it to "Tighten the quantitative evaluation section." The description and weight are editable too. Users can also select from alternative candidates, add entirely new objectives, or delete ones they don't want.
A single objective controls multiple downstream systems simultaneously. Change the objective once, and the expertise generator, the tool builder, and the evaluation criteria all update together. This is far more efficient than manually adjusting prompts for each system independently.
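To make the "change once, update everywhere" idea concrete, here is a small sketch in which every downstream prompt is derived from one shared objective. The `steer` helper, the three system names, and the base prompts are hypothetical, and the prepending format is an assumption.

```python
import json


def steer(objective: dict, base_prompts: dict[str, str]) -> dict[str, str]:
    """Derive every downstream prompt from the same objective."""
    header = json.dumps(objective, indent=2)
    return {system: f"{header}\n\n{prompt}" for system, prompt in base_prompts.items()}


# Hypothetical base prompts for three downstream systems.
base_prompts = {
    "experts": "Propose expert perspectives on this draft.",
    "tools": "Design a small interactive tool for this draft.",
    "judge": "Rate the candidate feedback from 1 to 10.",
}

objective = {"name": "Strengthen the narrative argument", "weight": 9}
prompts = steer(objective, base_prompts)

# The user edits the objective once; all three prompts are re-derived together.
edited = {**objective, "name": "Tighten the quantitative evaluation section"}
prompts_after_edit = steer(edited, base_prompts)
```

The point of the design is that no downstream prompt stores its own copy of the goal, so there is nothing to fall out of sync after an edit.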
This design resolves a classic tension in adaptive interfaces. Traditional adaptive UIs suffer from unpredictability — buttons move, menus change, and users feel out of control. JIT objectives sidestep this by making the adaptation criterion itself visible and editable. The UI can change dramatically (generating an entirely new tool), but the user understands why and can steer the direction.
Several participants were struck by tools they never would have requested but found deeply useful. P19 on the Technical Protocol Generator: "This is something that I never would have thought about, and now I find it super helpful." This is the payoff of inference over explicit specification — the system can propose objectives and tools that expand the user's imagination of what AI assistance can look like.
Adaptive interfaces (Gajos et al., 2010): The long tradition of UIs that adjust based on user context. JIT objectives extend this with LLM generativity — instead of selecting from a finite set of pre-built adaptations, the system generates entirely new tools and interfaces.
User modeling (Fischer, 2001; Horvitz et al., 1998): Estimating user goals, effort, or capabilities to personalize systems. JIT objectives inherit this pipeline (observe → infer → adapt) but apply it to steer LLM generation rather than traditional UI selection.
AI chains (Wu et al., 2022): Chaining LLM calls where each step's output becomes the next step's input. JIT objectives add a shared objective that aligns all steps toward the same goal.
Prompt engineering / in-context learning: The practice of carefully crafting prompts. JIT objectives automate the hardest part — figuring out what to ask for in the first place.
Generative user interfaces: Instead of pre-built UI components, systems can generate entirely novel interfaces shaped by user-specific objectives. Poppins demonstrates this is already feasible.
Test-time compute scaling with user alignment: Best-of-N sampling works far better when the evaluator has a clear objective. JIT objectives provide that objective automatically.
Anti-monoculture AI: By producing different objectives for different users in different contexts, JIT objectives break the homogenizing tendency of generic LLM outputs.