Lam, Shaikh, Xu, Guo, Yang, Heer, Landay, Bernstein — CHI 2026

Just-In-Time Objectives

Infer the user's goal from passive observation, then optimize everything downstream for that single objective. 66-86% win rates over generic LLMs.

Prerequisites: LLM prompting basics + Human-AI interaction intuition
10 Chapters · 5+ Simulations

Chapter 0: The Problem

You're writing the introduction to your research paper. You've been staring at it for an hour. You ask an LLM for help. It responds with:

This is useful advice. It's also utterly generic. A seasoned researcher in your field would have given you something far more specific: "Your core argument about why objectives should be induced rather than specified is buried in paragraph three. Lead with it."

Why does the LLM produce milquetoast feedback? Because it doesn't know what you specifically need right now. It has no idea whether you're struggling with the logical flow of an HCI paper, tightening a quantitative evaluation, or reframing a contribution for a different audience. So it defaults to advice that works for everyone and inspires no one.

The fundamental mismatch: LLM training objectives are defined far in advance and must work for all users. Post-training (RLHF) optimizes against many simultaneous objectives (reasoning, safety, helpfulness), which pushes the model toward generic, committee-approved outputs. Even at interaction time, users struggle to articulate what they want ("Is it worth asking the model to critique my draft, or should I just ask for paper recommendations, or something else?"). The result: everyone gets the same bland output.

This isn't just an annoyance. Research shows that generic LLM outputs promote monocultures — steering users toward homogeneous, convergent thinking even when individual outputs appear creative. At a population level, everyone's writing starts to sound the same.

Why do LLMs produce generic outputs when asked for help with writing?

Chapter 1: The Key Insight

The core idea is deceptively simple. Instead of asking users to specify objectives (tedious, and they often don't know what they want), observe their behavior and infer the objective. Then optimize aggressively for that one inferred goal.

Think about what a skilled human collaborator does. If you hand a colleague your paper draft, they don't ask you to write a detailed specification of what kind of feedback you want. They look at what you're working on, infer what you're struggling with, and give advice calibrated to that specific struggle.

The calculus analogy: A user's objective over long stretches of time is complex and curved, shaped by their writing philosophy, career goals, and aesthetic preferences. That's hard to model. But just as in calculus, even a complex curve is well approximated by a straight line over a sufficiently small interval. JIT objectives capture these local, in-the-moment goals: not "make me a better writer," but "clarify the research contribution in this abstract for HCI reviewers."

This reframes the entire interaction paradigm. Instead of:

Traditional: User writes prompt → LLM guesses what they want → Generic output
JIT Objectives: System observes user → Induces specific objective → Optimizes for THAT goal → Specialized output

The objective becomes a first-class interactive object: visible (you can see what the AI thinks you want), modifiable (you can correct it), and equipped to steer any number of downstream AI systems simultaneously.

When the authors applied this approach to their own paper-writing process, instead of generic syntax editing, the system produced an objective of "explain the system clearly in this CHI paper introduction." That single objective unlocked outputs that reworked the paragraph's logical flow to match related CHI papers, simulated feedback from likely CHI reviewers, and identified where the narrative veered away from describing the system.

What makes JIT objectives different from asking the user to write a better prompt?

Chapter 2: JIT Objectives Architecture

The architecture has three stages. Each is simple on its own — the power comes from chaining them together with a shared objective.

Stage 1: Observe

Passively capture the user's context. This could be a browser screenshot, text from a document, cursor position, recent edits, or file attachments. The key word is passively — the user doesn't have to do anything special.

Stage 2: Induce

A vision-language model takes the observed context and infers candidate objectives. Each objective is a JSON object with three fields:

name: "Strengthen the narrative argument"
description: "Develop a compelling narrative that emphasizes how JIT objectives improve LLM systems by centering user needs with minimal developer effort."
weight: 9 (estimated importance on a 1-10 scale)
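To make the three fields concrete, here is a minimal Python sketch of how such an objective could be represented and serialized. The class name `Objective`, the range check, and the helper methods are illustrative assumptions, not the paper's code; only the field names and the 1-10 scale come from the description above.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Objective:
    """One induced objective (hypothetical container for the three fields)."""
    name: str
    description: str
    weight: int  # estimated importance, 1-10

    def __post_init__(self) -> None:
        # The 1-10 scale is from the text; rejecting other values is our choice.
        if not 1 <= self.weight <= 10:
            raise ValueError(f"weight must be in 1..10, got {self.weight}")

    def to_json(self) -> str:
        """Serialize to the JSON form that gets passed to downstream prompts."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, raw: str) -> "Objective":
        """Rebuild an objective from its JSON form."""
        data = json.loads(raw)
        return cls(
            name=data["name"],
            description=data["description"],
            weight=int(data["weight"]),
        )


objective = Objective(
    name="Strengthen the narrative argument",
    description=(
        "Develop a compelling narrative that emphasizes how JIT objectives "
        "improve LLM systems by centering user needs with minimal developer effort."
    ),
    weight=9,
)
round_tripped = Objective.from_json(objective.to_json())
print(objective.to_json())
```

Keeping the objective as plain JSON is what lets the same object be shown to the user, edited, and handed to several downstream prompts.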

Stage 3: Optimize

The induced objective is applied to downstream systems via two operators:

The generate-then-rank pattern: This architecture is loosely analogous to actor-critic systems in RL. The generator (the actor) proposes candidates shaped by the objective. The evaluator (the critic) scores those candidates against the same objective. The highest-scoring candidate wins. Sampling more candidates (best-of-N) trades extra inference-time compute for higher quality.
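The generate-then-rank pattern can be sketched in a few lines of Python. Everything here is a toy stand-in: `toy_generate` returns canned drafts and `toy_evaluate` counts word overlap with the objective name, where a real system would call an LLM for both roles. The function names and the dict-based objective are assumptions for illustration.

```python
from typing import Callable

# Canned candidate outputs standing in for LLM samples (hypothetical data).
DRAFTS = [
    "Fix the comma splice in sentence two.",
    "Open with the narrative argument, then support it with the study results.",
    "Consider adding more citations.",
]


def toy_generate(objective: dict, seed: int) -> str:
    """Stand-in generator: a real one would prompt an LLM with the objective."""
    return DRAFTS[seed % len(DRAFTS)]


def toy_evaluate(objective: dict, candidate: str) -> float:
    """Stand-in evaluator: counts words shared with the objective name."""
    wanted = set(objective["name"].lower().split())
    found = set(candidate.lower().replace(",", " ").replace(".", " ").split())
    return float(len(wanted & found))


def generate_then_rank(
    objective: dict,
    generate: Callable[[dict, int], str],
    evaluate: Callable[[dict, str], float],
    n: int,
) -> str:
    """Sample n candidates, score each against the same objective, keep the best."""
    candidates = [generate(objective, i) for i in range(n)]
    return max(candidates, key=lambda c: evaluate(objective, c))


objective = {
    "name": "Strengthen the narrative argument",
    "description": "Develop a compelling narrative.",
    "weight": 9,
}
best = generate_then_rank(objective, toy_generate, toy_evaluate, n=3)
print(best)
```

The point of the sketch is the shared argument: both callables receive the same `objective`, so generation and selection cannot drift toward different notions of "good."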
What are the three stages of the JIT objectives architecture?

Chapter 3: Observation and Context

The quality of induced objectives depends heavily on the quality of the observations. Garbage in, garbage out. So what does the system actually observe?

Input modalities

The Poppins system (the paper's concrete instantiation) accepts three types of input:

A single screenshot carries surprisingly rich information. A vision-language model can see that you're in Overleaf editing the System section, that you have comments from co-authors visible in the margin, that your cursor is positioned in the third paragraph, and that your references panel shows HCI papers. From this, it can infer: "User is iterating on the System section by integrating feedback from collaborators."

Context windows

Users can tune the temporal scope of objective induction. Some want micro-objectives for the next minute ("Highlight and delete other usages of an outdated system name"). Others want macro-objectives spanning weeks ("Improve the clarity of my academic writing"). The default targets a sweet spot: objectives for the current work session.

Why passive observation beats explicit prompting: When users manually prompt an LLM, they typically underspecify. "Give feedback on this draft" omits everything the system needs to know — what stage the draft is at, who the audience is, what specific aspect the user is struggling with. A screenshot captures all of this context implicitly. The user doesn't have to think about what to communicate because the system can see it.
Why does passive observation (like screenshots) produce better objectives than explicit user prompts?

Chapter 4: Objective Induction

Objective induction is where the magic happens. The system takes raw observations and produces structured, actionable objectives. Let's trace through exactly how.

The induction process

A vision-language model receives the user's context (screenshot + text) and follows a chain-of-thought process:

  1. Task domain: What field is the user working in? (e.g., academic writing, data analysis, design)
  2. Stage of completion: Is this a rough draft, a polished revision, or a final check?
  3. Potential audience: Who will read/use this? (e.g., CHI reviewers, a thesis committee, a client)
  4. Ideal final output: What would success look like?
  5. Anticipated reaction to assistance: What kind of help would the user welcome versus find annoying?

From this reasoning, the model produces multiple candidate objectives, each with a name, description, and importance weight.
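The induction step could be sketched as a prompt builder plus a parser for the model's reply. This is a hedged illustration only: the prompt wording, the function names, and the bracket-based JSON extraction are assumptions, the screenshot input is omitted (only the text context is handled), and the actual Poppins prompt is not given in this summary.

```python
import json

# The five reasoning factors listed above, in order.
REASONING_STEPS = [
    "Task domain",
    "Stage of completion",
    "Potential audience",
    "Ideal final output",
    "Anticipated reaction to assistance",
]


def build_induction_prompt(context_text: str, k: int = 3) -> str:
    """Assemble a chain-of-thought prompt asking for k candidate objectives."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(REASONING_STEPS, start=1))
    return (
        "You are observing a user's workspace.\n"
        f"Reason step by step about:\n{steps}\n"
        f"Then output a JSON list of {k} objectives, each with "
        '"name", "description", and "weight" (1-10).\n\n'
        f"Observed context:\n{context_text}"
    )


def parse_objectives(model_output: str) -> list[dict]:
    """Pull the JSON list out of a reply that may begin with free-form reasoning.

    Fragile by design: assumes the reasoning text contains no square brackets.
    """
    start, end = model_output.find("["), model_output.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON list found in model output")
    items = json.loads(model_output[start : end + 1])
    required = {"name", "description", "weight"}
    # Keep only well-formed entries; drop anything missing a field.
    return [o for o in items if isinstance(o, dict) and required.issubset(o)]


# A made-up reply, used only to exercise the parser.
fake_reply = (
    "Task domain: academic writing. Stage: revision. Audience: CHI reviewers.\n"
    '[{"name": "Clarify the research contribution", '
    '"description": "Make the novel contribution legible to CHI reviewers.", '
    '"weight": 9}, {"name": "Incomplete entry"}]'
)
prompt = build_induction_prompt("Overleaf, abstract section, co-author comments visible")
objectives = parse_objectives(fake_reply)
print(objectives)
```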

Example: A researcher editing a paper abstract

The system observes a researcher editing an abstract in Overleaf with co-author comments visible. It might produce:

Objective 1 (weight: 9): "Clarify the abstract's research contribution". Ensure the abstract clearly communicates the novel contribution and distinguishes it from prior work, making the claim legible to CHI reviewers.
Objective 2 (weight: 7): "Clarify the technical architecture". Refine the description of the JIT objectives architecture, ensuring components and their relationships are clearly defined.
Objective 3 (weight: 5): "Strengthen quantitative evidence". Ensure the evaluation numbers and study design are presented compellingly.

The highest-weighted objective becomes the active one by default, but users can select, modify, or create alternatives.
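The default-selection rule is simple enough to write down. In this sketch the function name `active_objective` and the `user_choice` parameter are hypothetical; the three candidates are the ones from the example above with descriptions omitted.

```python
from typing import Optional

candidates = [
    {"name": "Clarify the abstract's research contribution", "weight": 9},
    {"name": "Clarify the technical architecture", "weight": 7},
    {"name": "Strengthen quantitative evidence", "weight": 5},
]


def active_objective(candidates: list[dict], user_choice: Optional[dict] = None) -> dict:
    """Return the user's pick if there is one, else the highest-weighted candidate."""
    if user_choice is not None:
        return user_choice
    return max(candidates, key=lambda o: o["weight"])


default = active_objective(candidates)
overridden = active_objective(candidates, user_choice=candidates[2])
print(default["name"])
print(overridden["name"])
```

Putting the user's choice first in the function mirrors the design intent: the weight is only a default, never a constraint.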

Why multiple candidates matter: The system doesn't pretend to know exactly what you want. It generates a ranked list of plausible objectives and lets you confirm, modify, or override. This is fundamentally different from a system that silently assumes a single objective. Making the objective visible turns the AI's reasoning into a collaborative negotiation rather than a black-box guess.
What chain-of-thought factors does the system consider when inducing objectives?

Chapter 5: Downstream Specialization

Once an objective is induced, it powers two types of specialization. Both use the same lightweight mechanism: prepending the objective JSON to existing prompts.
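A minimal sketch of that prepend mechanism, assuming a text-only prompt: the function name follows the gen_objective operator mentioned in this guide, but the "Objective:" header and the two base prompts are invented for illustration and are not the paper's prompts.

```python
import json

# Hypothetical pre-existing prompts for the two downstream systems.
EXPERT_BASE = "Propose three experts who could give feedback on the user's work."
TOOL_BASE = "Design one small interactive tool that would help the user right now."


def gen_objective(objective: dict, base_prompt: str) -> str:
    """Prepend the objective JSON to an existing generation prompt, unchanged otherwise."""
    return f"Objective:\n{json.dumps(objective, indent=2)}\n\n{base_prompt}"


objective = {
    "name": "Strengthen the narrative argument",
    "description": "Develop a compelling narrative for the paper.",
    "weight": 9,
}
expert_prompt = gen_objective(objective, EXPERT_BASE)
tool_prompt = gen_objective(objective, TOOL_BASE)
print(expert_prompt)
```

Because the base prompts are left untouched, an existing pipeline can adopt the objective without being redesigned, which is what makes the mechanism lightweight.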

Expertise generation (Poppins-experts)

Given the objective "Strengthen the narrative argument," the system generates expert perspectives. Not generic "writing expert" personas — deeply specialized ones:

Each expert comes with detailed background material retrieved via LLM search — specific publications, talks, projects, methodologies, and key ideas. This isn't surface-level persona prompting. The objective shapes what kind of expertise is relevant.

Tool generation (Poppins-tools)

Even more ambitiously, the system can generate entirely new interactive software tools tailored to the objective. From "Create clear visual representations of the AI system," Poppins generated:

Each participant gets UNIQUE tools: In the paper's user study, no two participants received the same generated tool. A scholarship essay writer got a "Cultural Perspective Highlighter." A researcher working on microcontrollers got a "Neural Architecture Search Explorer." A bioengineering student got a "Technical Protocol Generator." A fiction writer got a "Character Emotion Tracker." The objective is what makes each tool distinct.
How does a JIT objective transform expert generation from generic to specialized?

Chapter 6: Evaluation Against Objectives

Generation is only half the story. The other half is evaluation — and this is where JIT objectives create the most dramatic improvements.

The problem with generic evaluation

Consider a standard LLM-as-a-judge setup. You generate 10 feedback candidates and ask the judge to pick the best one. Without a JIT objective, the judge evaluates on generic criteria — "overall quality," "intellectual rigor," "helpfulness." The result? Most candidates score similarly. The judge can't differentiate because it doesn't know what specifically matters.

Objective-aligned evaluation

Add the JIT objective "Strengthen the narrative argument" to the judge's prompt. Now it can distinguish between feedback that merely mentions the importance of narratives and feedback that provides concrete strategies for incorporating narrative structure. The scores spread out. The best candidate becomes clearly distinguishable.

The eval_objective operator: Just as gen_objective steers generation, eval_objective steers evaluation. Same JSON specification, same prepend-to-prompt mechanism, completely different effect. In generation, the objective shapes what's produced. In evaluation, the objective shapes what's selected. Together, they create a generate-then-rank pipeline where both the actor and the critic share a unified understanding of what "good" means.
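On the evaluation side, the same prepend idea might look like the sketch below. The rubric text, the "Score: <number>" reply format, and the `parse_score` helper are assumptions made so the example is self-contained; the source only states that the objective is prepended to the judge's prompt.

```python
import json
import re


def eval_objective(
    objective: dict,
    candidate: str,
    base_rubric: str = "Rate the feedback from 1 to 10.",
) -> str:
    """Build a judge prompt: objective JSON first, then the rubric and the candidate."""
    return (
        f"Objective:\n{json.dumps(objective, indent=2)}\n\n"
        f"{base_rubric}\n"
        "Judge the candidate only by how well it serves the objective above.\n"
        "Reply with a line of the form 'Score: <number>'.\n\n"
        f"Candidate:\n{candidate}"
    )


def parse_score(judge_reply: str) -> float:
    """Extract the numeric score from a judge reply in the assumed format."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", judge_reply)
    if match is None:
        raise ValueError("no score found in judge reply")
    return float(match.group(1))


objective = {
    "name": "Strengthen the narrative argument",
    "description": "Develop a compelling narrative for the paper.",
    "weight": 9,
}
judge_prompt = eval_objective(objective, "Lead with the core argument in paragraph one.")
score = parse_score("The candidate gives a concrete restructuring step.\nScore: 8")
print(judge_prompt)
```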

Best-of-N scaling

The generate-then-rank architecture enables test-time compute scaling. Generate N candidates with gen_objective, then select the best with eval_objective. As N increases, quality improves: the evaluator has more candidates to choose from, and the objective helps it pick the one that best serves the user's goal.

The paper tested N = 1, 10, and 100. Quality improved consistently with N, confirming that JIT evaluators are strong enough to identify the best candidate from a large pool.
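Why best-of-N depends on the evaluator can be seen in a toy simulation. This is not the paper's experiment and uses no data from it: candidate quality is drawn uniformly from [0, 1], and the evaluator sees that quality plus Gaussian noise, where the `noise` parameter is a made-up stand-in for how well the judge's scores track what the user actually needs.

```python
import random


def mean_selected_quality(n: int, noise: float, trials: int = 2000, seed: int = 0) -> float:
    """Average true quality of the candidate a noisy evaluator picks from n samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        true_quality = [rng.random() for _ in range(n)]
        # The evaluator never sees true quality, only a noisy estimate of it.
        observed = [q + rng.gauss(0.0, noise) for q in true_quality]
        best = max(range(n), key=observed.__getitem__)
        total += true_quality[best]
    return total / trials


# "sharp" = scores track quality closely; "blurry" = scores are mostly noise.
sharp = {n: mean_selected_quality(n, noise=0.1) for n in (1, 10, 100)}
blurry = {n: mean_selected_quality(n, noise=1.0) for n in (1, 10, 100)}

for n in (1, 10, 100):
    print(f"N={n:>3}  sharp={sharp[n]:.3f}  blurry={blurry[n]:.3f}")
```

In this toy model, a larger N helps much more when the evaluator's scores are informative, which is the role the objective is meant to play for the judge.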

Why does adding a JIT objective to an LLM evaluator produce better-differentiated scores?

Chapter 7: Results

The paper runs three evaluations, each more complex than the last: isolated objectives, objective-optimized outputs, and full tool generation in the wild.

Study 1: Accuracy and utility (N=14)

14 participants submitted browser traces from their daily work over three days, yielding 70 unique contexts. Results:

Study 2: Generalizability (N=205)

205 online participants submitted 410 workspace screenshots. The system performed live objective induction on each. Results:

Study 3: In-person use sessions (N=17)

17 participants used Poppins on their own writing tasks for one hour each. They compared Poppins-experts and Poppins-tools against a standard LLM chat baseline. Results:

The baseline is not a straw man: Both JIT and baseline conditions used the same model (Claude Sonnet 3.7), the same user screenshot, and the same base prompt. The ONLY difference was whether an induced objective was prepended. That single addition, a few sentences of JSON, produced 66-86% win rates. The objective is the mechanism that unlocks specialization.
What was the ONLY difference between the JIT condition and the baseline in the paper's experiments?

Chapter 8: Interactive Objectives

The deepest contribution of the paper isn't the architecture or the win rates. It's the idea that objectives should be first-class interactive objects in the UI. What does that mean?

Visible

The user can see what the AI thinks they want. Instead of a black box that silently generates output, the system surfaces its inferred objective: "Strengthen the narrative argument (weight: 9)." This transparency is itself valuable — it prompts the user to reflect on their own goals.

Modifiable

The user can edit any part of the objective. Don't agree with "Strengthen the narrative argument"? Change it to "Tighten the quantitative evaluation section." The description and weight are editable too. Users can also select from alternative candidates, add entirely new objectives, or delete ones they don't want.

Steerable

A single objective controls multiple downstream systems simultaneously. Change the objective once, and the expertise generator, the tool builder, and the evaluation criteria all update together. This is far more efficient than manually adjusting prompts for each system independently.

The UI affordances: Poppins provides four actions on objectives: Select (pick a different candidate), Edit (rewrite any field), Add (manually author a new objective), Delete (remove unwanted objectives). These same actions apply to generated experts and tool designs — every intermediate generation is modifiable, not just the final output.
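One way to picture the four actions, together with the "change once, update everywhere" behavior, is a small state container that notifies downstream systems whenever the active objective changes. The `ObjectiveBoard` class, its listener mechanism, and the fallback rule on deletion are all illustrative assumptions, not a description of how Poppins is implemented.

```python
from dataclasses import dataclass, field, replace
from typing import Callable


@dataclass(frozen=True)
class Objective:
    name: str
    description: str
    weight: int


@dataclass
class ObjectiveBoard:
    """Hypothetical holder for candidate objectives with Select/Edit/Add/Delete."""
    candidates: list[Objective] = field(default_factory=list)
    active_index: int = 0
    listeners: list[Callable[[Objective], None]] = field(default_factory=list)

    @property
    def active(self) -> Objective:
        return self.candidates[self.active_index]

    def _notify(self) -> None:
        # Every downstream system hears about the new active objective.
        if self.candidates:
            for listener in self.listeners:
                listener(self.active)

    def select(self, index: int) -> None:
        if not 0 <= index < len(self.candidates):
            raise IndexError(index)
        self.active_index = index
        self._notify()

    def edit(self, index: int, **changes) -> None:
        self.candidates[index] = replace(self.candidates[index], **changes)
        if index == self.active_index:
            self._notify()

    def add(self, objective: Objective) -> None:
        self.candidates.append(objective)  # does not change the active objective

    def delete(self, index: int) -> None:
        del self.candidates[index]
        if index < self.active_index:
            self.active_index -= 1  # the same objective stays active
        elif index == self.active_index:
            self.active_index = 0  # fall back to the first remaining candidate
            self._notify()


expert_log: list[str] = []
tool_log: list[str] = []
board = ObjectiveBoard(
    candidates=[
        Objective("Strengthen the narrative argument", "Develop a compelling narrative.", 9),
        Objective("Clarify the technical architecture", "Define components clearly.", 7),
    ],
    listeners=[
        lambda o: expert_log.append(f"experts for: {o.name}"),
        lambda o: tool_log.append(f"tools for: {o.name}"),
    ],
)
board.edit(0, name="Tighten the quantitative evaluation section")
board.select(1)
print(expert_log, tool_log)
```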

This design resolves a classic tension in adaptive interfaces. Traditional adaptive UIs suffer from unpredictability — buttons move, menus change, and users feel out of control. JIT objectives sidestep this by making the adaptation criterion itself visible and editable. The UI can change dramatically (generating an entirely new tool), but the user understands why and can steer the direction.

The "I would never have thought of this" effect

Several participants were struck by tools they never would have requested but found deeply useful. P19 on the Technical Protocol Generator: "This is something that I never would have thought about, and now I find it super helpful." This is the payoff of inference over explicit specification — the system can propose objectives and tools that expand the user's imagination of what AI assistance can look like.

What three properties make JIT objectives "first-class interactive objects"?

Chapter 9: Connections

What JIT Objectives build on

Adaptive interfaces (Gajos et al., 2010): The long tradition of UIs that adjust based on user context. JIT objectives extend this with LLM generativity — instead of selecting from a finite set of pre-built adaptations, the system generates entirely new tools and interfaces.

User modeling (Fischer, 2001; Horvitz et al., 1998): Estimating user goals, effort, or capabilities to personalize systems. JIT objectives inherit this pipeline (observe → infer → adapt) but apply it to steer LLM generation rather than traditional UI selection.

AI chains (Wu et al., 2022): Chaining LLM calls where each step's output becomes the next step's input. JIT objectives add a shared objective that aligns all steps toward the same goal.

Prompt engineering / in-context learning: The practice of carefully crafting prompts. JIT objectives automate the hardest part — figuring out what to ask for in the first place.

What JIT Objectives enable

Generative user interfaces: Instead of pre-built UI components, systems can generate entirely novel interfaces shaped by user-specific objectives. Poppins demonstrates this is already feasible.

Test-time compute scaling with user alignment: Best-of-N sampling works far better when the evaluator has a clear objective. JIT objectives provide that objective automatically.

Anti-monoculture AI: By producing different objectives for different users in different contexts, JIT objectives break the homogenizing tendency of generic LLM outputs.

The bigger picture: JIT objectives point toward a future where AI systems don't just respond to what users say, but understand what users need. The objective is a simple mechanism — a few sentences of JSON — but it bridges the gulf between generic AI and personalized AI. The key insight: you don't need to retrain the model to specialize it. You just need to tell it what to optimize for, in the moment, for this specific user.

Cheat sheet

Core idea: Observe user → Infer objective → Optimize generation + evaluation for THAT objective
Mechanism: Objective JSON (name, description, weight) prepended to prompts via gen_objective / eval_objective operators
Key results: 66-86% win rates over baseline LLM; unique tools per user; significantly higher quality ratings (p < .05)
Innovation: Objectives as first-class interactive objects: visible, modifiable, steerable
System: Poppins (browser extension + web app), using Claude Sonnet 3.7 + o3-mini + GPT-4.1-mini
How do JIT objectives help counteract the "monoculture" problem of generic LLM outputs?