Jörke, Sapkota, Warkenthien, Vainio, Schmiedmayer, Brunskill, Landay — CHI 2025

GPTCoach: LLM-Based Physical Activity Coaching

Can an LLM chatbot approximate the onboarding conversation of a real health coaching program? GPTCoach implements Active Choices with motivational interviewing strategies and wearable data integration via tool calling.

Prerequisites: None — this is a systems + empirical HCI paper

Chapter 0: The Problem

Physical inactivity is one of the largest preventable causes of chronic disease worldwide. One in four adults doesn't meet recommended activity guidelines. There is strong clinical evidence for what helps: one-on-one health coaching. A trained professional who talks to you, understands your barriers, and helps you build a plan that actually fits your life.

But coaching doesn't scale. A single coach handles dozens of clients at most. Sessions cost $50-200 per hour. Millions of people who would benefit from coaching will never receive it.

The mobile health promise — and its limits

The solution seemed obvious: put the coaching into an app. Give people step counters, calorie trackers, push notifications. And to be fair, mobile health (mHealth) apps did help — a little. They can deliver quantitative nudges: "You're 2,000 steps short today."

But quantitative data misses the point of coaching. Health behavior change is deeply personal. Your barriers aren't just "not enough steps." They're a bad knee from college soccer, a new baby who wakes you at 5am, a complicated relationship with gym culture, a job that leaves you drained by 6pm. No step counter captures any of that.

The fundamental gap: Mobile health apps are scalable but impersonal. Human coaches are personal but don't scale. Nobody occupies the intersection. The question is: can LLMs bridge this gap?

The personalization-scalability tradeoff. Human coaches and mobile apps occupy opposite corners. The upper-right quadrant is empty.

What is the fundamental limitation of existing mobile health apps for physical activity promotion?

Chapter 1: The Key Insight

GPTCoach's insight is that LLMs have a unique combination of capabilities that neither human coaches nor mobile apps possess alone:

What LLMs bring to health coaching

Conversational flexibility. Unlike rule-based chatbots that follow decision trees, LLMs can hold open-ended conversations. They can ask follow-up questions, probe for context, and respond to unexpected directions — just like a human coach. A user mentions they used to be a competitive swimmer? The LLM can weave that into the goal-setting conversation twenty minutes later.

Knowledge breadth. LLMs have internet-scale knowledge about exercise physiology, injury management, behavioral science, and thousands of activity types. They can synthesize this knowledge in response to a specific user's situation.

Tool use for data integration. Modern LLMs can call functions — meaning they can query a user's actual wearable data mid-conversation. "Let me look at your step count from last month..." isn't scripted; it's the model deciding in real-time that fetching data would help the conversation.

The key design question: Can we use these capabilities to implement an evidence-based health coaching program — not just a generic chatbot that happens to talk about fitness? GPTCoach takes the specific structure of the Stanford Active Choices program and encodes it into an LLM system.

Why not just "use GPT-4"?

The paper shows that vanilla GPT-4 with a simple health coaching prompt fails in predictable ways. It gives unsolicited advice. It prescribes workout plans before understanding your situation. It doesn't ask enough questions. It behaves like a know-it-all personal trainer, not a supportive coach. The insight isn't just "use an LLM" — it's how to structure the LLM to behave like a real coach.

Why doesn't simply prompting GPT-4 to "be a health coach" produce good coaching behavior?

Chapter 2: Formative Research

Before building anything, the team interviewed 22 people: 12 health professionals and 10 potential recipients of health coaching. The professionals included health coaches, personal trainers, fitness instructors, physical therapists, and behavioral scientists. The recipients ranged from NCAA athletes to sedentary office workers to new parents.

Three roles of a coach

The interviews revealed that effective health coaches serve three distinct roles:

Facilitator
Coaches don't prescribe. "You're not in the driver's seat, you're more in the passenger seat, providing maybe direction, steering the conversation one way or the other." Clients must own their behavior change journey.
Educator
Coaches apply advanced knowledge — but only after gathering enough context. "What is motivating you right now? And then trying to find the common threads with things that I know about and can help."
Supporter
Health behavior change is deeply emotional. Coaches build rapport and trust: "Just making everybody feel welcome. That's it. No matter who you are, where you're from."

The data paradox

Both coaches and recipients valued wearable data, but with a crucial caveat: data should guide, not drive. One health educator used this analogy: "How would driving a car be a different experience if you had no gauges in front of you?" Data provides awareness, but it shouldn't be the sole basis for coaching decisions.

Coaches often don't have time to analyze client data — "I can't scale that. I have, like, 20 clients." And when data shows a lack of progress, it can actually hurt motivation. Context is everything: the number says 3,000 steps, but was it a rest day? A sick day? A day when you walked your kid to school in the rain?

Privacy was a real concern. Even when data is fully secured, experts noted that perceived privacy matters more than actual privacy. "It's about what people think about what's going to happen. Family and cultural dynamics that come into play also trust can also play a big role." Any system integrating health data must earn trust through transparency.

What is the central finding about the role of data in effective health coaching?

Chapter 3: Design Principles

From the three coaching roles (facilitator, educator, supporter) and the data insights, the team distilled three core design principles for GPTCoach:

The three design principles derived from formative interviews with health professionals and potential recipients.

DP-1: Follow a facilitative, non-prescriptive approach

Every single health expert described coaching as facilitative, not prescriptive. "We're not advice givers." The chatbot should stay in the passenger seat, empowering clients to make changes rather than telling them what to do. This directly conflicts with LLMs' default behavior, which is to answer questions and solve problems. Because GPTCoach is steered through prompting rather than retrained, getting it to ask questions instead of providing answers is a core engineering challenge.

DP-2: Tailor information using diverse sources of context

When the chatbot does provide information, it must be tailored. Not "exercise 30 minutes a day" but "given your back pain and your 7am commute, here's what might work for you." The chatbot integrates both qualitative context (from the conversation) and quantitative context (from wearable data) to personalize every recommendation.

DP-3: Adopt a supportive, non-judgmental tone

Health behavior change is emotionally charged. Many clients face anxieties about exercise, body image, past injuries, or identity conflicts. The chatbot must be warm, encouraging, and non-judgmental — never shaming, never pushing too hard.

The tension between DP-1 and what users want: Several participants in the lab study simultaneously appreciated the facilitative approach and wanted more prescriptive advice. One participant captured this tension perfectly: "When I face obstacles... here I am asking for someone to be like, 'Hey, you didn't do this,' like a taskmaster, but then, in the moment, you're already feeling like, oh, I didn't do enough... in that case, you would want this kindness." The system must navigate this dynamic.

Why does the facilitative design principle (DP-1) conflict with LLMs' default behavior?

Chapter 4: GPTCoach Architecture

GPTCoach implements the onboarding conversation of the Stanford Active Choices program — an evidence-based, clinically validated physical activity coaching program. The onboarding session covers introductions, past exercise experiences, barriers, motivations, and collaborative FITT-based goal setting (Frequency, Intensity, Time, Type).
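
The four FITT dimensions map naturally onto a small record. The following is only a minimal sketch of how such a goal might be represented; the class name FittGoal, its field names, and the weekly_minutes helper are illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FittGoal:
    """Hypothetical container for one FITT-style goal (not from the paper)."""
    frequency_per_week: int   # Frequency: sessions per week
    intensity: str            # Intensity: e.g. "light", "moderate", "vigorous"
    minutes_per_session: int  # Time: duration of each session
    activity_type: str        # Type: e.g. "brisk walking"

    def weekly_minutes(self) -> int:
        # Total planned minutes per week = frequency x time.
        return self.frequency_per_week * self.minutes_per_session


goal = FittGoal(frequency_per_week=3, intensity="moderate",
                minutes_per_session=30, activity_type="brisk walking")
print(goal.weekly_minutes())  # 90
```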

System components

iOS App
Fetches 3 months of historical HealthKit data (steps, heart rate, workouts, etc.) and uploads to Firestore
Python Backend
FastAPI server handles LLM prompt chains, tool call execution, data aggregation
React Frontend
Chat interface + interactive data visualizations rendered in-conversation
OpenAI API
GPT-4 via Chat Completions API with tool calling support

Three prompt chains

The critical architectural decision: GPTCoach doesn't use a single system prompt. Instead, every user message passes through three sequential prompt chains, each calling a separate GPT agent:

GPTCoach's three-chain architecture. Each user message passes sequentially through the dialogue state, MI strategy, and tool call chains before a response is generated.

Chain 1: Dialogue State Manager. The onboarding session is organized into a linear sequence of states: Onboarding, Program Introduction, Past Experience, Barriers, Motivation, Goal Setting, Advice, Goodbye. An external LLM agent classifies whether the current state's task is complete and advances to the next state when ready. Each state has its own prompt with specific subtasks and guidance from the Active Choices manual.
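
The linear state sequence described above can be sketched as a small state machine. This is a hedged illustration, not the authors' code: the state names come from the text, while the DialogueStateManager class, the is_complete callable (standing in for the LLM classifier), and the toy keyword rule are all assumptions made so the example runs without an API.

```python
from typing import Callable, List

# State names are taken from the text; everything else is illustrative.
STATES: List[str] = [
    "Onboarding", "Program Introduction", "Past Experience", "Barriers",
    "Motivation", "Goal Setting", "Advice", "Goodbye",
]


class DialogueStateManager:
    """Walks the linear state sequence, advancing when a classifier says so."""

    def __init__(self, is_complete: Callable[[str, List[str]], bool]):
        # `is_complete(state, transcript)` stands in for the LLM classifier.
        self._is_complete = is_complete
        self._index = 0

    @property
    def state(self) -> str:
        return STATES[self._index]

    def update(self, transcript: List[str]) -> str:
        """Advance at most one state per user message; never past the last."""
        at_last = self._index == len(STATES) - 1
        if not at_last and self._is_complete(self.state, transcript):
            self._index += 1
        return self.state


def keyword_classifier(state: str, transcript: List[str]) -> bool:
    # Toy stand-in: treat a state as done once the user says "next".
    return bool(transcript) and "next" in transcript[-1].lower()


manager = DialogueStateManager(keyword_classifier)
print(manager.update(["Hi there"]))         # Onboarding
print(manager.update(["OK, next please"]))  # Program Introduction
```

In a real system the classifier would be a model call with the current state's prompt; keeping it as an injected callable makes the control flow testable on its own.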

Chain 2: Motivational Interviewing. While Chain 1 manages what to talk about, Chain 2 manages how to say it. One agent selects an MI strategy from 11 options (drawn from the MISC coding scheme). A second agent generates the response conditioned on that strategy.
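
The select-then-generate split can be sketched as two functions that each take a model as a plain callable. This is an assumption-laden illustration: the prompt wording, the LLM type alias, the fallback to "Question" on an off-list reply, and the stub models are all hypothetical. Only the three strategy names mentioned in this walkthrough are listed; the other eight are left out rather than guessed.

```python
from typing import Callable, List

# Only the three strategies named in this explainer; the remaining eight
# of GPTCoach's 11 options are omitted rather than guessed.
STRATEGIES: List[str] = ["Question", "Giving Information", "Affirm"]

LLM = Callable[[str], str]  # hypothetical: prompt text in, completion text out


def select_strategy(llm: LLM, state: str, transcript: List[str]) -> str:
    """Step 1: one agent picks a strategy; fall back if the reply is off-list."""
    prompt = (
        f"Dialogue state: {state}\n"
        f"Conversation so far: {transcript}\n"
        f"Pick exactly one strategy from: {', '.join(STRATEGIES)}"
    )
    choice = llm(prompt).strip()
    # Falling back to "Question" is an assumption of this sketch.
    return choice if choice in STRATEGIES else "Question"


def generate_response(llm: LLM, strategy: str, transcript: List[str]) -> str:
    """Step 2: a second agent responds, conditioned on the chosen strategy."""
    prompt = (
        f"Respond to the client using the '{strategy}' strategy.\n"
        f"Conversation so far: {transcript}"
    )
    return llm(prompt)


def mi_turn(selector: LLM, responder: LLM, state: str, transcript: List[str]) -> str:
    strategy = select_strategy(selector, state, transcript)
    return generate_response(responder, strategy, transcript)


# Stub models so the sketch runs without any API access.
fake_selector = lambda prompt: "Affirm"
fake_responder = lambda prompt: f"[reply to: {prompt.splitlines()[0]}]"
print(mi_turn(fake_selector, fake_responder, "Barriers", ["I never have time."]))
```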

Chain 3: Tool Call Prediction. If the response generation step didn't already call a tool, this chain asks whether the response should be augmented with health data. If yes, it forces a visualize() call to query and display the user's wearable data.
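
The fallback logic of this chain is small enough to write out. The sketch below assumes a simple dictionary shape for a tool call and uses two injected callables, should_augment and pick_metric, as stand-ins for model decisions; none of these names come from the paper.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical shape, e.g. {"name": "visualize", "metric": "steps"}.
ToolCall = Dict[str, str]


def maybe_force_visualize(
    draft_tool_call: Optional[ToolCall],
    should_augment: Callable[[List[str]], bool],
    pick_metric: Callable[[List[str]], str],
    transcript: List[str],
) -> Optional[ToolCall]:
    """Chain 3 as described above: only acts when no tool was already called."""
    if draft_tool_call is not None:
        return draft_tool_call        # response generation already used a tool
    if not should_augment(transcript):
        return None                   # predictor says data would not help
    return {"name": "visualize", "metric": pick_metric(transcript)}


# Toy predictor: augment whenever the last message mentions steps.
mentions_steps = lambda t: "steps" in t[-1].lower()
call = maybe_force_visualize(None, mentions_steps, lambda t: "steps",
                             ["How many steps did I do last month?"])
print(call)  # {'name': 'visualize', 'metric': 'steps'}
```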

Why three chains instead of one prompt? The team found that vanilla prompting caused the model to veer off-topic, lose the session structure, and give unsolicited advice. The chains enforce structured reasoning: first decide what stage we're in, then decide which conversational strategy to use, then decide if data would help. Each chain constrains the next, producing much more coach-like behavior.

What problem does the three-chain architecture solve that a single system prompt cannot?

Chapter 5: Motivational Interviewing via LLM

Motivational interviewing (MI) is a counseling approach developed by William Miller and Stephen Rollnick in the 1980s. It helps people find their own internal motivation to change, rather than being told what to do by an authority figure. MI is widely regarded as the gold standard for health behavior change conversations.

The OARS framework

MI practitioners use a core set of strategies known as OARS: Open-ended questions, Affirmations, Reflective listening, and Summaries.

11 MI strategies in GPTCoach

GPTCoach implements 11 strategies adapted from the Motivational Interviewing Skills Code (MISC). The MI chain picks one strategy per turn, then generates the response conditioned on it.

Distribution of MI strategies across all conversations. Question dominates (65.7%), followed by Giving Information (12.1%) and Affirm (5.2%).

MI-consistent vs. MI-inconsistent

Using the MITI (Motivational Interviewing Treatment Integrity) coding system, counselor behaviors are grouped here as MI-consistent, neutral, or MI-inconsistent.

GPTCoach's MI scorecard (from expert human coders): 71.1% MI-consistent, 25.5% neutral (Giving Information), and only 3.4% MI-inconsistent. Six trained MI coders evaluated all 16 conversations using the MITI Code 4. By comparison, vanilla GPT-4 without prompt chains produces 34.2% MI-inconsistent responses, roughly ten times as many. The prompt chains are doing their job.
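
The scorecard arithmetic is easy to check. The short snippet below uses only the percentages reported above; the variable names are illustrative.

```python
# Percentages as reported in the text above.
consistent, neutral, inconsistent = 71.1, 25.5, 3.4
vanilla_inconsistent = 34.2

# The three GPTCoach categories should account for every coded response.
total = round(consistent + neutral + inconsistent, 1)

# How many times more MI-inconsistent responses vanilla GPT-4 produced.
ratio = vanilla_inconsistent / inconsistent

print(total)            # 100.0
print(round(ratio, 1))  # 10.1
```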

The counterfactual: vanilla GPT-4

To demonstrate the value of the prompt chains, the team ran a counterfactual analysis. They took the first 5 turns of each real conversation, then simulated 10 different user responses about common activity barriers. Both GPTCoach and vanilla GPT-4 generated responses. The results were stark: vanilla GPT-4 frequently persuaded — giving unsolicited advice, telling users what they should do, and trying to solve problems the user hadn't asked it to solve.

What is the most common MI strategy GPTCoach uses, and why does this matter?

Chapter 6: Wearable Data Integration

GPTCoach can query and visualize a user's health data mid-conversation through tool calling. This isn't a bolted-on feature — it's a fundamental part of how the system personalizes coaching.

Two data tools

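The LLM has access to two functions it can call at any time: describe, which returns aggregated statistics as text, and visualize, which returns the same kind of summary and also renders an interactive chart in the chat window.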
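
With the Chat Completions API, tools are declared as a list of function definitions whose parameters are given as JSON Schema. The sketch below shows what declarations for the two tools named elsewhere in this walkthrough (describe and visualize) might look like. The outer structure follows the API's tools format, but the descriptions and the parameter names (metric, start_date, end_date) are assumptions, since the text does not spell out the actual signatures.

```python
import json


def make_tool(name: str, description: str) -> dict:
    """Build one entry for the Chat Completions `tools` list."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {  # JSON Schema; these fields are assumptions
                "type": "object",
                "properties": {
                    "metric": {"type": "string", "enum": ["steps", "heart_rate"]},
                    "start_date": {"type": "string", "description": "ISO date"},
                    "end_date": {"type": "string", "description": "ISO date"},
                },
                "required": ["metric", "start_date", "end_date"],
            },
        },
    }


TOOLS = [
    make_tool("describe", "Return summary statistics for a health metric as text."),
    make_tool("visualize", "Return summary statistics and show a chart to the user."),
]
print(json.dumps([t["function"]["name"] for t in TOOLS]))  # ["describe", "visualize"]
```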

How tool calls flow

The data pipeline works with Apple HealthKit via an iOS app built on Stanford's Spezi framework. Three months of historical data are uploaded to Google Cloud Firestore. When GPT-4 decides to call a tool, the backend fetches data from Firestore, computes aggregated statistics, and returns them as text. For visualize, an interactive chart (bar chart for counts like steps, line chart for rates like heart rate) also appears in the user's chat window.
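
The backend step described above (fetch, aggregate, return text, and optionally attach a chart) can be sketched with in-memory data in place of Firestore. The function bodies, the summary format, the chart dictionary, and the sample values are all assumptions for illustration; only the bar-for-counts and line-for-rates rule comes from the text.

```python
from statistics import mean
from typing import Dict, List

# Metric kinds follow the text: counts get bar charts, rates get line charts.
METRIC_KIND: Dict[str, str] = {"steps": "count", "heart_rate": "rate"}


def chart_type(metric: str) -> str:
    return "bar" if METRIC_KIND[metric] == "count" else "line"


def describe(metric: str, daily_values: List[float]) -> str:
    """Aggregate raw values into a short text summary for the model."""
    return (
        f"{metric}: n={len(daily_values)} days, "
        f"mean={mean(daily_values):.1f}, "
        f"min={min(daily_values):g}, max={max(daily_values):g}"
    )


def visualize(metric: str, daily_values: List[float]) -> Dict[str, object]:
    """Same text summary for the model, plus a chart spec for the chat UI."""
    return {
        "text": describe(metric, daily_values),
        "chart": {"type": chart_type(metric), "metric": metric, "values": daily_values},
    }


steps = [3000, 8000, 10000]  # made-up sample data, standing in for Firestore
print(describe("steps", steps))
print(visualize("heart_rate", [62, 70, 66])["chart"]["type"])  # line
```

Returning text to the model in both cases keeps the conversation grounded in the same numbers the user sees in the chart.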

The tool calling flow. GPT-4 decides when to query health data based on conversational context.

Data in practice: 59 tool calls across 16 conversations

GPTCoach made 59 tool calls across the 16 lab study conversations, clustered in three dialogue states.

The best moments: When GPTCoach used data well, it was powerful. One participant shared feeling depressed when they don't exercise. The system responded: "From your data, it seems you've been engaging in a diverse and healthy mix of activities — 35 workouts with varying duration! That's a wonderful achievement." It used data not to prescribe, but to affirm and empower.

Why does GPTCoach use two separate data tools (describe and visualize) instead of always showing a chart?

Chapter 7: Lab Study Results

The team evaluated GPTCoach in a lab study with 16 participants. Each participant uploaded three months of HealthKit data, then had a one-hour coaching session with the chatbot (5 in-person, 11 remote via Zoom). Here's what happened.

Participants

The sample was deliberately diverse: ages 21-71 (mean 38.2), 10 female and 6 male, spanning all stages of change from precontemplation to maintenance, and activity levels from low to high. Most had basic AI knowledge; one was a novice, two were advanced.

User experience ratings

The survey results were overwhelmingly positive:

User experience ratings (1-5 Likert scale). Bars show mean scores across 16 participants.

What participants actually said

Participants recognized the facilitative approach: "It sort of met me where I was at. It first asked about contextual things before prescribing anything." They appreciated the personalization: "I really liked that it was accurate, that it was my personal thing and not just abstract pictures from the Internet."

Trust built incrementally. One participant started skeptical about back pain being taken seriously: "Usually systems don't take that into account, so I already don't have trust." But as GPTCoach acknowledged their concerns, trust grew over the conversation.

The verbosity problem: GPT-4's tendency toward long responses bothered some participants. One felt the questions came too "rapid fire." Others wished the system had proactively asked for more specifics — particular days and times, specific routines. Usability scored 82.4% overall, with the lowest marks in "habitability" (knowing what to say) and "speed" (response time).

Conversation dynamics

The average conversation lasted about 23 agent turns. GPTCoach spent the most turns on goal setting (24.5% of turns), advice (15.4%), and past experience (14.8%). This matches how human coaches allocate time: more on understanding and planning than on introductions.

Which two aspects of GPTCoach received the highest participant ratings?

Chapter 8: Lessons Learned

GPTCoach is a technology probe — not a finished product. The team is explicit about what worked, what didn't, and what needs more thought.

Complement, don't replace

LLM coaches should complement human coaches, not replace them. The system excels at initial onboarding, data integration, and always being available. But it cannot replace the embodied, relational connection of a human coach. One expert put it memorably: "I don't think that my type of job, instructor wise, will ever be taken... Even though it will have all the information, it's not personal."

The specificity gap

Some participants wanted more concrete plans: specific days, times, exercises. Others were happy with high-level guidance. This suggests the system needs to adapt its specificity to each user's needs — potentially by asking "Would you like me to be more specific?" or detecting cues from conversation context.

Out-of-scope handling

Participants inevitably raised topics outside physical activity — nutrition, weight loss, mental health, sleep. The Active Choices program treats these as related but out-of-scope. GPTCoach was designed to acknowledge them briefly and redirect, but the boundaries are fuzzy. A participant mentioning depression alongside inactivity needs careful handling.

Safety considerations

The team deliberately limited the study to a single supervised session. LLM health coaching carries real risks: inaccurate medical advice, encouraging harmful exercise with injuries, creating emotional dependency, or reinforcing unhealthy behaviors. The paper argues for several safeguards to address them.

The identity question: The deepest barriers to physical activity aren't logistical — they're identity-based. One participant shared: "It's pretty depressing some days... it's like I'm missing half of myself... I'm just a mom, and then I think back on those days when I did skate and compete pretty regularly." Can an AI navigate identity conflicts around physical activity? This is perhaps the hardest open question in LLM health coaching.

Over-reliance risks

Several participants formed surprisingly strong connections in a single session: "It made me really think about exercise and how positive it can be. When it brought up, who do you do this for? What motivates you? It really touched my heart a little bit." This is simultaneously a success (the system created genuine engagement) and a risk (what happens when the system fails or says something wrong?).

Why did the team deliberately limit GPTCoach evaluation to a single supervised session?

Chapter 9: Connections

Successor: Bloom (CHI 2026)

GPTCoach directly led to Bloom, the same team's follow-up system. Where GPTCoach was a single-session technology probe, Bloom is a full iOS app tested in a 4-week RCT (N=54). Bloom keeps the coaching chatbot ("Beebo") but adds non-conversational behavior change interactions: goal setting widgets, ambient displays, contextual push notifications, and action planning tools. The key insight from GPTCoach — that LLMs should augment established HCI interactions, not replace them — became Bloom's central design principle.

Active Choices program

GPTCoach implements the onboarding conversation of Stanford's Active Choices, a clinically validated physical activity counseling program. Active Choices is grounded in the transtheoretical model (stages of change) and social cognitive theory. The full program includes follow-up contacts over 6+ months; GPTCoach tackles only the initial onboarding.

Motivational interviewing

MI (Miller & Rollnick, 1983) is a person-centered counseling approach for health behavior change. GPTCoach's MI chain implements strategies from the MISC coding scheme and is evaluated using MITI Code 4 — the standard instrument for assessing MI fidelity in both human and automated counselors.

LLMs for health

GPTCoach sits in a growing body of work applying LLMs to health: mental health counseling (Chiu et al., 2024), health inference tasks (health question answering), conversational agents for diagnosis, and commercial products like WHOOP Coach. What distinguishes GPTCoach is its grounding in an evidence-based program and its rigorous MI evaluation by trained human coders.

Tool use and function calling

GPTCoach's wearable data integration is an early example of LLM tool use in a real-world health context. The model decides when to call describe() or visualize() based on conversational context, the same function-calling paradigm used by agent frameworks such as LangChain and CrewAI and by OpenAI's Assistants API.

Cheat sheet:
What: LLM chatbot implementing Active Choices health coaching onboarding with MI strategies + wearable data
How: GPT-4 + 3 prompt chains (dialogue state, MI strategy, tool call) + HealthKit data pipeline
Evaluated: Formative interviews (N=22) + Lab study (N=16) with expert MI coding
Key result: 71.1% MI-consistent vs. 3.4% MI-inconsistent; vanilla GPT-4 = 34.2% MI-inconsistent
Limitation: Single supervised session only; no longitudinal behavior change measurement
Legacy: Design principles and architecture informed Bloom (CHI 2026)

What is the relationship between GPTCoach and Bloom?