Can an LLM chatbot approximate the onboarding conversation of a real health coaching program? GPTCoach implements Active Choices with motivational interviewing strategies and wearable data integration via tool calling.
Physical inactivity is one of the largest preventable causes of chronic disease worldwide. One in four adults doesn't meet recommended activity guidelines. We know with clinical certainty what helps: one-on-one health coaching. A trained professional who talks to you, understands your barriers, and helps you build a plan that actually fits your life.
But coaching doesn't scale. A single coach handles dozens of clients at most. Sessions cost $50-200 per hour. Millions of people who would benefit from coaching will never receive it.
The solution seemed obvious: put the coaching into an app. Give people step counters, calorie trackers, push notifications. And to be fair, mobile health (mHealth) apps did help — a little. They can deliver quantitative nudges: "You're 2,000 steps short today."
But quantitative data misses the point of coaching. Health behavior change is deeply personal. Your barriers aren't just "not enough steps." They're a bad knee from college soccer, a new baby who wakes you at 5am, a complicated relationship with gym culture, a job that leaves you drained by 6pm. No step counter captures any of that.
The personalization-scalability tradeoff. Human coaches and mobile apps occupy opposite corners. The upper-right quadrant is empty.
GPTCoach's insight is that LLMs have a unique combination of capabilities that neither human coaches nor mobile apps possess alone:
Conversational flexibility. Unlike rule-based chatbots that follow decision trees, LLMs can hold open-ended conversations. They can ask follow-up questions, probe for context, and respond to unexpected directions — just like a human coach. A user mentions they used to be a competitive swimmer? The LLM can weave that into the goal-setting conversation twenty minutes later.
Knowledge breadth. LLMs have internet-scale knowledge about exercise physiology, injury management, behavioral science, and thousands of activity types. They can synthesize this knowledge in response to a specific user's situation.
Tool use for data integration. Modern LLMs can call functions, which means they can query a user's actual wearable data mid-conversation. "Let me look at your step count from last month..." isn't scripted; it's the model deciding in real time that fetching data would help the conversation.
The paper shows that vanilla GPT-4 with a simple health coaching prompt fails in predictable ways. It gives unsolicited advice. It prescribes workout plans before understanding your situation. It doesn't ask enough questions. It behaves like a know-it-all personal trainer, not a supportive coach. The insight isn't just "use an LLM" — it's how to structure the LLM to behave like a real coach.
Before building anything, the team interviewed 22 people: 12 health professionals and 10 potential recipients of health coaching. The professionals included health coaches, personal trainers, fitness instructors, physical therapists, and behavioral scientists. The recipients ranged from NCAA athletes to sedentary office workers to new parents.
The interviews revealed that effective health coaches serve three distinct roles: facilitator, educator, and supporter.
Both coaches and recipients valued wearable data, but with a crucial caveat: data should guide, not drive. One health educator used this analogy: "How would driving a car be a different experience if you had no gauges in front of you?" Data provides awareness, but it shouldn't be the sole basis for coaching decisions.
Coaches often don't have time to analyze client data — "I can't scale that. I have, like, 20 clients." And when data shows a lack of progress, it can actually hurt motivation. Context is everything: the number says 3,000 steps, but was it a rest day? A sick day? A day when you walked your kid to school in the rain?
From the three coaching roles (facilitator, educator, supporter) and the data insights, the team distilled three core design principles for GPTCoach:
The three design principles derived from formative interviews with health professionals and potential recipients.
Every single health expert described coaching as facilitative, not prescriptive. "We're not advice givers." The chatbot should stay in the passenger seat, empowering clients to make changes rather than telling them what to do. This directly conflicts with LLMs' default behavior — they want to answer questions and solve problems. Training GPTCoach to ask questions instead of providing answers is a core engineering challenge.
When the chatbot does provide information, it must be tailored. Not "exercise 30 minutes a day" but "given your back pain and your 7am commute, here's what might work for you." The chatbot integrates both qualitative context (from the conversation) and quantitative context (from wearable data) to personalize every recommendation.
Health behavior change is emotionally charged. Many clients face anxieties about exercise, body image, past injuries, or identity conflicts. The chatbot must be warm, encouraging, and non-judgmental — never shaming, never pushing too hard.
GPTCoach implements the onboarding conversation of the Stanford Active Choices program — an evidence-based, clinically validated physical activity coaching program. The onboarding session covers introductions, past exercise experiences, barriers, motivations, and collaborative FITT-based goal setting (Frequency, Intensity, Time, Type).
The critical architectural decision: GPTCoach doesn't use a single system prompt. Instead, every user message passes through three sequential prompt chains, each calling a separate GPT agent:
GPTCoach's three-chain architecture. Each user message sequentially passes through the dialogue state, MI strategy, and tool call chains before generating a response.
Chain 1: Dialogue State Manager. The onboarding session is organized into a linear sequence of states: Onboarding, Program Introduction, Past Experience, Barriers, Motivation, Goal Setting, Advice, Goodbye. An external LLM agent classifies whether the current state's task is complete and advances to the next state when ready. Each state has its own prompt with specific subtasks and guidance from the Active Choices manual.
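The dialogue state manager described above can be pictured as a small linear state machine. The following is a minimal sketch, not the authors' implementation: the state names come from the text, while `DialogueStateManager`, `keyword_classifier`, and the rule of advancing at most one state per turn are assumptions made for illustration. In the real system the completion check is an LLM call, and here a toy keyword function stands in for it.

```python
from typing import Callable, List

# State names taken from the text, in the linear order listed there.
STATES: List[str] = [
    "Onboarding",
    "Program Introduction",
    "Past Experience",
    "Barriers",
    "Motivation",
    "Goal Setting",
    "Advice",
    "Goodbye",
]


class DialogueStateManager:
    """Walks a fixed list of states, advancing when a classifier says so."""

    def __init__(self, is_complete: Callable[[str, List[str]], bool]):
        # is_complete stands in for the external LLM classifier (hypothetical).
        self._is_complete = is_complete
        self._index = 0

    @property
    def state(self) -> str:
        return STATES[self._index]

    def update(self, transcript: List[str]) -> str:
        """Advance at most one state per turn; never move past the last state."""
        at_last = self._index == len(STATES) - 1
        if not at_last and self._is_complete(self.state, transcript):
            self._index += 1
        return self.state


def keyword_classifier(state: str, transcript: List[str]) -> bool:
    """Toy stand-in: a state counts as complete once the user says 'done'."""
    return bool(transcript) and "done" in transcript[-1].lower()


manager = DialogueStateManager(keyword_classifier)
print(manager.update(["Hi there"]))          # stays in the first state
print(manager.update(["Hi there", "done"]))  # advances one state
```

Keeping the classifier as an injected function makes the state logic testable without any model call, which is one plausible reason to separate state tracking from response generation.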
Chain 2: Motivational Interviewing. While Chain 1 manages what to talk about, Chain 2 manages how to say it. One agent selects an MI strategy from 11 options (drawn from the MISC coding scheme). A second agent generates the response conditioned on that strategy.
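The two-step structure of this chain (select a strategy, then generate under it) can be sketched as two functions that each take a model callable. This is an illustrative sketch under stated assumptions: `LLM`, `select_strategy`, `generate_response`, the prompt wording, and the fallback to "Question" for off-list answers are all hypothetical, and only the three strategy names mentioned in this article are included rather than the full set of 11.

```python
from typing import Callable, List, Sequence

# Only the three strategies named in this article; the full set of 11 is not
# listed here, so the tuple is deliberately incomplete.
STRATEGIES: Sequence[str] = ("Question", "Giving Information", "Affirm")

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out


def select_strategy(llm: LLM, history: List[str]) -> str:
    """Step 1: one agent names a strategy; fall back if the answer is off-list."""
    prompt = (
        "Pick one strategy from: " + ", ".join(STRATEGIES) + "\n"
        "Conversation so far:\n" + "\n".join(history)
    )
    choice = llm(prompt).strip()
    return choice if choice in STRATEGIES else "Question"  # assumed fallback


def generate_response(llm: LLM, history: List[str], strategy: str) -> str:
    """Step 2: a second agent replies, conditioned on the chosen strategy."""
    prompt = (
        f"Respond using the MI strategy '{strategy}'.\n"
        "Conversation so far:\n" + "\n".join(history)
    )
    return llm(prompt)


def fake_selector(prompt: str) -> str:
    """Toy stand-in for the strategy-selection model."""
    return "Affirm"


def fake_responder(prompt: str) -> str:
    """Toy stand-in that echoes the strategy so the conditioning is visible."""
    return "[" + prompt.split("'")[1] + "] That took real effort."


history = ["User: I walked twice this week even though work was rough."]
strategy = select_strategy(fake_selector, history)
print(generate_response(fake_responder, history, strategy))
```

Splitting selection from generation means the chosen strategy is an explicit, loggable value per turn, which is what makes a per-strategy breakdown of a conversation possible.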
Chain 3: Tool Call Prediction. If the response generation step didn't already call a tool, this chain asks whether the response should be augmented with health data. If yes, it forces a visualize() call to query and display the user's wearable data.
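The control flow of this third chain is a simple guard: do nothing if a tool was already called, otherwise ask a classifier and force a visualization if it agrees. The sketch below assumes a hypothetical `Draft` type and a toy keyword check (`mentions_activity_data`) in place of the LLM classifier; it illustrates the branching only, not how the real system issues the call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Draft:
    """A candidate reply plus the tool it called, if any (hypothetical type)."""
    text: str
    tool_call: Optional[str] = None  # "describe", "visualize", or None


def tool_call_chain(draft: Draft, wants_data: Callable[[str], bool]) -> Draft:
    """Force visualize only if no tool was called and the classifier says yes."""
    if draft.tool_call is not None:
        return draft  # the generation step already called a tool
    if wants_data(draft.text):
        return Draft(text=draft.text, tool_call="visualize")
    return draft


def mentions_activity_data(text: str) -> bool:
    """Toy stand-in for the LLM classifier: look for data-related words."""
    return any(word in text.lower() for word in ("steps", "heart rate", "workouts"))


print(tool_call_chain(Draft("Let's look at your steps last month."), mentions_activity_data))
print(tool_call_chain(Draft("What gets in the way of exercise for you?"), mentions_activity_data))
```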
Motivational interviewing (MI) is a counseling approach developed by William Miller and Stephen Rollnick in the 1980s. It helps people find their own internal motivation to change, rather than being told what to do by an authority figure. MI is widely regarded as a gold standard for health behavior change conversations.
MI practitioners use a core set of strategies known as OARS: open questions, affirmations, reflective listening, and summaries.
GPTCoach implements 11 strategies adapted from the Motivational Interviewing Skills Code (MISC). The MI chain picks one strategy per turn, then generates the response conditioned on it:
Distribution of MI strategies across all conversations. Question dominates (65.7%), followed by Giving Information (12.1%) and Affirm (5.2%).
The MITI (Motivational Interviewing Treatment Integrity) coding system classifies counselor behaviors as:
To demonstrate the value of the prompt chains, the team ran a counterfactual analysis. They took the first 5 turns of each real conversation, then simulated 10 different user responses about common activity barriers. Both GPTCoach and vanilla GPT-4 generated responses. The results were stark: vanilla GPT-4 frequently persuaded — giving unsolicited advice, telling users what they should do, and trying to solve problems the user hadn't asked it to solve.
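The setup of that counterfactual analysis amounts to pairing each truncated conversation with each simulated user message, then asking both systems to reply to the same context. A rough sketch follows, assuming each conversation is a list of turn strings; `build_counterfactuals`, `compare`, and the two lambda "systems" are hypothetical stand-ins, and the sample replies are invented for the demo rather than taken from the study.

```python
from typing import Callable, Dict, List

Responder = Callable[[List[str]], str]  # hypothetical: transcript in, reply out


def build_counterfactuals(
    conversations: List[List[str]], barriers: List[str], prefix_turns: int = 5
) -> List[List[str]]:
    """Pair each truncated conversation with each simulated barrier message."""
    return [
        conv[:prefix_turns] + [barrier]
        for conv in conversations
        for barrier in barriers
    ]


def compare(
    contexts: List[List[str]], systems: Dict[str, Responder]
) -> List[Dict[str, str]]:
    """Collect one reply per system for every shared context."""
    return [{name: respond(ctx) for name, respond in systems.items()} for ctx in contexts]


# Placeholder data: two 8-turn conversations and ten simulated barrier messages.
conversations = [[f"turn {i}" for i in range(8)] for _ in range(2)]
barriers = [f"simulated barrier {i}" for i in range(10)]
contexts = build_counterfactuals(conversations, barriers)

systems = {
    "chained": lambda ctx: "What makes that hard for you?",
    "baseline": lambda ctx: "You should try a morning run.",
}
results = compare(contexts, systems)
print(len(contexts), results[0])
```

Because both systems see identical contexts, any difference in the coded replies can be attributed to the prompting approach rather than to differences in what the user said.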
GPTCoach can query and visualize a user's health data mid-conversation through tool calling. This isn't a bolted-on feature — it's a fundamental part of how the system personalizes coaching.
The LLM has access to two functions it can call at any time:
describe(data_source, date, granularity): fetches summary statistics for a data source (step count, heart rate, workouts, etc.) at day/week/month granularity. Returns a natural language description to the model.
visualize(data_source, date, granularity): same as describe, but also renders an interactive chart in the chat for the user to see.
The data pipeline works with Apple HealthKit via an iOS app built on Stanford's Spezi framework. Three months of historical data are uploaded to Google Cloud Firestore. When GPT-4 decides to call a tool, the backend fetches data from Firestore, computes aggregated statistics, and returns them as text. For visualize, an interactive chart (bar chart for counts like steps, line chart for rates like heart rate) also appears in the user's chat window.
The tool calling flow. GPT-4 decides when to query health data based on conversational context.
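To make the "compute aggregated statistics and return them as text" step concrete, here is a minimal sketch of a describe-style function. It is not the GPTCoach backend: the in-memory `samples` dictionary replaces the Firestore fetch, the `start` parameter assumes the date marks the beginning of the window, and the window lengths in `WINDOW_DAYS` (including a 30-day month) and the choice of mean, min, and max are all assumptions for illustration.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Dict

WINDOW_DAYS = {"day": 1, "week": 7, "month": 30}  # assumed window lengths


def describe(
    samples: Dict[date, float], data_source: str, start: date, granularity: str
) -> str:
    """Summarize daily values in the window beginning at `start` as plain text."""
    days = WINDOW_DAYS[granularity]
    window = [start + timedelta(days=i) for i in range(days)]
    values = [samples[d] for d in window if d in samples]
    if not values:
        return f"No {data_source} data for the {granularity} starting {start.isoformat()}."
    return (
        f"{data_source} for the {granularity} starting {start.isoformat()}: "
        f"mean {mean(values):.0f}, min {min(values):.0f}, max {max(values):.0f} "
        f"over {len(values)} day(s)."
    )


# Made-up daily step counts for one week, purely for the demo.
steps = {date(2024, 3, 4) + timedelta(days=i): 4000 + 500 * i for i in range(7)}
print(describe(steps, "step_count", date(2024, 3, 4), "week"))
```

Returning a sentence rather than raw numbers matches the text's point that the model receives a natural language description; a visualize-style function would return the same text and additionally hand chart data to the client.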
GPTCoach made 59 tool calls across the 16 lab study conversations. Tool calls clustered in three dialogue states:
The team evaluated GPTCoach in a lab study with 16 participants. Each participant uploaded three months of HealthKit data, then had a one-hour coaching session with the chatbot (5 in-person, 11 remote via Zoom). Here's what happened.
The sample was deliberately diverse: ages 21-71 (mean 38.2), 10 female and 6 male, spanning all stages of change from precontemplation to maintenance, and activity levels from low to high. Most had basic AI knowledge; one was a novice, two were advanced.
The survey results were overwhelmingly positive:
User experience ratings (1-5 Likert scale). Bars show mean scores across 16 participants.
Participants recognized the facilitative approach: "It sort of met me where I was at. It first asked about contextual things before prescribing anything." They appreciated the personalization: "I really liked that it was accurate, that it was my personal thing and not just abstract pictures from the Internet."
Trust built incrementally. One participant started skeptical about back pain being taken seriously: "Usually systems don't take that into account, so I already don't have trust." But as GPTCoach acknowledged their concerns, trust grew over the conversation.
The average conversation lasted about 23 agent turns. GPTCoach spent the most turns on goal setting (24.5% of turns), followed by advice (15.4%) and past experience (14.8%). This mirrors how human coaches allocate time: more on understanding and planning than on introductions.
GPTCoach is a technology probe — not a finished product. The team is explicit about what worked, what didn't, and what needs more thought.
LLM coaches should complement human coaches, not replace them. The system excels at initial onboarding, data integration, and always being available. But it cannot replace the embodied, relational connection of a human coach. One expert put it memorably: "I don't think that my type of job, instructor wise, will ever be taken... Even though it will have all the information, it's not personal."
Some participants wanted more concrete plans: specific days, times, exercises. Others were happy with high-level guidance. This suggests the system needs to adapt its specificity to each user's needs — potentially by asking "Would you like me to be more specific?" or detecting cues from conversation context.
Participants inevitably raised topics outside physical activity — nutrition, weight loss, mental health, sleep. The Active Choices program treats these as related but out-of-scope. GPTCoach was designed to acknowledge them briefly and redirect, but the boundaries are fuzzy. A participant mentioning depression alongside inactivity needs careful handling.
The team deliberately limited the study to a single supervised session. LLM health coaching carries real risks: inaccurate medical advice, encouraging harmful exercise with injuries, creating emotional dependency, or reinforcing unhealthy behaviors. The paper argues for several safeguards:
Several participants formed surprisingly strong connections in a single session: "It made me really think about exercise and how positive it can be. When it brought up, who do you do this for? What motivates you? It really touched my heart a little bit." This is simultaneously a success (the system created genuine engagement) and a risk (what happens when the system fails or says something wrong?).
GPTCoach directly led to Bloom, the same team's follow-up system. Where GPTCoach was a single-session technology probe, Bloom is a full iOS app tested in a 4-week RCT (N=54). Bloom keeps the coaching chatbot ("Beebo") but adds non-conversational behavior change interactions: goal setting widgets, ambient displays, contextual push notifications, and action planning tools. The key insight from GPTCoach — that LLMs should augment established HCI interactions, not replace them — became Bloom's central design principle.
GPTCoach implements the onboarding conversation of Stanford's Active Choices, a clinically validated physical activity counseling program. Active Choices is grounded in the transtheoretical model (stages of change) and social cognitive theory. The full program includes follow-up contacts over 6+ months; GPTCoach tackles only the initial onboarding.
MI (Miller & Rollnick, 1983) is a person-centered counseling approach for health behavior change. GPTCoach's MI chain implements strategies from the MISC coding scheme and is evaluated using MITI Code 4 — the standard instrument for assessing MI fidelity in both human and automated counselors.
GPTCoach sits in a growing body of work applying LLMs to health: mental health counseling (Chiu et al., 2024), health inference tasks (health question answering), conversational agents for diagnosis, and commercial products like WHOOP Coach. What distinguishes GPTCoach is its grounding in an evidence-based program and its rigorous MI evaluation by trained human coders.
GPTCoach's wearable data integration is an early example of LLM tool use in a real-world health context. The model decides when to call describe() or visualize() based on conversational context, the same paradigm used in agent frameworks like LangChain, CrewAI, and OpenAI's Assistants API.