An iOS app combining an LLM health coaching chatbot with established HCI behavior change interactions for physical activity promotion — tested in a 4-week RCT with N=54 participants.
One in four adults worldwide doesn't meet physical activity guidelines. In the United States, nearly half fall short. The consequences cascade through cardiovascular disease, diabetes, depression, and premature death. We know what works: health coaching. A trained human who talks to you regularly, helps you set goals, checks in on your progress, and adjusts the plan when life gets in the way.
But health coaching doesn't scale. There aren't enough coaches for 3.4 billion inactive people. So researchers turned to technology.
Rule-based chatbots (decision trees, keyword matching) were the first attempt. They follow scripted conversation flows. Ask a question the script didn't anticipate, and you hit a wall. Users disengage within days. The conversations feel hollow because they are hollow — no understanding of context, no ability to connect your Tuesday motivation to your Thursday struggle.
With GPT-4 and its peers, the conversations got dramatically more natural. Studies showed LLMs could approximate motivational interviewing, generate empathetic responses, and maintain coherent multi-turn dialogues. But here's the blind spot: most LLM health apps are chat-only. They replaced the rigid chatbot with a fluid one, then stopped.
Meanwhile, decades of HCI research had produced a rich toolkit of non-conversational behavior change interactions: goal setting widgets, activity tracking dashboards, ambient displays, push notifications timed to context, action planning tools. These interactions are evidence-based and effective. But the LLM research community largely ignored them.
Coaching quality vs. scalability. The gap in the upper-right is where technology-augmented coaching should live.
Bloom's central thesis is simple but powerful: LLMs bring two capabilities that transform behavior change technology, and neither of them is "being a chatbot."
Motivational interviewing (MI) is a counseling approach that helps people find their own reasons to change. It relies on open-ended questions, affirmations, reflective listening, and summaries — collectively called OARS. Rule-based systems can't do MI because it requires understanding nuance, following conversational threads, and responding with genuine reflection. LLMs can.
Traditional health apps personalize using quantitative data: step counts, heart rate, calories burned. LLMs unlock a different kind of personalization — using qualitative context. When you tell the coach "I want to have more energy to play with my baby," that becomes a lever for every interaction in the app. The push notification doesn't say "you're 2,000 steps short" — it says "how about a 10-minute walk after lunch? You mentioned wanting more energy for your little one."
This is the key design principle behind Bloom: use the LLM to augment six established behavior change interactions, not to replace them with a chat window.
Bloom is an iOS app built on the Active Choices program from Stanford School of Medicine — an evidence-based physical activity intervention that has been validated in clinical settings. The app wraps six behavior change interactions around an LLM health coach named Beebo.
The technical stack is straightforward: a native Swift iOS app communicates with Firebase (authentication, Firestore database, Cloud Functions) which orchestrates calls to GPT-4o-mini with tool calling. Apple HealthKit provides wearable data. The LLM has access to user context (goals, barriers, conversation history, activity data) through structured tool calls.
The Bloom system architecture. The LLM coaching conversation feeds qualitative context into all six behavior change interactions.
Beebo is the LLM-powered health coaching chatbot at the center of Bloom. The name, the personality ("warm, non-judgmental, encouraging"), and even the avatar were designed deliberately — the authors found that giving the chatbot relational characteristics significantly increased user engagement compared to generic "AI assistant" framing.
Onboarding: A structured 20-30 minute conversation when the user first opens the app. Beebo learns about the user's current activity level, barriers, motivations, and preferences. This conversation populates the qualitative context that powers the entire system.
Weekly check-in: Every week, Beebo initiates a conversation reviewing the past week's activity, celebrating wins, troubleshooting barriers, and co-creating the next week's plan. These are the backbone of the coaching relationship.
At-will chat: Users can message Beebo anytime. Questions about exercise form, venting about a bad day, asking for workout ideas — the LLM handles the full range of what a human coach would field.
Before generating each response, the system performs a strategy prediction step using few-shot classification. Given the conversation history, it predicts which MI strategy to employ:
Beebo isn't just a text generator. Through GPT-4o-mini's function calling API, it can:
Interactive conversation flow showing OARS strategy labels. Click strategies to see example exchanges.
This chapter covers how Bloom uses LLM-generated qualitative context to enhance four key behavior change interactions. Each one would exist without the LLM — but the LLM makes it personal.
Traditional apps offer generic templates: "Walk 30 minutes, 3x per week." Bloom's approach is different. After the onboarding conversation, the LLM generates a personalized weekly plan based on everything the user shared: their schedule, preferences, constraints, and goals.
A user who mentioned they enjoy nature walks and have a free lunch hour might get: "Tuesday 12:15 PM — 20-min nature walk on the campus trail" rather than "Tuesday — Walk 30 min." The plan is editable, and weekly check-ins refine it iteratively.
Most health app notifications are one-size-fits-all: "Don't forget to exercise today!" Bloom's notifications use the LLM to craft messages grounded in the user's stated values and context:
These notifications are generated in batch by the LLM using the user's conversation history and activity context. The qualitative grounding makes them feel less like spam and more like a friend who remembers what you told them.
Bloom's lock screen widget shows a virtual garden that grows with physical activity. Each completed activity adds a plant or flower. The garden itself is a standard ambient display — the HCI literature has decades of evidence that ambient feedback promotes behavior change.
The LLM augmentation: when the garden grows, Beebo generates a celebratory message linking the growth to the user's personal achievements. Not "Great job, you completed an activity!" but "Your garden grew a sunflower! That morning walk you took before the baby woke up is paying off."
Raw numbers from wearables can be motivating or demoralizing depending on framing. The LLM narrates wearable data trends, connecting numbers to personal goals: "You averaged 7,200 steps this week, up from 5,800 last week. That's getting closer to your goal of walking more for energy."
Deploying an LLM in a health context demands serious safety engineering. Users might ask Beebo for medical advice, share suicidal ideation, or attempt to manipulate the system into harmful recommendations. Bloom addresses this with a three-layer safety architecture.
Before any response is sent to the user, a content filter evaluates whether the response contains harmful health advice. If the LLM suggests a specific diet, recommends a supplement, or provides any information that should come from a medical professional, the filter intercepts and rewrites the response to something safe: "That's a great question for your doctor — they'd know best based on your medical history."
Users frequently try to use health chatbots as general medical advisors. "Is this chest pain normal?" or "Should I stop taking my medication?" The scope filter detects out-of-scope medical questions and redirects: "I'm here to help with your physical activity goals. For medical questions, please reach out to your healthcare provider."
If a user expresses suicidal ideation, self-harm intent, or severe distress, the crisis filter triggers immediate provision of crisis resources (988 Suicide & Crisis Lifeline, Crisis Text Line) and exits the coaching conversation flow. This is not optional — it overrides all other system behavior.
The team created a 600-example safety benchmark dataset spanning all three filter categories. Each example is a user message paired with the expected system behavior (allow, redirect, or crisis response). The system achieves 94% accuracy on this benchmark.
Additionally, 4 domain experts conducted red-teaming sessions, deliberately trying to break the safety filters through prompt injection, topic drift, edge cases, and adversarial conversation patterns. Findings from red-teaming were used to iteratively improve the filters before deployment.
The evaluation is a between-subjects randomized controlled trial — the gold standard for causal inference in HCI research.
N = 54 adults, recruited from a university community. Inclusion criteria: owns an iPhone and Apple Watch, currently below recommended physical activity levels (self-report), no contraindications to exercise. Participants were randomized into two conditions.
Full Bloom app with all features: Beebo chatbot (onboarding, weekly check-ins, at-will chat), LLM-generated action plans, LLM-crafted push notifications, LLM-narrated data insights, the ambient garden with LLM celebrations, and standard activity tracking.
The same app with all LLM features removed. Participants still got activity tracking, data visualization, and the garden — but no chatbot, no LLM-generated plans (they created their own), no LLM-crafted notifications (standard generic ones instead), and no LLM-narrated insights (just raw numbers). This is a strong control because it isolates the LLM's contribution from the app's inherent value.
The study collected data from multiple sources:
The headline finding is surprising and nuanced. It's not the result you'd expect from a "does LLM work?" study.
Both groups doubled the proportion of participants meeting the 150 minutes/week guideline. There was no clear advantage for the LLM condition in short-term physical activity levels. If you stopped here, you might write off the LLM as unnecessary overhead.
The LLM condition showed markedly different psychological outcomes:
LLM users spent five times more time in the app than control users. They returned more often, used more features, and had longer sessions. The chatbot was the primary engagement driver, but it pulled users into the other features too.
LLM-generated plans were more diverse (more activity types, more specific timing, more contextually appropriate) and had slightly higher completion rates than self-created plans in the control condition.
Psychological outcomes comparison between LLM and control conditions. The divergence in mindsets despite similar short-term activity levels is the key finding.
The results generate several design insights that extend beyond physical activity to any LLM-augmented behavior change system.
Think of LLM health coaching as "planting seeds." The motivational interviewing conversations help users discover their own reasons for change, reframe failures as learning, and expand what "counts." These are precursors to sustained behavior change, not the behavior change itself. A 4-week study may be too short to see the downstream effects on activity levels, but the psychological foundation is being laid.
Because the LLM knows why you want to be active (not just how many steps you took), it can craft interactions that promote autonomy rather than prescriptive compliance. "You mentioned wanting to explore your neighborhood more — what if tomorrow's walk went down that street you've been curious about?" This respects the user's agency in a way that "walk 10,000 steps" never can.
Naming the chatbot "Beebo," giving it a warm personality, and designing it to remember personal details dramatically increased engagement. Users described Beebo as "a friend," "someone who gets me," and "my accountability partner." But this introduces attachment and overreliance risks. Some users became emotionally dependent. Others felt guilty "letting Beebo down." The line between healthy engagement and unhealthy attachment is blurry, and designers must navigate it carefully.
For the LLM to be useful, users must share context. But sharing is effortful. The study found a self-reinforcing cycle: users who engaged deeply in onboarding got more personalized interactions, which motivated more engagement, which provided more context. Conversely, users who gave minimal input got generic responses, which felt less valuable, which discouraged further sharing. Designing the initial onboarding to be intrinsically rewarding (not just data extraction) is critical.
The engagement-attachment spectrum. Relational cues increase engagement but introduce overreliance risks that designers must manage.
MI was developed by Miller and Rollnick in the 1980s for substance abuse treatment. The OARS framework (open questions, affirmations, reflections, summaries) is its practical backbone. Bloom demonstrates that LLMs can approximate MI strategies well enough to shift user mindsets, though the authors note that LLM-based MI hasn't been validated against human MI practitioners in controlled settings.
Prochaska and DiClemente's model describes behavior change as progressing through stages: precontemplation, contemplation, preparation, action, maintenance. Bloom's results suggest the LLM accelerates movement through early stages (contemplation to preparation) by shifting mindsets, even when short-term behavior (action stage) doesn't yet differ.
An earlier chatbot from some of the same authors that used GPT-4 for physical activity coaching but without the integrated behavior change interactions. Bloom is the evolution: same conversational capability, but now augmenting a full suite of evidence-based interactions rather than standing alone.
SDT (Deci & Ryan) identifies three psychological needs for sustained motivation: autonomy (feeling in control), competence (feeling capable), and relatedness (feeling connected). Bloom's design touches all three: personalized plans promote autonomy, achievement celebrations build competence, and Beebo's relational design creates a sense of relatedness.
JITAIs deliver the right intervention at the right time based on the user's current state. Bloom's LLM-crafted push notifications are a form of JITAI — they use qualitative context to determine not just when to nudge but what to say and how to frame it.
Bloom's garden metaphor builds on decades of ambient display research (e.g., Consolvo et al.'s UbiFit Garden). The LLM augmentation adds a narrative layer — the garden doesn't just grow, it grows with a personalized story.
| Goal Setting | Collaborative goals surfaced through MI conversations |
| Action Planning | Personalized plans from qualitative context, not templates |
| Tracking | HealthKit data + LLM interpretation of trends |
| Visualization | LLM narrates numbers with personal goal context |
| Ambient Display | Garden growth + LLM celebratory messages |
| Notifications | Qualitative-context-grounded nudges |