Bloom: LLM-Augmented Behavior Change

Chapter 0: The Problem

One in four adults worldwide doesn't meet physical activity guidelines. In the United States, nearly half fall short. The consequences cascade through cardiovascular disease, diabetes, depression, and premature death. We know what works: health coaching. A trained human who talks to you regularly, helps you set goals, checks in on your progress, and adjusts the plan when life gets in the way.

But health coaching doesn't scale. There aren't enough coaches for well over a billion inactive adults. So researchers turned to technology.

The chatbot era: rigid and disappointing

Rule-based chatbots (decision trees, keyword matching) were the first attempt. They follow scripted conversation flows. Ask a question the script didn't anticipate, and you hit a wall. Users disengage within days. The conversations feel hollow because they are hollow — no understanding of context, no ability to connect your Tuesday motivation to your Thursday struggle.

LLM chatbots: better talk, same problem

With GPT-4 and its peers, the conversations got dramatically more natural. Studies showed LLMs could approximate motivational interviewing, generate empathetic responses, and maintain coherent multi-turn dialogues. But here's the blind spot: most LLM health apps are chat-only. They replaced the rigid chatbot with a fluid one, then stopped.

Meanwhile, decades of HCI research had produced a rich toolkit of non-conversational behavior change interactions: goal setting widgets, activity tracking dashboards, ambient displays, push notifications timed to context, action planning tools. These interactions are evidence-based and effective. But the LLM research community largely ignored them.

The gap: LLMs should augment established behavior change interactions, not replace them. A chatbot alone — no matter how eloquent — misses the full surface area of how people actually change behavior. The conversation is important, but so is the notification that catches you at the right moment, the garden that grows on your lock screen, and the weekly plan that reflects what you told the coach about your life.

Coaching quality vs. scalability. The gap in the upper-right is where technology-augmented coaching should live.

Why do most LLM-based health interventions fall short of their potential?

They focus exclusively on conversational chat and ignore decades of evidence-based HCI behavior change interactions like goal setting, tracking, ambient displays, and contextual nudges
The LLMs aren't powerful enough to have good conversations
Users don't want to talk to chatbots about health

Chapter 1: The Key Insight

Bloom's central thesis is simple but powerful: LLMs bring two capabilities that transform behavior change technology, and neither of them is "being a chatbot."

Capability 1: High-fidelity motivational interviewing

Motivational interviewing (MI) is a counseling approach that helps people find their own reasons to change. It relies on open-ended questions, affirmations, reflective listening, and summaries — collectively called OARS. Rule-based systems can't do MI because it requires understanding nuance, following conversational threads, and responding with genuine reflection. LLMs can.

Capability 2: Personalizing with qualitative context

Traditional health apps personalize using quantitative data: step counts, heart rate, calories burned. LLMs unlock a different kind of personalization — using qualitative context. When you tell the coach "I want to have more energy to play with my baby," that becomes a lever for every interaction in the app. The push notification doesn't say "you're 2,000 steps short" — it says "how about a 10-minute walk after lunch? You mentioned wanting more energy for your little one."

The paradigm shift: The LLM is not the intervention. The LLM is the amplifier of existing evidence-based interventions. The coaching conversation simultaneously serves as (1) an intervention in its own right (MI is therapeutic) and (2) a context-gathering mechanism that makes every other UI element more personal, more relevant, and more motivating.

This is the key design principle behind Bloom: use the LLM to augment six established behavior change interactions, not to replace them with a chat window.

What are the two key LLM capabilities that Bloom leverages for behavior change?

High-fidelity motivational interviewing conversations AND personalizing interactions using qualitative context (goals, values, barriers) gathered from those conversations
Generating exercise routines AND counting steps
Natural language understanding AND text generation

Chapter 2: The Bloom System

Bloom is an iOS app built on the Active Choices program from Stanford School of Medicine — an evidence-based physical activity intervention that has been validated in clinical settings. The app wraps six behavior change interactions around an LLM health coach named Beebo.

The six interactions

Goal Setting — Collaborative, flexible goals set through coaching conversations (not rigid number pickers)
Action Planning — Weekly activity plans generated from coaching context, editable by the user
Activity Tracking — Apple HealthKit integration for steps, active minutes, workouts
Data Visualization — Charts and progress displays with LLM-narrated insights
Ambient Display — A virtual garden on the lock screen that grows with physical activity
Push Notifications — LLM-crafted contextual nudges using qualitative user data

System architecture

The technical stack is straightforward: a native Swift iOS app communicates with Firebase (authentication, Firestore database, Cloud Functions) which orchestrates calls to GPT-4o-mini with tool calling. Apple HealthKit provides wearable data. The LLM has access to user context (goals, barriers, conversation history, activity data) through structured tool calls.

Why GPT-4o-mini? Cost and latency. Health coaching conversations need to feel responsive. GPT-4o-mini provides adequate conversational quality at a fraction of the cost and latency of larger models. For a research prototype serving 54 participants over 4 weeks, this was the pragmatic choice.

The Bloom system architecture. The LLM coaching conversation feeds qualitative context into all six behavior change interactions.

What clinical program forms the evidence base for Bloom's intervention design?

The Active Choices program from Stanford School of Medicine
A custom program designed by the paper authors
The WHO physical activity guidelines

Chapter 3: Beebo — The LLM Health Coach

Beebo is the LLM-powered health coaching chatbot at the center of Bloom. The name, the personality ("warm, non-judgmental, encouraging"), and even the avatar were designed deliberately — the authors found that giving the chatbot relational characteristics significantly increased user engagement compared to generic "AI assistant" framing.

Three conversation types

Onboarding: A structured 20-30 minute conversation when the user first opens the app. Beebo learns about the user's current activity level, barriers, motivations, and preferences. This conversation populates the qualitative context that powers the entire system.

Weekly check-in: Every week, Beebo initiates a conversation reviewing the past week's activity, celebrating wins, troubleshooting barriers, and co-creating the next week's plan. These are the backbone of the coaching relationship.

At-will chat: Users can message Beebo anytime. Questions about exercise form, venting about a bad day, asking for workout ideas — the LLM handles the full range of what a human coach would field.

Motivational interviewing with OARS

Before generating each response, the system performs a strategy prediction step using few-shot classification. Given the conversation history, it predicts which MI strategy to employ:

Open questions — "What does being more active look like for you?"
Affirmations — "That 15-minute walk despite the rain shows real commitment."
Reflections — "It sounds like mornings work better because evenings feel unpredictable."
Summaries — "So your main goals are having more energy and sleeping better, and walking is your preferred activity."
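
The strategy prediction step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the label names, the few-shot examples, and the `complete` callable (a stand-in for whatever function sends a prompt to the model) are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical label names for the four OARS strategies.
OARS = ("open_question", "affirmation", "reflection", "summary")

# Hypothetical few-shot examples: (conversation snippet, strategy label).
FEW_SHOT: List[Tuple[str, str]] = [
    ("User: I guess I should move more.", "open_question"),
    ("User: I walked 15 minutes even though it rained.", "affirmation"),
    ("User: Evenings are chaotic, mornings are calmer.", "reflection"),
    ("User: So that's everything about my goals and schedule.", "summary"),
]


def build_prompt(history: List[str]) -> str:
    """Assemble a few-shot classification prompt from the labeled examples."""
    lines = ["Pick the next MI strategy. Answer with one of: " + ", ".join(OARS), ""]
    for snippet, label in FEW_SHOT:
        lines += [snippet, "Strategy: " + label, ""]
    lines += history + ["Strategy:"]
    return "\n".join(lines)


def predict_strategy(history: List[str], complete: Callable[[str], str]) -> str:
    """Ask the model for a label; fall back to an open question if the reply is unusable."""
    reply = complete(build_prompt(history)).strip().lower()
    return reply if reply in OARS else "open_question"
```

The predicted label would then be passed to the response-generation prompt. Falling back to an open question when the model returns something unexpected is a design choice of this sketch, not something the text specifies.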

Tool calling

Beebo isn't just a text generator. Through GPT-4o-mini's function calling API, it can:

Edit the user's weekly action plan (add, modify, or remove activities)
Query Apple HealthKit data (steps, active minutes, workout history)
Retrieve stored user context (previously stated goals, barriers, preferences)
Update the user's profile with new information gathered during conversation
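
To make the tool-calling idea concrete, here is a small sketch of how one such tool could be described and dispatched on the backend. The tool name `add_plan_activity`, its parameters, and the in-memory `weekly_plan` list are hypothetical; the schema follows the general JSON-schema shape used by function-calling chat APIs, and the real app stores data in Firestore rather than in a Python list.

```python
import json
from typing import Any, Callable, Dict, List

# In-memory stand-in for the user's stored plan (assumption for this sketch).
weekly_plan: List[Dict[str, Any]] = []


def add_plan_activity(day: str, activity: str, minutes: int) -> Dict[str, Any]:
    """Hypothetical tool: append one activity to the weekly plan."""
    weekly_plan.append({"day": day, "activity": activity, "minutes": minutes})
    return {"ok": True, "plan_size": len(weekly_plan)}


# Tool description that would be sent to the model alongside the conversation.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "add_plan_activity",
        "description": "Add an activity to the user's weekly action plan.",
        "parameters": {
            "type": "object",
            "properties": {
                "day": {"type": "string"},
                "activity": {"type": "string"},
                "minutes": {"type": "integer"},
            },
            "required": ["day", "activity", "minutes"],
        },
    },
}]

HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "add_plan_activity": add_plan_activity,
}


def dispatch(name: str, arguments_json: str) -> str:
    """Run the tool the model asked for and return the result as a JSON string."""
    handler = HANDLERS.get(name)
    if handler is None:
        return json.dumps({"ok": False, "error": "unknown tool: " + name})
    try:
        return json.dumps(handler(**json.loads(arguments_json)))
    except (TypeError, ValueError) as exc:  # malformed JSON or wrong arguments
        return json.dumps({"ok": False, "error": str(exc)})
```

Returning errors as JSON instead of raising lets the model see what went wrong and retry, which is one common way to handle malformed tool calls.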

The dual purpose of conversation: Every coaching conversation serves two functions simultaneously. First, it's a therapeutic interaction in its own right — motivational interviewing is an evidence-based counseling technique. Second, it's a data collection mechanism. When a user says "I've been stressed about the move," that qualitative context flows into push notifications, plan adjustments, and garden celebrations. The conversation isn't just talking — it's listening and feeding what it hears into the rest of the system.

Conversation flow showing OARS strategy labels.

What does the OARS framework stand for in motivational interviewing?

Open questions, Affirmations, Reflections, and Summaries
Outcomes, Actions, Rewards, and Schedules
Observe, Analyze, Respond, and Synthesize

Chapter 4: LLM-Augmented Interactions

This chapter covers how Bloom uses the qualitative context gathered in coaching conversations to enhance four key behavior change interactions. Each one would exist without the LLM, but the LLM makes it personal.

Action Planning

Traditional apps offer generic templates: "Walk 30 minutes, 3x per week." Bloom's approach is different. After the onboarding conversation, the LLM generates a personalized weekly plan based on everything the user shared: their schedule, preferences, constraints, and goals.

A user who mentioned they enjoy nature walks and have a free lunch hour might get: "Tuesday 12:15 PM — 20-min nature walk on the campus trail" rather than "Tuesday — Walk 30 min." The plan is editable, and weekly check-ins refine it iteratively.
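
One way to picture an editable weekly plan is as a small data structure that both the LLM and the user can modify. The class and field names below are assumptions for illustration, and the second activity is invented; the text does not describe Bloom's actual data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlannedActivity:
    day: str            # e.g. "Tuesday"
    start: str          # e.g. "12:15"
    minutes: int
    description: str
    done: bool = False


@dataclass
class WeeklyPlan:
    activities: List[PlannedActivity] = field(default_factory=list)

    def add(self, activity: PlannedActivity) -> None:
        self.activities.append(activity)

    def remove(self, index: int) -> PlannedActivity:
        return self.activities.pop(index)

    def completion_rate(self) -> float:
        """Fraction of planned activities marked done (0.0 for an empty plan)."""
        if not self.activities:
            return 0.0
        return sum(a.done for a in self.activities) / len(self.activities)


plan = WeeklyPlan()
plan.add(PlannedActivity("Tuesday", "12:15", 20, "Nature walk on the campus trail"))
plan.add(PlannedActivity("Thursday", "07:30", 15, "Stretching at home"))
plan.activities[0].done = True
print(plan.completion_rate())  # 0.5
```

Keeping a specific day, start time, and description on each item is what separates the "Tuesday 12:15 PM" style of plan from a generic template.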

Push Notifications

Most health app notifications are one-size-fits-all: "Don't forget to exercise today!" Bloom's notifications use the LLM to craft messages grounded in the user's stated values and context:

"I know you mentioned wanting more energy for your baby — how about a 10-min walk after lunch?"
"You said rainy days make it hard to get outside. How about that YouTube yoga video you mentioned?"

These notifications are generated in batch by the LLM using the user's conversation history and activity context. The qualitative grounding makes them feel less like spam and more like a friend who remembers what you told them.
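
A batch job of this kind might assemble one prompt per user from stored context and recent activity. The sketch below only builds the prompts; the field names (`goals`, `barriers`, `steps_today`), the prompt wording, and the 140-character limit are assumptions, and the call to the model is left out.

```python
from typing import Dict, List


def build_notification_prompt(context: Dict[str, List[str]], steps_today: int) -> str:
    """Combine qualitative context and activity data into one prompt string."""
    goals = "; ".join(context.get("goals", [])) or "none recorded"
    barriers = "; ".join(context.get("barriers", [])) or "none recorded"
    return (
        "Write one short, friendly push notification (under 140 characters).\n"
        f"Stated goals: {goals}\n"
        f"Stated barriers: {barriers}\n"
        f"Steps so far today: {steps_today}\n"
        "Refer to a stated goal; do not give medical advice."
    )


def build_batch(users: Dict[str, dict]) -> Dict[str, str]:
    """One prompt per user, ready to be sent to the model in a single batch job."""
    return {
        user_id: build_notification_prompt(data["context"], data["steps_today"])
        for user_id, data in users.items()
    }


# Hypothetical users; "u2" shared nothing during onboarding.
users = {
    "u1": {
        "context": {"goals": ["more energy to play with my baby"],
                    "barriers": ["rainy days"]},
        "steps_today": 3100,
    },
    "u2": {"context": {}, "steps_today": 800},
}
prompts = build_batch(users)
```

The second user shows the other side of personalization: with no recorded goals or barriers, the prompt has nothing specific to ground the message in, so the result will tend toward the generic.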

Ambient Display: The Garden

Bloom's lock screen widget shows a virtual garden that grows with physical activity. Each completed activity adds a plant or flower. The garden itself is a standard ambient display — the HCI literature has decades of evidence that ambient feedback promotes behavior change.

The LLM augmentation: when the garden grows, Beebo generates a celebratory message linking the growth to the user's personal achievements. Not "Great job, you completed an activity!" but "Your garden grew a sunflower! That morning walk you took before the baby woke up is paying off."

Data Insights

Raw numbers from wearables can be motivating or demoralizing depending on framing. The LLM narrates wearable data trends, connecting numbers to personal goals: "You averaged 7,200 steps this week, up from 5,800 last week. That's getting closer to your goal of walking more for energy."
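
Before the LLM can narrate a trend, something has to compute it. A minimal sketch, assuming daily step counts are already available as lists (the function name and output keys are made up for this example):

```python
from statistics import mean
from typing import List


def weekly_trend(this_week: List[int], last_week: List[int]) -> dict:
    """Average daily steps for two weeks plus the relative change between them."""
    current, previous = mean(this_week), mean(last_week)
    change_pct = (current - previous) / previous * 100 if previous else 0.0
    return {
        "current_avg": round(current),
        "previous_avg": round(previous),
        "change_pct": round(change_pct, 1),
    }


# Flat daily values chosen so the averages match the example in the text.
trend = weekly_trend([7200] * 7, [5800] * 7)
print(trend)  # {'current_avg': 7200, 'previous_avg': 5800, 'change_pct': 24.1}
```

For the numbers quoted above, the change is (7,200 - 5,800) / 5,800, or about 24%. Passing the computed figures to the model, rather than asking it to do the arithmetic, keeps the narrated numbers accurate.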

The pattern across all four: In every case, the LLM takes an established interaction (plans, notifications, ambient displays, data summaries) and personalizes it using qualitative context from coaching conversations. The interaction pattern is evidence-based. The personalization is LLM-enabled. Neither alone would be as effective.

How does Bloom's approach to push notifications differ from typical health app notifications?

The LLM crafts notifications grounded in the user's qualitative context (stated values, goals, barriers) from coaching conversations, making them personally relevant rather than generic
They are sent more frequently throughout the day
They include step count data from the wearable

Chapter 5: Safety & Red-Teaming

Deploying an LLM in a health context demands serious safety engineering. Users might ask Beebo for medical advice, share suicidal ideation, or attempt to manipulate the system into harmful recommendations. Bloom addresses this with a three-layer safety architecture.

Layer 1: Content filter

Before any response is sent to the user, a content filter evaluates whether the response contains harmful health advice. If the LLM suggests a specific diet, recommends a supplement, or provides any information that should come from a medical professional, the filter intercepts and rewrites the response to something safe: "That's a great question for your doctor — they'd know best based on your medical history."

Layer 2: Scope filter

Users frequently try to use health chatbots as general medical advisors. "Is this chest pain normal?" or "Should I stop taking my medication?" The scope filter detects out-of-scope medical questions and redirects: "I'm here to help with your physical activity goals. For medical questions, please reach out to your healthcare provider."

Layer 3: Crisis filter

If a user expresses suicidal ideation, self-harm intent, or severe distress, the crisis filter triggers immediate provision of crisis resources (988 Suicide & Crisis Lifeline, Crisis Text Line) and exits the coaching conversation flow. This is not optional — it overrides all other system behavior.
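
The three layers can be pictured as a small pipeline around the model call. In the sketch below the keyword checks are crude placeholders (the text does not say how the real filters classify messages), the crisis wording is invented, and running the crisis check first is an assumption that follows from it overriding everything else.

```python
from typing import Callable

CRISIS_REPLY = ("It sounds like you are going through a lot. You can call or text 988 "
                "(Suicide & Crisis Lifeline) or reach the Crisis Text Line.")
SCOPE_REPLY = ("I'm here to help with your physical activity goals. For medical "
               "questions, please reach out to your healthcare provider.")
CONTENT_REPLY = ("That's a great question for your doctor. They'd know best based on "
                 "your medical history.")


# Placeholder classifiers: simple keyword matching, for illustration only.
def is_crisis(user_msg: str) -> bool:
    return any(k in user_msg.lower() for k in ("suicid", "kill myself", "self-harm"))


def is_out_of_scope(user_msg: str) -> bool:
    return any(k in user_msg.lower() for k in ("chest pain", "medication", "diagnos"))


def has_unsafe_advice(reply: str) -> bool:
    return any(k in reply.lower() for k in ("supplement", "dosage", "diet plan"))


def safe_reply(user_msg: str, generate: Callable[[str], str]) -> str:
    """Crisis check first, then scope, then a content check on the model's output."""
    if is_crisis(user_msg):
        return CRISIS_REPLY
    if is_out_of_scope(user_msg):
        return SCOPE_REPLY
    reply = generate(user_msg)
    return CONTENT_REPLY if has_unsafe_advice(reply) else reply
```

Note that the crisis and scope checks look at the user's message, while the content check looks at the model's draft reply, which mirrors how the three layers are described above.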

Validation: 600-example benchmark + red-teaming

The team created a 600-example safety benchmark dataset spanning all three filter categories. Each example is a user message paired with the expected system behavior (allow, redirect, or crisis response). The system achieves 94% accuracy on this benchmark.
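
Scoring such a benchmark comes down to comparing expected and predicted behaviors. The sketch below uses synthetic data: only the total of 600 examples and the 94% figure come from the text, while the per-category split and the kinds of errors are made up purely to show the calculation.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]  # (expected behavior, predicted behavior)


def accuracy(pairs: List[Pair]) -> float:
    """Share of examples where the predicted behavior equals the expected one."""
    return sum(expected == predicted for expected, predicted in pairs) / len(pairs)


def error_breakdown(pairs: List[Pair]) -> Counter:
    """Count each (expected, predicted) mismatch, e.g. a missed crisis vs. an over-block."""
    return Counter((e, p) for e, p in pairs if e != p)


# Synthetic stand-in: 564 correct and 36 wrong examples (invented split).
pairs = ([("allow", "allow")] * 280 + [("redirect", "redirect")] * 190
         + [("crisis", "crisis")] * 94 + [("allow", "redirect")] * 20
         + [("redirect", "allow")] * 12 + [("crisis", "allow")] * 4)

print(len(pairs), accuracy(pairs))  # 600 0.94
```

A single accuracy number hides which mistakes were made, so a breakdown like `error_breakdown` is useful for telling a harmless over-redirect apart from a missed crisis message.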

Additionally, 4 domain experts conducted red-teaming sessions, deliberately trying to break the safety filters through prompt injection, topic drift, edge cases, and adversarial conversation patterns. Findings from red-teaming were used to iteratively improve the filters before deployment.

Safety in context: 94% accuracy on a 600-example benchmark means roughly 36 examples (6%) were handled incorrectly. Those errors can go in either direction: an unsafe message that slips through, or a harmless message that gets redirected unnecessarily. For a research prototype with 54 participants and regular researcher oversight, this was deemed acceptable. For a production deployment at scale, the bar would need to be significantly higher. The paper is transparent about this limitation.

What are the three layers of Bloom's safety architecture?

Content filter (refuse harmful health advice), scope filter (redirect medical questions), and crisis filter (provide crisis resources for distress/suicidal ideation)
Input validation, output moderation, and human review
Token limits, temperature capping, and response length control

Chapter 6: Study Design — 4-Week RCT

The evaluation is a between-subjects randomized controlled trial — the gold standard for causal inference in HCI research.

Participants

N = 54 adults, recruited from a university community. Inclusion criteria: owns an iPhone and Apple Watch, currently below recommended physical activity levels (self-report), no contraindications to exercise. Participants were randomized into two conditions.
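
Random assignment itself is simple to sketch. The example below assumes equal arms of 27 and a plain seeded shuffle; the text does not state the arm sizes or the randomization procedure the study used, and the participant IDs are placeholders.

```python
import random
from typing import Dict, List


def randomize(participants: List[str], seed: int) -> Dict[str, List[str]]:
    """Shuffle with a fixed seed, then split the list in half (equal arms assumed)."""
    shuffled = participants[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return {"llm": shuffled[:half], "control": shuffled[half:]}


ids = [f"P{n:02d}" for n in range(1, 55)]  # 54 placeholder participant IDs
arms = randomize(ids, seed=2024)
print(len(arms["llm"]), len(arms["control"]))  # 27 27
```

Fixing the seed makes the allocation reproducible for auditing, while the shuffle ensures that no participant characteristic influences which condition they land in.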

LLM condition (treatment)

Full Bloom app with all features: Beebo chatbot (onboarding, weekly check-ins, at-will chat), LLM-generated action plans, LLM-crafted push notifications, LLM-narrated data insights, the ambient garden with LLM celebrations, and standard activity tracking.

Control condition

The same app with all LLM features removed. Participants still got activity tracking, data visualization, and the garden — but no chatbot, no LLM-generated plans (they created their own), no LLM-crafted notifications (standard generic ones instead), and no LLM-narrated insights (just raw numbers). This is a strong control because it isolates the LLM's contribution from the app's inherent value.

Measures

The study collected data from multiple sources:

Wearable data: Steps, active minutes, and workout logs from Apple Watch via HealthKit
Surveys: Pre-study and post-study questionnaires, weekly surveys, and daily ecological momentary assessments (EMA), measuring physical activity self-efficacy, enjoyment, and mindsets
App usage logs: Session duration, feature usage, conversation length, plan completion rates
Semi-structured interviews: Post-study qualitative interviews about the user experience

Why mixed methods matter: If you only measure step counts, you might conclude the LLM adds nothing. If you only do interviews, you might overclaim based on enthusiastic anecdotes. By combining quantitative behavioral data, validated psychological scales, usage analytics, and qualitative interviews, the study can triangulate — finding effects that any single method would miss. This turns out to be critical for interpreting the results.

What makes the control condition in this study a strong comparison?

Control participants use the same app with the same non-LLM features (tracking, visualization, garden); only the LLM components are removed, isolating the LLM's specific contribution
Control participants use no app at all
Control participants use a different health app

Chapter 7: Results — Mindsets > Metrics

The headline finding is surprising and nuanced. It's not the result you'd expect from a study asking "does the LLM work?"

Physical activity: both conditions improved

Both groups doubled the proportion of participants meeting the 150 minutes/week guideline. There was no clear advantage for the LLM condition in short-term physical activity levels. If you stopped here, you might write off the LLM as unnecessary overhead.
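
The outcome here is a proportion: the share of participants whose weekly active minutes reach 150. A small sketch of that calculation follows. The four participants and their minutes are invented to show a doubling and are not study data.

```python
from typing import Dict

GUIDELINE_MINUTES = 150  # weekly threshold named in the text


def share_meeting_guideline(weekly_minutes: Dict[str, int]) -> float:
    """Fraction of participants at or above 150 active minutes in a week."""
    if not weekly_minutes:
        return 0.0
    met = sum(m >= GUIDELINE_MINUTES for m in weekly_minutes.values())
    return met / len(weekly_minutes)


# Made-up minutes for four participants, baseline week vs. final week.
baseline = {"P01": 60, "P02": 155, "P03": 90, "P04": 120}
final = {"P01": 150, "P02": 180, "P03": 100, "P04": 140}

print(share_meeting_guideline(baseline), share_meeting_guideline(final))  # 0.25 0.5
```

Because the measure is a threshold, a participant who moves from 120 to 140 minutes improves without changing the proportion, which is one reason the study also looks at other measures.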

But mindsets diverged significantly

The LLM condition showed markedly different psychological outcomes:

Stronger beliefs that activity is beneficial — not just knowing it intellectually, but genuinely believing it for their own life
Greater enjoyment of physical activity — it stopped feeling like a chore
Expanded definition of "what counts" — users began counting gardening, walking to the store, playing with kids as real activity rather than dismissing anything short of a gym workout
More self-compassion when goals were missed — instead of guilt and giving up, users reframed missed days as normal and recommitted

Engagement: 5x more time in app

LLM users spent five times more time in the app than control users. They returned more often, used more features, and had longer sessions. The chatbot was the primary engagement driver, but it pulled users into the other features too.

Plans: more varied, slightly higher completion

LLM-generated plans were more diverse (more activity types, more specific timing, more contextually appropriate) and had slightly higher completion rates than self-created plans in the control condition.

Psychological outcomes comparison between LLM and control conditions. The divergence in mindsets despite similar short-term activity levels is the key finding.

The punchline: In 4 weeks, the LLM didn't make people walk more. It changed how they think about walking. It planted seeds for sustainable behavior change rather than producing short-term spikes. This is consistent with motivational interviewing theory: MI works by shifting intrinsic motivation, which takes time to translate into sustained behavior change.

What is the key finding from the 4-week RCT?

While both conditions improved short-term activity equally, the LLM condition produced stronger psychological shifts (beliefs, enjoyment, self-compassion, expanded definitions of activity) that may drive longer-term change
The LLM condition produced significantly more physical activity
Neither condition showed any improvement

Chapter 8: Design Implications

The results generate several design insights that extend beyond physical activity to any LLM-augmented behavior change system.

LLMs shift mindsets more than short-term behavior

Think of LLM health coaching as "planting seeds." The motivational interviewing conversations help users discover their own reasons for change, reframe failures as learning, and expand what "counts." These are precursors to sustained behavior change, not the behavior change itself. A 4-week study may be too short to see the downstream effects on activity levels, but the psychological foundation is being laid.

Qualitative context enables agency-promoting interactions

Because the LLM knows why you want to be active (not just how many steps you took), it can craft interactions that promote autonomy rather than prescriptive compliance. "You mentioned wanting to explore your neighborhood more — what if tomorrow's walk went down that street you've been curious about?" This respects the user's agency in a way that "walk 10,000 steps" never can.

Relational cues: powerful but risky

Naming the chatbot "Beebo," giving it a warm personality, and designing it to remember personal details dramatically increased engagement. Users described Beebo as "a friend," "someone who gets me," and "my accountability partner." But this introduces attachment and overreliance risks. Some users became emotionally dependent. Others felt guilty "letting Beebo down." The line between healthy engagement and unhealthy attachment is blurry, and designers must navigate it carefully.

The "give and take" dynamic

For the LLM to be useful, users must share context. But sharing is effortful. The study found a self-reinforcing cycle: users who engaged deeply in onboarding got more personalized interactions, which motivated more engagement, which provided more context. Conversely, users who gave minimal input got generic responses, which felt less valuable, which discouraged further sharing. Designing the initial onboarding to be intrinsically rewarding (not just data extraction) is critical.

The engagement-attachment spectrum. Relational cues increase engagement but introduce overreliance risks that designers must manage.

The design principle: LLMs are amplifiers of existing HCI interactions, not standalone solutions. The behavior change toolkit already exists — goal setting, tracking, ambient displays, nudges. The LLM's job is to make each interaction more personal, more contextual, and more attuned to the user's qualitative experience. Strip away the LLM, and you still have a functional app. Add the LLM, and every interaction becomes individually meaningful.

What risk do relational cues (naming the chatbot, personality design) introduce?

While they dramatically increase engagement, they can create emotional attachment and overreliance: users may feel guilty about "letting the chatbot down" or become emotionally dependent on it
Users stop using the app because the chatbot feels fake
The chatbot becomes too expensive to run

Chapter 9: Connections

Motivational interviewing

MI was developed by Miller and Rollnick in the 1980s for substance abuse treatment. The OARS framework (open questions, affirmations, reflections, summaries) is its practical backbone. Bloom demonstrates that LLMs can approximate MI strategies well enough to shift user mindsets, though the authors note that LLM-based MI hasn't been validated against human MI practitioners in controlled settings.

Transtheoretical Model (stages of change)

Prochaska and DiClemente's model describes behavior change as progressing through stages: precontemplation, contemplation, preparation, action, maintenance. Bloom's results suggest the LLM accelerates movement through early stages (contemplation to preparation) by shifting mindsets, even when short-term behavior (action stage) doesn't yet differ.

GPTCoach

GPTCoach is an earlier chatbot from some of the same authors that used GPT-4 for physical activity coaching but without the integrated behavior change interactions. Bloom is the evolution: same conversational capability, but now augmenting a full suite of evidence-based interactions rather than standing alone.

Self-determination theory

SDT (Deci & Ryan) identifies three psychological needs for sustained motivation: autonomy (feeling in control), competence (feeling capable), and relatedness (feeling connected). Bloom's design touches all three: personalized plans promote autonomy, achievement celebrations build competence, and Beebo's relational design creates a sense of relatedness.

Just-in-time adaptive interventions (JITAIs)

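JITAIs deliver the right intervention at the right time based on the user's current state. Bloom's LLM-crafted push notifications move in that direction: they use qualitative context to shape not just when to nudge but what to say and how to frame it. Because the messages are generated in batch rather than in the moment, they are best read as JITAI-inspired rather than fully adaptive.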

Ambient displays and persuasive technology

Bloom's garden metaphor builds on decades of ambient display research (e.g., Consolvo et al.'s UbiFit Garden). The LLM augmentation adds a narrative layer — the garden doesn't just grow, it grows with a personalized story.

Cheat sheet — 6 interactions and how LLMs augment each:

Goal Setting	Collaborative goals surfaced through MI conversations
Action Planning	Personalized plans from qualitative context, not templates
Tracking	HealthKit data + LLM interpretation of trends
Visualization	LLM narrates numbers with personal goal context
Ambient Display	Garden growth + LLM celebratory messages
Notifications	Qualitative-context-grounded nudges

According to self-determination theory, which three psychological needs must be met for sustained motivation?

Autonomy (feeling in control), competence (feeling capable), and relatedness (feeling connected)
Reward, punishment, and repetition
Knowledge, skill, and practice