AI/GenAI Scrum Master

Chapter 0: What an AI Scrum Master IS

It is Monday morning. Your ML team's two-week sprint just ended. The results: one experiment hit target accuracy but needs 4x the GPU budget to serve in production. Another experiment failed completely — the model overfit to training data and generalizes poorly. A third experiment is "promising" but the researcher wants two more weeks to try a different architecture. The product manager is asking why the chatbot feature isn't ready for the demo on Friday. The VP wants a "timeline to ship."

You are the person who makes sense of this chaos. Not by writing code. Not by training models. But by creating the environment where a team of brilliant, sometimes chaotic researchers and engineers can do their best work — and ship it.

This is not traditional Scrum. Traditional Scrum assumes you can break work into user stories, estimate them in story points, and deliver a predictable increment every two weeks. AI projects violate every one of those assumptions. Experiments fail 80% of the time. "Done" is a spectrum of accuracy, not a checkbox. Training runs take days, not hours. And the most important work often looks like someone staring at a loss curve for three hours.

The AI Scrum Master is a new kind of role that sits at the intersection of technical program management, research facilitation, and organizational translation. You don't need to be able to write a transformer from scratch. But you need to understand enough to know when an experiment is stuck, when a researcher is chasing a dead end, and when a "90% accurate" model is actually terrible for the use case.

Responsibility	What you own	Daily intersection
Sprint Facilitation	Experiment-based planning, hypothesis-driven tickets, adaptive velocity	Every standup, every planning session, every retro
Experiment Governance	Kill/continue decisions, resource allocation, experiment tracking visibility	Researchers need GPUs, stakeholders need progress reports
Data Pipeline Coordination	Data readiness, annotation quality, labeling vendor management	Data is the #1 blocker — you unblock it
Research-to-Prod Bridge	Handoff checklists, technical debt tracking, ML engineering coordination	The notebook-to-production gap kills more AI projects than bad models
Stakeholder Translation	Uncertainty communication, timeline management, expectation setting	Executives don't understand why "the model isn't ready yet"

The experiment-driven vs feature-driven tension. In traditional software, you plan features and ship them. In AI, you plan experiments and learn from them. An experiment that proves an approach doesn't work is a SUCCESS — it saved you months of building the wrong thing. Your job is to create a system where the team can fail fast, learn fast, and converge on what works. The moment you treat AI work like feature work, you lose.

A Day in the Life

Here's what a typical Wednesday looks like for an AI Scrum Master at a GenAI startup:

Time	Activity	Skills used
9:00 AM	Check overnight training runs: one converged, one diverged (NaN loss at epoch 47)	Experiment tracking, debugging
9:30 AM	Standup: researcher reports eval regression after data refresh, ML engineer blocked on GPU quota	Sprint facilitation, resource management
10:00 AM	Triage the eval regression: new labeling batch has quality issues, escalate to vendor	Data pipeline management
11:00 AM	Sprint planning prep: write hypothesis-driven tickets for next sprint's RAG experiments	Experiment-based planning
12:00 PM	Stakeholder sync: explain why the chatbot needs another sprint — it hallucinates on edge cases	Stakeholder communication
2:00 PM	Review handoff checklist: model is ready for production but needs latency optimization	Research-to-prod bridge
3:00 PM	Safety review: new agent can now execute code — needs guardrails before deployment	Risk management, responsible AI
4:00 PM	Retro: discuss why the last sprint's velocity was half the forecast (GPU shortage + scope creep)	Process improvement

The Experiment Funnel

The five stages below show the flow of work in an AI team. Unlike a traditional software pipeline, where features flow left to right with predictable completion, experiments flow through a funnel where most are killed. This is healthy.

1. Hypothesis Formation

Researcher proposes: "Fine-tuning Llama-3 on our domain data will improve accuracy from 72% to 85% on our eval set." This is the ticket. Not "build chatbot."

↓

2. Experiment Design

Define: dataset, model, hyperparameters, eval metrics, success threshold, compute budget, time box. All BEFORE training starts.

↓

3. Execution

Training runs, prompt engineering iterations, RAG pipeline experiments. Daily check-ins on loss curves and early metrics.

↓

4. Evaluation

Run eval suite. Compare to baseline. Check for regressions on existing capabilities. Document results in experiment log.

↓

5. Decision Gate

KILL (didn't meet threshold), ITERATE (promising, needs refinement), or PROMOTE (ready for production pipeline).
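
The decision gate can be sketched as a small function. This is a minimal illustration, not a prescribed implementation: the two-threshold scheme (a success threshold and a lower kill threshold, with ITERATE in between) and the names `Decision` and `decision_gate` are assumptions introduced here, and the metric is assumed to be higher-is-better.

```python
from enum import Enum


class Decision(Enum):
    KILL = "kill"
    ITERATE = "iterate"
    PROMOTE = "promote"


def decision_gate(metric: float, success_threshold: float, kill_threshold: float) -> Decision:
    """Map an eval metric (higher is better) to a gate decision.

    Assumption: anything between the kill and success thresholds counts
    as "promising" and goes back for another iteration.
    """
    if kill_threshold > success_threshold:
        raise ValueError("kill_threshold must not exceed success_threshold")
    if metric >= success_threshold:
        return Decision.PROMOTE
    if metric < kill_threshold:
        return Decision.KILL
    return Decision.ITERATE


if __name__ == "__main__":
    # Hypothetical scores checked against an 85% target and a 78% kill line.
    for score in (0.70, 0.81, 0.86):
        decision = decision_gate(score, success_threshold=0.85, kill_threshold=0.78)
        print(f"{score:.2f} -> {decision.value}")
```

Writing both thresholds down before training starts is what keeps the gate from turning into a negotiation after the results are in.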

Interview Dimensions

Staff-level interviews test you across five dimensions. Each chapter in this lesson maps to one or more:

Dimension	What they ask	Chapters
CONCEPT	"Explain why story points don't work for ML research"	All
DESIGN	"Design a sprint process for a team building an LLM-powered product"	0, 1, 2, 8, 11
CODE	"Show me the Jira board / experiment tracker / eval dashboard you'd set up"	2, 4, 5, 6, 8
DEBUG	"Your team hasn't shipped anything in 3 sprints. Diagnose the problem."	3, 7, 10
FRONTIER	"How will AI-assisted project management change this role?"	All

Your AI team just completed a two-week sprint. Two out of three experiments failed to meet their accuracy thresholds. The product manager is frustrated because "nothing shipped." What is the correct framing of this sprint's outcome?

The sprint failed: the team needs to work harder next sprint
The sprint succeeded: eliminating two dead-end approaches is valuable progress that narrows the solution space and redirects resources to the one promising direction
The sprint was average: one out of three is a 33% hit rate
The researchers should have predicted which experiment would work

Chapter 1: AI Project Lifecycle

A product manager shows you a roadmap: "Q1: build the model. Q2: integrate it. Q3: launch." You know immediately this will fail. AI projects don't work in sequential phases. The lifecycle is a loop, not a line — and the most important skill is knowing when to exit the loop.

The AI project lifecycle looks like this: research, prototype, evaluate, iterate, and — only when evaluation meets production criteria — ship. But here's what makes it different from traditional software: you might loop through research-prototype-evaluate fifteen times before anything is production-ready. And three of those loops might end in "this approach fundamentally doesn't work, start over."

The Explore/Exploit Tradeoff

Every sprint, you face a decision borrowed from reinforcement learning: explore (try new approaches, architectures, datasets) or exploit (optimize what's already working). Early in a project, you should be roughly 80 to 90% explore. As you approach a deadline, the ratio flips to roughly 80 to 95% exploit.

The Scrum Master's job is to manage this ratio explicitly. If the team is exploring too late, they'll never ship. If they're exploiting too early, they'll ship a mediocre model because they never found the right approach.

Phase	Explore:Exploit	Sprint focus	Your role
Discovery (0-4 weeks)	90:10	Wide search — try 5 approaches, kill 4	Protect research time, resist pressure to "pick one and go"
Convergence (4-8 weeks)	50:50	2-3 approaches competing, deeper experiments	Set decision gates, track eval metrics, prepare kill criteria
Optimization (8-12 weeks)	10:90	One approach, hyperparameter tuning, edge cases	Track diminishing returns, push for production readiness
Productionization (12-16 weeks)	5:95	Latency, reliability, monitoring, deployment	Coordinate ML eng + infra, handoff checklists, launch planning
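
The table above can be turned into a simple lookup. This is a sketch under one stated assumption: the phase boundaries are treated as half-open intervals (week 4 belongs to Convergence, week 8 to Optimization, and so on), which the table itself does not specify. The names `PhasePlan` and `plan_for_week` are hypothetical.

```python
from typing import NamedTuple


class PhasePlan(NamedTuple):
    name: str
    explore_pct: int
    exploit_pct: int


# (first week after the phase ends, plan), copied from the phase table.
_PHASES = [
    (4, PhasePlan("Discovery", 90, 10)),
    (8, PhasePlan("Convergence", 50, 50)),
    (12, PhasePlan("Optimization", 10, 90)),
    (16, PhasePlan("Productionization", 5, 95)),
]


def plan_for_week(week: float) -> PhasePlan:
    """Return the recommended phase and explore/exploit split for a project week."""
    if week < 0:
        raise ValueError("week must be non-negative")
    for end_week, plan in _PHASES:
        if week < end_week:
            return plan
    # Assumption: past week 16 the project stays in productionization.
    return _PHASES[-1][1]


if __name__ == "__main__":
    for week in (1, 6, 10, 14):
        plan = plan_for_week(week)
        print(f"week {week}: {plan.name} ({plan.explore_pct}:{plan.exploit_pct})")
```

A lookup like this is mostly useful as a conversation starter in sprint planning: if the team's actual split is far from the recommended one, that gap is worth discussing.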

When to kill an experiment. This is the hardest decision in AI project management. Kill too early and you miss breakthroughs. Kill too late and you waste months. Use three criteria: (1) Has it consumed more than 2x its budgeted compute? Kill it. (2) Has it been stuck at the same accuracy for 3+ training runs with different hyperparameters? Kill it. (3) Is the gap between current and target accuracy larger than what any known technique could close? Kill it. Document every kill decision — it's organizational learning.

CONCEPT: Why Waterfall Fails for AI

Traditional waterfall assumes you can specify requirements upfront, build to those requirements, and test against them. AI violates all three assumptions:

yaml
# Traditional software: requirements are deterministic
traditional_software:
  requirement: "When user clicks 'Submit', save form data to database"
  test: "Assert DB contains the submitted data after click"
  estimate: "3 story points (1-2 days)"

# AI project: requirements are probabilistic
ai_project:
  requirement: "Chatbot answers customer questions accurately"
  test: "What does 'accurately' mean? 90%? 95%? On which questions?
         Measured how? By whom? What's the baseline?"
  estimate: "Unknown. Could be 2 weeks or 6 months depending on
             data quality, model choice, and what 'accurate' means."

DESIGN: The AI Project Canvas

Before any sprint planning, create an AI Project Canvas — a one-page document that aligns the team on what success looks like:

markdown
# AI Project Canvas: Customer Support Chatbot

## Business Objective
Reduce ticket volume by 30% by deflecting common questions to AI.

## Success Metrics (ordered by priority)
1. Deflection rate: % of conversations resolved without human handoff
2. Customer satisfaction: CSAT score >= 4.2/5 on AI-handled conversations
3. Accuracy: Correct answer rate >= 92% on eval set (500 questions)
4. Latency: p95 response time < 2 seconds

## Data Availability
- 50K historical support tickets (labeled by category, resolution)
- 2K manually curated Q&A pairs for eval
- Real-time access to product docs (RAG source)
- GAP: No labeled "bad answer" examples for safety eval

## Technical Constraints
- Budget: $5K/month inference cost (rules out GPT-4 at scale)
- Latency: Must respond in < 2s (rules out chain-of-thought with 5 LLM calls)
- Privacy: No customer PII sent to external APIs (rules out OpenAI for EU customers)
- Infra: Kubernetes cluster with 4x A100 GPUs available

## Known Risks
- Hallucination on product-specific questions (mitigation: RAG + citations)
- Data drift as product changes (mitigation: weekly eval re-runs)
- Regulatory review needed before launch (blocker: 3-week legal review)

## Time Box
- Discovery: 3 sprints (6 weeks)
- Ship MVP: Sprint 5 (week 10)
- Kill criteria: If accuracy < 80% after sprint 3, pivot approach

CODE: Project Lifecycle Tracker

python
# ai_lifecycle_tracker.py — Track project phase and explore/exploit ratio
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Optional, Tuple

class Phase(Enum):
    DISCOVERY = "discovery"
    CONVERGENCE = "convergence"
    OPTIMIZATION = "optimization"
    PRODUCTION = "production"

@dataclass
class Experiment:
    hypothesis: str
    status: str = "active"        # active | killed | promoted
    start_date: Optional[datetime] = None
    compute_budget_hrs: float = 0
    compute_used_hrs: float = 0
    best_metric: float = 0.0      # best eval score so far, as a fraction (0.84 = 84%)
    metric_history: List[float] = field(default_factory=list)

    def should_kill(self, target_metric: float) -> Tuple[bool, str]:
        """Apply the three kill rules in order; returns (kill?, reason)."""
        # Rule 1: Over budget
        if self.compute_used_hrs > 2 * self.compute_budget_hrs:
            return True, "2x compute budget exceeded"
        # Rule 2: Stuck — last 3 runs within 1% of each other
        if len(self.metric_history) >= 3:
            recent = self.metric_history[-3:]
            if max(recent) - min(recent) < 0.01:
                return True, "Metric plateau (3 runs within 1%)"
        # Rule 3: Gap too large
        if self.best_metric > 0 and (target_metric - self.best_metric) > 0.15:
            return True, f"Gap to target ({target_metric - self.best_metric:.1%}) too large"
        return False, "Continue"

DEBUG: When the Lifecycle Stalls

Signs your project is stuck in the wrong phase:

Symptom	Likely cause	Fix
Still exploring after 8 weeks	No kill criteria defined — team can't commit	Set a hard decision gate: "By sprint 5, we pick one approach"
Optimizing a 75% model to 77%	Wrong phase — should still be exploring	Step back. Is 77% good enough? If not, a different architecture might get 90%
Never shipping "because we can improve it"	Perfectionism / fear of production failures	Ship with guardrails. A 90% model with good fallbacks beats a 95% model that never ships
Researcher working alone for 3 weeks	No visibility into experiment progress	Daily experiment stand-ups. "What did you try? What did you learn? What's next?"

FRONTIER: AI-Assisted Lifecycle Management

The frontier is using AI to manage AI projects. Tools emerging now:

Auto-experiment scheduling: Platforms such as Determined AI and Vertex AI can schedule hyperparameter searches and track the results, which makes it easier to surface the Pareto-optimal runs (best accuracy vs. compute cost). The Scrum Master reviews the dashboard instead of asking researchers for updates.

LLM-powered retrospectives: Feed your sprint's experiment logs, Slack discussions, and PR descriptions into an LLM. Ask it: "What patterns do you see in our failed experiments? What should we try next?" The LLM becomes a research advisor that remembers everything.

Your team is in week 10 of a 16-week AI project. The best model so far achieves 84% accuracy against a 90% target. The researcher proposes trying a completely different architecture (a vision transformer instead of a CNN). What should you do?

Approve it: exploration is always valuable
Reject it: you should be exploiting, not exploring at week 10
Time-box it to one sprint with a clear kill criterion: if the ViT doesn't beat 84% within one sprint, kill it and optimize the CNN
Ask the PM to extend the timeline

Chapter 2: Sprint Planning for AI Teams

The team lead says: "This story is 5 points." You ask what the story is. "Fine-tune the model on the new dataset." You ask: how long will training take? "Depends on the data." What accuracy do you expect? "We'll know when we try." Will it work? "Maybe." This is not estimable in story points. And pretending it is creates false confidence that wrecks your sprint velocity and demoralizes the team.

Story points don't work for research. Story points assume you can estimate relative complexity of tasks that you've done before. But in AI, most experiments are novel. You've never fine-tuned this model on this data with these hyperparameters before. The uncertainty is fundamental, not an estimation failure.

CONCEPT: Hypothesis-Driven Tickets

Replace user stories with experiment tickets. Each ticket is a testable hypothesis with a clear accept/reject criterion:

markdown
# Traditional User Story (DOESN'T WORK for AI)
## Story: Improve chatbot accuracy
As a customer, I want the chatbot to answer my questions correctly
so that I don't have to wait for a human agent.
Acceptance: Chatbot answers questions correctly.
Points: 8

# Hypothesis-Driven Experiment Ticket (USE THIS)
## EXP-042: Fine-tune Llama-3-8B on support corpus
**Hypothesis:** Fine-tuning Llama-3-8B on our 50K support ticket
corpus will improve answer accuracy from 72% (baseline: zero-shot)
to >= 85% on the 500-question eval set.

**Method:**
- Dataset: 50K tickets, 80/10/10 train/val/test split
- Model: Llama-3-8B, QLoRA (rank 16, alpha 32)
- Training: 3 epochs, lr=2e-4, batch_size=4, gradient_accumulation=8
- Eval: accuracy, F1, latency on test set

**Compute budget:** 24 GPU-hours (1x A100, ~1 day)
**Time box:** 1 sprint (2 weeks including eval and documentation)

**Success criteria:**
- >= 85% accuracy on eval set (primary)
- Latency < 500ms per response (secondary)
- No regression on safety eval (blocking)

**Kill criteria:**
- Accuracy < 78% after full training = KILL
- Training loss doesn't decrease after epoch 1 = KILL (data issue)

**Outcome:** [TO BE FILLED AFTER EXPERIMENT]

"We don't know if this will work" is a valid estimate. The honesty is the value. When a researcher says "I think there's a 40% chance this works," that's actionable information. You can plan around it: run two experiments in parallel, have a fallback, or adjust stakeholder expectations. The worst thing you can do is force a researcher to commit to an outcome they can't guarantee.

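The arithmetic behind "run two experiments in parallel" can be sketched as follows. The key assumption, which often does not hold exactly in practice, is that the experiments succeed or fail independently of each other; the function name `p_at_least_one_success` is hypothetical.

```python
from typing import Iterable


def p_at_least_one_success(success_probs: Iterable[float]) -> float:
    """Probability that at least one of several experiments succeeds.

    Assumes the experiments are independent: P(any) = 1 - product(1 - p_i).
    """
    p_all_fail = 1.0
    for p in success_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
        p_all_fail *= 1.0 - p
    return 1.0 - p_all_fail


if __name__ == "__main__":
    # Two parallel experiments, each estimated by the researcher at 40%.
    print(f"{p_at_least_one_success([0.4, 0.4]):.0%}")
```

With two independent 40% experiments, the chance that at least one works is 1 - 0.6 × 0.6 = 64%. Experiments that share a dataset or a base model are correlated, so treat the result as an upper bound rather than a forecast.
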
DESIGN: The AI Sprint Board

A traditional sprint board has: To Do, In Progress, Done. For AI teams, you need more columns that reflect the experiment lifecycle:

Column	What lives here	Exit criteria
Backlog	Hypotheses not yet prioritized	Team agrees to run it this sprint
Hypothesis	Experiment designed, compute budget approved	Dataset ready, baseline measured
Experiment	Training/running in progress	Run complete, results logged
Evaluation	Results being analyzed against success criteria	Kill/iterate/promote decision made
Iteration	Promising experiment being refined	Meets success criteria or killed
Production	Model promoted to production pipeline	Deployed, monitored, signed off
Killed	Experiments that didn't meet threshold	Documented with learnings

CODE: Jira / Linear Configuration

yaml
# jira_ai_sprint_config.yaml
# Custom issue types for AI teams

issue_types:
  - name: Experiment
    icon: flask
    fields:
      - hypothesis: text          # What we're testing
      - method: text              # How we'll test it
      - baseline_metric: number   # Current best
      - target_metric: number     # What we need
      - compute_budget: text      # "24 GPU-hours"
      - time_box: text            # "1 sprint"
      - kill_criteria: text       # When to stop
      - outcome: select           # killed | promoted | iterating
      - final_metric: number      # What we achieved
      - learnings: text           # What we learned (REQUIRED on close)
      - wandb_link: url           # Link to experiment tracking

  - name: Data Task
    icon: database
    fields:
      - data_source: text
      - volume: text              # "10K examples"
      - quality_gate: text        # "Inter-annotator agreement > 0.8"
      - blocking_experiments: link # Which experiments need this data

  - name: ML Engineering
    icon: gear
    fields:
      - type: select              # infra | optimization | deployment
      - model_artifact: text      # Which model version
      - latency_target: text
      - cost_target: text

workflow:
  Experiment:
    - Backlog -> Hypothesis       # Prioritized in sprint planning
    - Hypothesis -> Experiment    # Dataset ready, baseline set
    - Experiment -> Evaluation    # Training complete
    - Evaluation -> Killed        # Didn't meet criteria
    - Evaluation -> Iteration     # Promising, needs refinement
    - Evaluation -> Production    # Meets all criteria
    - Iteration -> Evaluation     # Re-evaluate after refinement

# Sprint velocity is measured in EXPERIMENTS COMPLETED, not story points.
# A "completed" experiment is one with a kill/promote decision + documented learnings.
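
The velocity definition in the comments above can be computed directly from ticket data. This is a minimal sketch: the `ExperimentTicket` class and its field names mirror the `outcome` and `learnings` fields of the config but are hypothetical, and are not tied to the Jira or Linear API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ExperimentTicket:
    key: str
    outcome: Optional[str] = None   # "killed" | "promoted" | "iterating" | None
    learnings: str = ""             # free-text learnings field


def is_completed(ticket: ExperimentTicket) -> bool:
    """Completed = a kill/promote decision was made AND learnings are documented."""
    has_decision = ticket.outcome in {"killed", "promoted"}
    has_learnings = bool(ticket.learnings.strip())
    return has_decision and has_learnings


def experiment_velocity(tickets: Iterable[ExperimentTicket]) -> int:
    """Number of completed experiments in a sprint."""
    return sum(1 for t in tickets if is_completed(t))


if __name__ == "__main__":
    sprint = [
        ExperimentTicket("EXP-040", "killed", "Label noise caps accuracy near 75%"),
        ExperimentTicket("EXP-041", "promoted", "QLoRA matches full fine-tune"),
        ExperimentTicket("EXP-042", "iterating", "Promising, needs more data"),
        ExperimentTicket("EXP-043", "killed", ""),  # decision made, nothing documented
    ]
    print(experiment_velocity(sprint))
```

Note that a killed experiment without documented learnings does not count, which gives the team a concrete reason to write the learnings down.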

DESIGN: Sprint Ceremonies Adapted for AI

Ceremony	Traditional	AI Adaptation
Sprint Planning	Estimate stories, commit to scope	Prioritize experiments by expected information gain. Commit to running N experiments, not to outcomes.
Daily Standup	"What did you do? What will you do? Blockers?"	"What did you learn? What's your next experiment? Are you stuck on data/compute/clarity?"
Sprint Review	Demo features to stakeholders	Share experiment results. Show eval dashboards. Explain what we learned, not just what we built.
Retrospective	Process improvement	Process improvement + EXPERIMENT REVIEW: Which kills were good calls? Which should we have killed earlier? Are our hypotheses getting better?

DEBUG: When Sprint Planning Breaks

Symptom	Root cause	Fix
Velocity is wildly inconsistent	Measuring in story points on uncertain work	Switch to experiment throughput: completed experiments per sprint
Team always "almost done"	No time boxes on experiments	Every experiment gets a hard time box. At the deadline: kill, iterate, or promote.
Planning takes 4 hours	Trying to fully specify experiments upfront	Specify hypothesis + success criteria only. Method details emerge during execution.
Researchers skip planning	They see it as bureaucracy that slows research	Make planning about THEIR priorities. "What do YOU want to try? What do you need from the team?"

FRONTIER: AI-Assisted Sprint Planning

LLM-powered ticket generation: Feed your experiment log to an LLM: "Given these 20 completed experiments and their outcomes, suggest the top 5 most promising next experiments ranked by expected information gain." The LLM can spot patterns humans miss — like the fact that all successful experiments used a learning rate below 1e-4.

Sprint Capacity Planner (interactive widget): allocates the team's sprint capacity across experiment types. Default allocation: Exploration 40%, Optimization 30%, Data Work 20%, ML Engineering 10%.

A researcher estimates an experiment at "3 story points." When you ask what could go wrong, they say: "The training might diverge, the dataset might have label noise, or the model might overfit. Any of those would require starting over with a different approach." What's the right response?

Add buffer points: make it 8 story points to account for risk
Replace the story point estimate with a hypothesis ticket: time-box to one sprint, define kill criteria for each failure mode, and measure sprint velocity by experiments completed rather than points delivered
Ask the researcher to break it into smaller stories
Accept the 3-point estimate and track it normally

Chapter 3: Managing Non-Determinism

Your QA engineer runs the chatbot test suite. Monday: 91% pass rate. Tuesday, same code, same model, same test suite: 88% pass rate. Wednesday: 93%. Nothing changed. The model is non-deterministic — given the same input, it can produce different outputs depending on sampling temperature, random seeds, and floating-point arithmetic. Welcome to the world where "it works on my machine" is literally true only that one time.

Non-determinism is the defining challenge that separates AI project management from traditional software. In traditional software, a test either passes or fails. In AI, a test passes 91% of the time, and your job is to decide if that's good enough.
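
One way to decide whether a fluctuating pass rate is "good enough" is to look at a confidence interval over repeated runs instead of any single run. The sketch below uses only the standard library and a small table of two-sided 95% Student's t critical values; the function names are hypothetical, and the rule "the lower bound must clear the threshold" is one reasonable convention rather than the only one.

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def confidence_interval_95(scores: Sequence[float]) -> Tuple[float, float]:
    """95% t-interval for the mean score across eval runs (3 to 10 runs)."""
    df = len(scores) - 1
    if df not in T_CRIT_95:
        raise ValueError("this sketch supports between 3 and 10 runs")
    mean = statistics.mean(scores)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    half_width = T_CRIT_95[df] * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - half_width, mean + half_width


def reliably_passes(scores: Sequence[float], threshold: float) -> bool:
    """True only if the interval's lower bound clears the threshold."""
    lower, _ = confidence_interval_95(scores)
    return lower >= threshold


if __name__ == "__main__":
    # The three pass rates from the scenario above plus two hypothetical runs.
    runs = [0.91, 0.88, 0.93, 0.90, 0.89]
    low, high = confidence_interval_95(runs)
    print(f"95% CI: [{low:.3f}, {high:.3f}], passes 0.90: {reliably_passes(runs, 0.90)}")
```

With these illustrative numbers the mean sits just above 90%, but the interval extends well below it, so the model is not reliably passing a 90% threshold.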

CONCEPT: Defining "Done" on a Spectrum

In traditional Scrum, "done" is binary: the feature works or it doesn't. In AI, "done" is a set of thresholds across multiple dimensions:

yaml
# Definition of Done for AI features

model_quality:
  accuracy: ">= 90% on eval set (500 examples)"
  precision: ">= 88% (false positives are costly)"
  recall: ">= 85% (false negatives are acceptable)"
  latency_p95: "< 500ms"
  consistency: "Variance < 3% across 5 eval runs with different seeds"

safety:
  toxicity: "< 0.1% toxic outputs on safety eval (1000 adversarial prompts)"
  hallucination: "< 5% hallucinated facts on factual eval"
  pii_leakage: "0% PII in outputs"

regression:
  no_regression: "All metrics within 2% of previous production model"
  backward_compat: "Existing integrations produce equivalent outputs"

operational:
  monitoring: "Eval metrics dashboarded and alerting configured"
  rollback: "Can revert to previous model version in < 5 minutes"
  documentation: "Model card completed with known limitations"

Accuracy is a spectrum, not a checkbox. "The model is 87% accurate" means nothing without context. 87% accuracy on what dataset? Measured how? With what confidence interval? Compared to what baseline? Your job is to make these questions automatic. Every experiment ticket specifies exactly what "good enough" means BEFORE the experiment starts.
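
A Definition of Done like the one above can be checked mechanically once the thresholds are written down. The sketch below covers a subset of the listed thresholds; the metric keys (for example `latency_p95_ms` and `toxicity_rate`) and the function `check_done` are hypothetical names, and the measured values in the example are made up for illustration.

```python
import operator
from typing import Dict, List

# (metric, comparator, threshold): a subset of the thresholds listed above.
DEFINITION_OF_DONE = [
    ("accuracy", operator.ge, 0.90),
    ("precision", operator.ge, 0.88),
    ("recall", operator.ge, 0.85),
    ("latency_p95_ms", operator.lt, 500),
    ("toxicity_rate", operator.lt, 0.001),
    ("hallucination_rate", operator.lt, 0.05),
]


def check_done(measured: Dict[str, float]) -> List[str]:
    """Return the failed criteria; an empty list means the feature is 'done'."""
    failures = []
    for name, compare, threshold in DEFINITION_OF_DONE:
        if name not in measured:
            # An unmeasured metric is treated as a failure, not as a pass.
            failures.append(f"{name}: not measured")
        elif not compare(measured[name], threshold):
            failures.append(f"{name}: {measured[name]} vs threshold {threshold}")
    return failures


if __name__ == "__main__":
    # Hypothetical measurements for one candidate model.
    candidate = {
        "accuracy": 0.92, "precision": 0.89, "recall": 0.83,
        "latency_p95_ms": 430, "toxicity_rate": 0.0004,
    }
    for failure in check_done(candidate):
        print("FAIL", failure)
```

Treating a missing measurement as a failure is a deliberate choice here: it prevents a model from being called done simply because nobody ran the safety eval.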

DESIGN: Eval-Driven Acceptance Criteria

Replace the traditional "acceptance criteria" (manual QA checkboxes) with eval-driven acceptance:

1. Define Eval Suite

Before the sprint starts, create a frozen eval set. Never train on it. This is your ground truth.

↓

2. Set Thresholds

For each metric, define pass/fail thresholds. These go in the experiment ticket as success criteria.

↓

3. Automate Eval

Run eval suite automatically after every training run. Results go to dashboard. No manual testing.

↓

4. Statistical Significance

Run eval 5x with different seeds. Improvement must be statistically significant (p < 0.05), not just "higher this one time."

↓

5. Regression Check

Compare against production model on the SAME eval set. Flag any regression > 2% on any metric.
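
The regression check in step 5 can be sketched as a comparison of two metric dictionaries. Two assumptions are made here that the steps above leave open: "2%" is read as two absolute percentage points, and every metric is a fraction in [0, 1] where higher is better. The function name `find_regressions` is hypothetical.

```python
from typing import Dict


def find_regressions(candidate: Dict[str, float],
                     production: Dict[str, float],
                     tolerance: float = 0.02) -> Dict[str, float]:
    """Return metrics where the candidate trails production by more than tolerance.

    Both models must have been scored on the SAME eval set for the
    comparison to mean anything.
    """
    regressions = {}
    for name, prod_value in production.items():
        if name not in candidate:
            raise KeyError(f"candidate is missing metric '{name}'")
        drop = prod_value - candidate[name]
        if drop > tolerance:
            regressions[name] = drop
    return regressions


if __name__ == "__main__":
    # Hypothetical scores: better on accuracy, worse on recall.
    production = {"accuracy": 0.90, "precision": 0.89, "recall": 0.88}
    candidate = {"accuracy": 0.93, "precision": 0.88, "recall": 0.84}
    for metric, drop in find_regressions(candidate, production).items():
        print(f"REGRESSION {metric}: down {drop:.1%}")
```

Iterating over the production model's metrics, rather than the candidate's, ensures that a metric quietly dropped from the new eval run raises an error instead of passing unnoticed.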

CODE: Eval Pipeline Script

python
# eval_pipeline.py — Automated eval with statistical testing
import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class EvalResult:
    metric_name: str
    scores: List[float]       # Multiple runs with different seeds
    threshold: float
    baseline_scores: List[float]

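    # Assumes a higher-is-better metric (accuracy, F1). For lower-is-better
    # metrics such as latency, negate the scores and threshold before use.
    @property
    def mean(self) -> float: return float(np.mean(self.scores))

    @property
    def std(self) -> float:
        # Sample standard deviation (ddof=1); needs at least 2 runs
        return float(np.std(self.scores, ddof=1))

    @property
    def passes_threshold(self) -> bool: return self.mean >= self.threshold

    def _welch_t_test(self):
        # Welch's t-test (unequal variances): new scores vs. baseline scores
        return stats.ttest_ind(self.scores, self.baseline_scores, equal_var=False)

    @property
    def is_significant_improvement(self) -> bool:
        # Is the new model significantly better?
        t_stat, p_val = self._welch_t_test()
        return bool(p_val < 0.05 and t_stat > 0)

    @property
    def has_regression(self) -> bool:
        # Is the new model significantly WORSE?
        t_stat, p_val = self._welch_t_test()
        return bool(p_val < 0.05 and t_stat < 0)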

def run_eval_gate(results: List[EvalResult]) -> Dict:
    """Returns go/no-go decision for production promotion."""
    report = {"pass": True, "details": []}
    for r in results:
        detail = {
            "metric": r.metric_name,
            "mean": f"{r.mean:.3f} +/- {r.std:.3f}",
            "threshold": r.threshold,
            "passes": r.passes_threshold,
            "significant": r.is_significant_improvement,
            "regression": r.has_regression
        }
        if not r.passes_threshold or r.has_regression:
            report["pass"] = False
        report["details"].append(detail)
    return report

DEBUG: Regression Detection

The scariest moment in AI development: the new model is better on your target metric but worse on something you didn't check. This is silent regression.

Regression type	How to detect	Prevention
Metric regression	Eval suite catches it	Run FULL eval suite, not just target metric
Distribution shift	Model degrades on a subpopulation	Slice eval by category (e.g., test per language, per topic)
Latency regression	Bigger model = slower inference	Include latency in eval gate criteria
Safety regression	New model generates toxic content	Safety eval is a BLOCKING gate, not optional
Behavioral regression	Model answers differently but "correctly" — breaks downstream	Golden test set: 50 hand-picked examples that must match exactly
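
Slicing the eval, as suggested for distribution shift in the table above, can be sketched in a few lines. The example record format (a dict with a slice field and a boolean `correct` field) and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def accuracy_by_slice(examples: Iterable[dict], slice_key: str) -> Dict[str, float]:
    """Accuracy per slice; each example needs `slice_key` and a boolean 'correct'."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for ex in examples:
        totals[ex[slice_key]] += 1
        hits[ex[slice_key]] += int(ex["correct"])
    return {name: hits[name] / totals[name] for name in totals}


def weak_slices(examples: Iterable[dict], slice_key: str, threshold: float) -> List[str]:
    """Names of slices whose accuracy falls below the threshold, sorted."""
    per_slice = accuracy_by_slice(examples, slice_key)
    return sorted(name for name, acc in per_slice.items() if acc < threshold)


if __name__ == "__main__":
    # Hypothetical eval results: strong on English, weak on German.
    results = (
        [{"language": "en", "correct": i < 9} for i in range(10)]
        + [{"language": "de", "correct": i < 6} for i in range(10)]
    )
    print(accuracy_by_slice(results, "language"))
    print(weak_slices(results, "language", threshold=0.8))
```

In this made-up example the overall accuracy is 75%, which hides the fact that one slice is at 90% and the other at 60%. That gap is exactly what an aggregate metric cannot show.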

FRONTIER: Continuous Eval

Eval-as-CI: The frontier is treating model evaluation like continuous integration. Every commit to the training pipeline triggers an eval run. Results are reported as PR checks. A model can't be merged to production if eval scores drop. Tools like Evidently AI and Giskard are building this into standard MLOps workflows.

Your model scores 91% accuracy on Monday and 88% on Tuesday, with no code changes. The pass threshold is 90%. A stakeholder says "it passed Monday, ship it." What is the correct response?

Ship it: it passed once, that's good enough
Wait for Wednesday's result to break the tie
Run the eval 5 times with different seeds and compute the mean and a 95% confidence interval. If the interval's lower bound is above 90%, ship. Otherwise, the model isn't reliably passing.
Lower the threshold to 88%

Chapter 4: Data Pipeline Management

It is sprint planning. The ML engineer says: "I can start the experiment as soon as the labeled data is ready." The data lead says: "We sent 5,000 examples to the labeling vendor last week. They said 7-10 business days." The sprint is 10 business days. The experiment needs labeled data by day 3 to have time for training and evaluation. The math doesn't work. And nobody realized it until just now.

Data is the #1 blocker for AI teams. Not compute, not model architecture, not engineering talent. Data. Specifically: getting enough high-quality, correctly labeled data to the right team at the right time. Your job as AI Scrum Master is to make data readiness visible and plan around it.

CONCEPT: The Data Supply Chain

Think of data like a supply chain in manufacturing. You need raw materials (unlabeled data), processing (annotation), quality control (validation), and delivery (versioned datasets). Any disruption at any stage blocks everything downstream.

Stage	Lead time	Common blockers	Your mitigation
Collection	1-4 weeks	Legal approval for scraping, API rate limits, privacy review	Start collection 2 sprints before experiments need it
Annotation	1-3 weeks	Vendor capacity, unclear guidelines, low agreement	Write annotation guides BEFORE sending to vendors
Validation	2-5 days	Quality issues requiring re-labeling	Spot-check first 100 labels before approving full batch
Versioning	1 day	No versioning = "which dataset did you train on?"	DVC or similar tool, version every dataset change

Plan data 2 sprints ahead. If your experiments need new labeled data, you must initiate the data pipeline at least 2 sprints before the experiment sprint. This means your sprint planning needs a "data lookahead" section: "What data will sprint N+2 need, and what do we need to start NOW to have it ready?"
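
The lookahead arithmetic can be sketched with two small helpers. Assumptions: lead times are counted in business days, a sprint is 10 business days as in the scenario that opens this chapter, and "sent last week" is read as 5 business days ago. Both function names are hypothetical.

```python
import math


def sprints_of_lead_time(lead_time_days: int, sprint_length_days: int = 10) -> int:
    """Whole sprints of head start needed for data with this lead time."""
    if lead_time_days < 0 or sprint_length_days <= 0:
        raise ValueError("lead time must be >= 0 and sprint length > 0")
    return math.ceil(lead_time_days / sprint_length_days)


def arrives_in_time(lead_time_days: int, needed_by_day: int,
                    days_already_elapsed: int = 0) -> bool:
    """True if data ordered `days_already_elapsed` days ago lands by `needed_by_day`."""
    remaining = lead_time_days - days_already_elapsed
    return remaining <= needed_by_day


if __name__ == "__main__":
    # Worst case from the supply chain table: 3 weeks annotation + 5 days validation.
    print(sprints_of_lead_time(15 + 5))
    # Opening scenario: vendor quoted 7-10 days, data needed by day 3.
    print(arrives_in_time(7, needed_by_day=3, days_already_elapsed=5))
    print(arrives_in_time(10, needed_by_day=3, days_already_elapsed=5))
```

Under these assumptions, worst-case annotation plus validation takes 20 business days, which is two full sprints. In the opening scenario only the vendor's best-case estimate lands by day 3, so a plan built on the worst case would have flagged the problem before sprint planning.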

DESIGN: Data Readiness Board

Add a parallel track to your sprint board specifically for data:

yaml
# data_readiness_board.yaml

columns:
  - name: "Data Needed"
    description: "Experiment requires data that doesn't exist yet"
    cards_include: [experiment_id, data_type, volume, deadline]

  - name: "Collection"
    description: "Raw data being gathered"
    cards_include: [source, method, legal_approval, eta]

  - name: "Annotation"
    description: "Data sent to labeling vendor/team"
    cards_include: [vendor, volume, guidelines_link, eta, cost]

  - name: "QA"
    description: "Labeled data being validated"
    cards_include: [sample_checked, agreement_score, issues_found]

  - name: "Ready"
    description: "Versioned, validated, available in data store"
    cards_include: [version, location, row_count, quality_score]

# Key metric: Data Readiness Rate
# = (experiments with data ready on time) / (total experiments planned)
# Target: >= 80%. Below 60% = systemic planning failure.
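
The Data Readiness Rate defined in the comments above is simple to compute from sprint records. In this sketch the record format (a dict with a boolean `data_ready_on_time` field) and the label for the 60 to 80% band are assumptions; the 80% and 60% cut-offs come from the board definition.

```python
from typing import Dict, List


def data_readiness_rate(experiments: List[Dict]) -> float:
    """(experiments with data ready on time) / (total experiments planned)."""
    if not experiments:
        raise ValueError("no experiments planned")
    on_time = sum(1 for e in experiments if e["data_ready_on_time"])
    return on_time / len(experiments)


def readiness_status(rate: float) -> str:
    """Classify the rate against the 80% target and the 60% floor."""
    if rate >= 0.80:
        return "on target"
    if rate < 0.60:
        return "systemic planning failure"
    return "below target"  # label for the middle band is an assumption


if __name__ == "__main__":
    # Hypothetical sprint: 5 experiments planned, 3 had data on time.
    planned = [
        {"id": "EXP-050", "data_ready_on_time": True},
        {"id": "EXP-051", "data_ready_on_time": True},
        {"id": "EXP-052", "data_ready_on_time": False},
        {"id": "EXP-053", "data_ready_on_time": True},
        {"id": "EXP-054", "data_ready_on_time": False},
    ]
    rate = data_readiness_rate(planned)
    print(f"{rate:.0%}: {readiness_status(rate)}")
```

Tracking this number sprint over sprint makes the data lookahead measurable: if it stays below target, the lookahead horizon is too short.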

CODE: Data Quality Monitoring

python
# data_quality_monitor.py — Track annotation quality in real time
from collections import Counter
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class AnnotationBatch:
    batch_id: str
    vendor: str
    total_examples: int
    labels: List[Dict]         # [{"id": "ex_001", "label": "positive", "annotator": "a1"}, ...]
    double_labels: List[Dict]  # [{"id": "ex_001", "label_a": "positive", "label_b": "negative"}, ...]

    @property
    def inter_annotator_agreement(self) -> float:
        """Cohen's kappa between two annotators on double-labeled examples."""
        if not self.double_labels:
            return 0.0  # No overlap sample: agreement is unmeasured, so QA should fail
        total = len(self.double_labels)
        agreements = sum(
            1 for d in self.double_labels
            if d["label_a"] == d["label_b"]
        )
        p_observed = agreements / total
        # Chance agreement: product of each annotator's marginal label frequencies
        counts_a = Counter(d["label_a"] for d in self.double_labels)
        counts_b = Counter(d["label_b"] for d in self.double_labels)
        p_chance = sum(
            (counts_a[k] / total) * (counts_b[k] / total) for k in counts_a
        )
        if p_chance == 1.0:
            return 1.0  # Both annotators used one identical label throughout
        return (p_observed - p_chance) / (1 - p_chance)

    @property
    def label_distribution(self) -> Dict[str, float]:
        """Check for label imbalance."""
        counts = Counter(l["label"] for l in self.labels)
        total = sum(counts.values())
        return {k: v/total for k, v in counts.items()}

    def quality_report(self) -> Dict:
        iaa = self.inter_annotator_agreement
        dist = self.label_distribution
        issues = []
        if iaa < 0.6:
            issues.append(f"LOW AGREEMENT: kappa={iaa:.2f} (need >= 0.6)")
        for label, pct in dist.items():
            if pct > 0.9 or pct < 0.05:
                issues.append(f"IMBALANCE: '{label}' is {pct:.0%} of labels")
        return {
            "batch_id": self.batch_id,
            "agreement": iaa,
            "distribution": dist,
            "issues": issues,
            "status": "PASS" if not issues else "FAIL"
        }

DEBUG: When Data Blocks the Sprint

Problem	Diagnosis	Emergency fix	Systemic fix
Vendor missed delivery date	Scope was unclear, vendor underestimated	Use partial batch, reduce experiment scope	Send 10% pilot batch first, validate quality and timeline
Labels are low quality (kappa < 0.6)	Annotation guidelines are ambiguous	Re-label a subset with your own team	Create detailed guidelines with examples, run calibration sessions
Data has PII that blocks legal review	Nobody checked before sending to vendor	Apply PII scrubbing pipeline, re-annotate	PII check is a gate BEFORE annotation starts
"Which dataset did we train on?"	No data versioning	Hash the dataset file, log it in experiment tracker	DVC + data registry + version in every experiment config

FRONTIER: Synthetic Data and Active Learning

Synthetic data generation: Use LLMs to generate training data. GPT-4 can create thousands of labeled examples for pennies. The catch: synthetic data biases toward the generating model's worldview. Always validate synthetic data against real-world eval sets. Best practice: 80% real data + 20% synthetic for edge cases.

Active learning: Instead of labeling 10K random examples, train a model on 1K, then have it flag the examples it's most uncertain about. Label those first. On tasks where it works well, active learning can reach the same accuracy with roughly a third of the labeled data, saving weeks of annotation time.
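
The "flag the most uncertain examples" step is often done with uncertainty sampling. The sketch below ranks unlabeled examples by the entropy of the model's predicted class probabilities; the function names and example ids are illustrative, and entropy is only one of several common uncertainty scores.

```python
import math

def entropy(probs):
    """Shannon entropy of one predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions, budget):
    """predictions: {example_id: [class probabilities]}. Returns the `budget` most uncertain ids."""
    ranked = sorted(predictions, key=lambda ex: entropy(predictions[ex]), reverse=True)
    return ranked[:budget]

preds = {
    "ex_1": [0.98, 0.02],  # confident
    "ex_2": [0.55, 0.45],  # very uncertain
    "ex_3": [0.80, 0.20],  # somewhat uncertain
}
print(select_for_labeling(preds, budget=2))  # ['ex_2', 'ex_3']
```

For sprint planning, the annotation batch sent to the vendor becomes the output of this ranking rather than a random sample.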

Your team needs 10,000 labeled examples for next sprint's experiment. The labeling vendor says delivery will take 12 business days. Your sprint is 10 days. What is the BEST approach?

Start the data request NOW (in this sprint) so it arrives by next sprint day 2. Plan experiments that can start with the first 2,000 labels (delivered in batches) while the rest arrive. This is the data lookahead pattern.
Wait for all 10,000 labels before starting any experiments
Skip the labeled data and use synthetic data instead
Ask the vendor to rush the order

Chapter 5: Experiment Tracking & Visibility

The VP asks: "How close are we to shipping the model?" You pull up the researcher's Jupyter notebook. There are 47 cells with names like "test_v3_final_FINAL_2." The loss curve is in a matplotlib plot embedded in cell 23. The best hyperparameters are in a comment on cell 31. The eval results are in a Slack message from last Thursday. This is not experiment tracking. This is chaos.

Experiment tracking is the practice of systematically recording every experiment's configuration, results, and artifacts so that anyone on the team (including future-you) can reproduce, compare, and build on past work. For the Scrum Master, it's also the source of truth for sprint progress. You don't ask "how's the experiment going?" — you look at the dashboard.

CONCEPT: The Experiment Log as Source of Truth

Every experiment produces three types of artifacts that must be tracked:

Artifact type	Examples	Why it matters
Configuration	Model, hyperparameters, dataset version, code commit	Reproducibility: can you rerun this exact experiment?
Metrics	Loss curves, accuracy, F1, latency, cost	Comparison: is this better than the last experiment?
Artifacts	Model weights, eval predictions, error analysis	Promotion: which model file goes to production?

The experiment log replaces the burndown chart. In traditional Scrum, the burndown chart shows progress. In AI Scrum, the experiment log is your burndown. Each row is a completed experiment. The metric column shows whether you're converging on your target. The decision column shows whether the team is making good kill/continue calls. This is what you show in sprint review.
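
To make the "experiment log as burndown" idea concrete, here is a small sketch that turns completed experiments into a running-best view with the remaining gap to target. The `ExperimentRow` fields mirror the metric and decision columns described above; the ids and numbers are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class ExperimentRow:
    experiment_id: str
    metric: float   # e.g. eval accuracy in percent
    decision: str   # killed | promoted | iterating

def convergence_view(rows, target):
    """For each completed experiment, report the running best metric and the remaining gap."""
    view, best = [], float("-inf")
    for row in rows:
        best = max(best, row.metric)
        view.append((row.experiment_id, row.decision, best, round(target - best, 1)))
    return view

rows = [
    ExperimentRow("EXP-1", 82.0, "killed"),
    ExperimentRow("EXP-2", 86.5, "promoted"),
    ExperimentRow("EXP-3", 85.0, "killed"),
    ExperimentRow("EXP-4", 88.3, "iterating"),
]
for line in convergence_view(rows, target=90.0):
    print(line)
```

A shrinking gap column plays the role of the burndown line: a flat gap over several rows is the signal to revisit kill/continue decisions.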

DESIGN: Dashboards for Different Audiences

yaml
# experiment_dashboards.yaml — Three views, one data source

# 1. RESEARCHER VIEW (W&B / MLflow)
researcher_dashboard:
  charts:
    - loss_curves: "Training + validation loss over epochs, per experiment"
    - hyperparameter_sweep: "Parallel coordinates plot of lr, batch_size, etc."
    - confusion_matrix: "Per-class performance on eval set"
    - error_examples: "Top 20 hardest examples the model gets wrong"
  filters: [model_type, dataset_version, date_range]
  refresh: real-time

# 2. SCRUM MASTER VIEW (synthesized from W&B data)
scrum_dashboard:
  cards:
    - experiments_this_sprint: {total: 5, completed: 3, killed: 1, active: 1}
    - best_accuracy: {current: "88.3%", target: "90%", gap: "1.7%"}
    - compute_budget: {used: "72 GPU-hrs", total: "120 GPU-hrs", pct: "60%"}
    - data_readiness: {ready: 3, in_progress: 1, blocked: 1}
  table:
    columns: [experiment_id, hypothesis, status, best_metric, decision]
  refresh: hourly

# 3. STAKEHOLDER VIEW (executive summary)
stakeholder_dashboard:
  cards:
    - project_phase: "Convergence (Week 6 of 16)"
    - headline_metric: "Best model: 88.3% accuracy (target: 90%)"
    - confidence: "High — on track to hit 90% by week 10"
    - next_milestone: "Sprint 4 Review — May 30"
    - risks: "Data vendor delay on safety eval set"
  chart:
    - accuracy_over_time: "Weekly best accuracy with trend line"
  refresh: weekly

CODE: Experiment Logger Integration

python
# experiment_logger.py — Wraps W&B/MLflow for sprint tracking
import wandb
from datetime import datetime

class SprintExperimentLogger:
    def __init__(self, project: str, sprint_id: str):
        self.sprint_id = sprint_id
        self.project = project

    def start_experiment(self, config: dict) -> str:
        """Start a tracked experiment with sprint metadata."""
        run = wandb.init(
            project=self.project,
            config={
                **config,
                "sprint_id": self.sprint_id,
                "started_at": datetime.now().isoformat(),
                "hypothesis": config.get("hypothesis", "Not specified"),
                "compute_budget_hrs": config.get("compute_budget_hrs", 0),
                "kill_criteria": config.get("kill_criteria", "Not specified"),
            },
            tags=[f"sprint-{self.sprint_id}", config.get("model_type", "unknown")]
        )
        return run.id

    def log_decision(self, experiment_id: str, decision: str,
                     final_metric: float, learnings: str):
        """Log the kill/promote/iterate decision on the currently active run."""
        decision_record = {
            "decision": decision,           # killed | promoted | iterating
            "final_metric": final_metric,
            "learnings": learnings,
            "decided_at": datetime.now().isoformat()
        }
        # Decisions are run-level facts, so store them in the run summary,
        # which is where sprint_summary() reads them back from.
        for key, value in decision_record.items():
            wandb.run.summary[key] = value
        # Also update the Jira ticket via API
        self._update_jira_ticket(experiment_id, decision, final_metric)

    def _update_jira_ticket(self, experiment_id: str, decision: str,
                            final_metric: float) -> None:
        """Hook for the ticketing integration. No-op placeholder: override it
        with a call to your own Jira client."""

    def sprint_summary(self) -> dict:
        """Generate sprint review summary from experiment data."""
        api = wandb.Api()
        runs = api.runs(self.project,
                       filters={"config.sprint_id": self.sprint_id})
        summary = {
            "total": 0, "killed": 0, "promoted": 0,
            "iterating": 0, "active": 0,
            "best_metric": 0, "learnings": []
        }
        for run in runs:
            summary["total"] += 1
            decision = run.summary.get("decision", "active")
            # .get() avoids a KeyError if a run carries an unexpected decision value
            summary[decision] = summary.get(decision, 0) + 1
            metric = run.summary.get("final_metric", 0)
            if metric > summary["best_metric"]:
                summary["best_metric"] = metric
            learning = run.summary.get("learnings")
            if learning:
                summary["learnings"].append(learning)
        return summary

Translating Experiments to Business Metrics

Stakeholders don't care about accuracy. They care about outcomes. Your job is to translate:

ML metric	Business translation	How to present
Accuracy: 72% → 88%	"We went from deflecting 72% of support tickets to 88%. At 10,000 tickets per week, that's 1,600 fewer human-handled tickets per week"	Show the dollar savings: 1,600 tickets × $5 avg cost = $8K/week saved
Latency: 2s → 400ms	"Customer wait time dropped from 2 seconds to under half a second"	Show before/after UX recording
Hallucination rate: 8% → 2%	"The chatbot now gives wrong answers 1 in 50 times instead of 1 in 12"	Show specific examples of prevented hallucinations
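
The first row's arithmetic can be kept in a small helper so the translation is repeatable at every sprint review. This is a sketch: the weekly volume of 10,000 tickets is an assumption implied by the 1,600-ticket figure in the table, and the function name is illustrative.

```python
def weekly_savings(old_rate, new_rate, weekly_tickets, cost_per_ticket):
    """Translate a deflection-rate improvement into tickets and dollars per week."""
    extra_deflected = round((new_rate - old_rate) * weekly_tickets)
    return extra_deflected, extra_deflected * cost_per_ticket

# Assumed volume: 10,000 tickets/week at $5 average handling cost
tickets, dollars = weekly_savings(0.72, 0.88, weekly_tickets=10_000, cost_per_ticket=5)
print(f"{tickets:,} fewer human-handled tickets, ${dollars:,}/week saved")
# 1,600 fewer human-handled tickets, $8,000/week saved
```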

FRONTIER: Automated Experiment Summarization

LLM-powered experiment summaries: Feed your W&B experiment logs to an LLM at the end of each sprint. Generate: (1) Plain-English summary of what was tried and learned. (2) Recommendations for next sprint's experiments. (3) Risk assessment: "At current trajectory, we'll hit target in 3 sprints." This becomes your sprint review presentation.

The VP asks "How's the AI project going?" Your best model accuracy is 88.3% against a 90% target. What is the most useful response?

"88.3% accuracy on our eval set"
"We're making good progress"
"We're at 88.3% accuracy, 1.7% from our 90% target. Based on the last 4 experiments' improvement rate, we expect to hit 90% in 2 sprints. The main risk is our safety eval data is delayed, which could push the timeline by one sprint."
"We need more time and compute"

Chapter 6: Research-to-Production Handoffs

The researcher posts in Slack: "Model is ready! 92% accuracy! Here's the notebook." The ML engineer opens the notebook. It imports from a local path that doesn't exist on the production server. The data preprocessing uses a different tokenizer than the inference pipeline. The model was trained on Python 3.11 with PyTorch 2.1, but production runs Python 3.9 with PyTorch 1.13. The "92% accuracy" was measured on a test set that accidentally overlapped with the training set. This is the "works in notebook, breaks in prod" problem, and it kills more AI projects than bad models.

CONCEPT: The Notebook-to-Production Gap

Research and production have fundamentally different requirements. A researcher optimizes for speed of iteration. A production engineer optimizes for reliability and scale. The gap between them is where AI projects die.

Dimension	Research	Production	The gap
Code	Jupyter notebook, quick and dirty	Tested, typed, packaged Python modules	Rewrite everything
Data	Local CSV, ad-hoc preprocessing	Versioned datasets, pipeline DAGs	Different preprocessing = different results
Deps	Whatever pip installed today	Locked requirements, container images	Version conflicts, CUDA mismatches
Infra	Single GPU, batch processing	Multi-GPU, real-time, auto-scaling	10x latency at scale
Eval	"Looks good on my test set"	Automated eval suite, A/B test in prod	Offline eval ≠ online performance
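
The Data row is the easiest gap to test for directly: feed the same raw inputs through both preprocessing paths and list where they disagree. The two pipeline functions below are hypothetical stand-ins (research lowercases text, serving does not), used only to show the shape of the check.

```python
def preprocessing_parity(samples, train_preprocess, serve_preprocess):
    """Return the raw inputs for which the two pipelines produce different outputs."""
    return [s for s in samples if train_preprocess(s) != serve_preprocess(s)]

# Hypothetical pipelines: the research code lowercases text, the serving code forgot to.
def train_preprocess(text):
    return text.strip().lower().split()

def serve_preprocess(text):
    return text.strip().split()

mismatches = preprocessing_parity(
    ["refund please", "Where is my ORDER?"], train_preprocess, serve_preprocess
)
print(mismatches)  # ['Where is my ORDER?']
```

An empty list on a representative sample is a reasonable gate before any accuracy number from research is quoted for production.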

The handoff checklist is your most important artifact. Every model that crosses the research-to-production boundary must pass through a checklist. No exceptions. No "we'll fix it in production." The checklist catches 80% of production failures before they happen.

DESIGN: The Production Readiness Checklist

markdown
# Model Production Readiness Checklist

## 1. Reproducibility
- [ ] Training code runs from a single command (not a notebook)
- [ ] All dependencies pinned in requirements.txt / pyproject.toml
- [ ] Docker image builds and runs successfully
- [ ] Random seeds documented; results reproducible within 1%
- [ ] Dataset version tracked (DVC hash or equivalent)

## 2. Evaluation
- [ ] Eval suite passes on PRODUCTION eval set (not training set)
- [ ] No data leakage: train/eval sets verified disjoint
- [ ] Metrics run 5x with different seeds (variance documented)
- [ ] Compared against current production model (no regression)
- [ ] Safety eval passes (toxicity, hallucination, PII)
- [ ] Latency measured under production-like load

## 3. Integration
- [ ] Input/output schema matches API contract
- [ ] Preprocessing pipeline is IDENTICAL to training preprocessing
- [ ] Model serves via the production serving framework (TorchServe, vLLM, etc.)
- [ ] Error handling for malformed inputs
- [ ] Graceful degradation when model times out

## 4. Operational Readiness
- [ ] Model card written (purpose, limitations, biases)
- [ ] Monitoring dashboards configured (accuracy, latency, error rate)
- [ ] Alerting rules set (accuracy drops > 5%, latency p99 > 2x)
- [ ] Rollback procedure tested (revert to previous model in < 5 min)
- [ ] A/B test configured (serve new model to 5%, measure, then ramp)

## 5. Sign-offs
- [ ] ML researcher: "Model meets eval criteria"
- [ ] ML engineer: "Inference pipeline passes integration tests"
- [ ] Data engineer: "Data pipeline feeds correct data"
- [ ] Product manager: "Feature meets user requirements"
- [ ] Security/Legal: "Model complies with policies" (if applicable)
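
The "metrics run 5x with different seeds" item is simple to automate. The sketch below summarizes per-seed scores and flags whether they fall within a tolerance; reading "reproducible within 1%" as a max-minus-min spread of at most 0.01 is an assumption, as are the example scores.

```python
import statistics

def seed_variance_report(scores, tolerance=0.01):
    """scores: one eval metric per seed, as fractions (e.g. 0.912). Needs at least 2 seeds."""
    spread = max(scores) - min(scores)
    return {
        "mean": round(statistics.mean(scores), 4),
        "stdev": round(statistics.stdev(scores), 4),
        "spread": round(spread, 4),
        "reproducible": spread <= tolerance,  # assumed reading of "within 1%"
    }

report = seed_variance_report([0.912, 0.915, 0.909, 0.914, 0.911])
print(report["mean"], report["spread"], report["reproducible"])
```

Attaching this report to the handoff ticket gives the sign-off step a documented variance instead of a single headline number.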

CODE: Handoff Automation

python
# handoff_validator.py — Automated checks for production readiness
from pathlib import Path

class HandoffValidator:
    def check_reproducibility(self, model_dir: Path) -> dict:
        checks = {}
        # 1. requirements.txt exists, is non-empty, and every requirement is pinned
        req_file = model_dir / "requirements.txt"
        requirements = []
        if req_file.exists():
            requirements = [
                line.strip() for line in req_file.read_text().splitlines()
                if line.strip() and not line.strip().startswith("#")
            ]
        # An empty file must not count as "pinned" (all() of nothing is True)
        checks["deps_pinned"] = bool(requirements) and all(
            "==" in line for line in requirements
        )
        # 2. Dockerfile exists
        checks["dockerfile"] = (model_dir / "Dockerfile").exists()
        # 3. No notebooks in production code
        checks["no_notebooks"] = not any(model_dir.glob("**/*.ipynb"))
        # 4. Data version tracked
        checks["data_versioned"] = (
            (model_dir / ".dvc").exists() or
            (model_dir / "data_version.json").exists()
        )
        return checks

    def check_eval_integrity(self, eval_config: dict) -> dict:
        checks = {}
        # Verify train/eval sets are disjoint
        train_ids = set(eval_config["train_ids"])
        eval_ids = set(eval_config["eval_ids"])
        overlap = train_ids & eval_ids
        checks["no_data_leakage"] = len(overlap) == 0
        if overlap:
            checks["leaked_ids"] = list(overlap)[:10]
        return checks

    def full_check(self, model_dir: Path, eval_config: dict) -> dict:
        repro = self.check_reproducibility(model_dir)
        eval_int = self.check_eval_integrity(eval_config)
        all_checks = {**repro, **eval_int}
        passed = all(v if isinstance(v, bool) else True for v in all_checks.values())
        return {"passed": passed, "checks": all_checks}

DEBUG: Common Handoff Failures

Failure	Root cause	Detection	Prevention
Model accuracy drops 10% in prod	Preprocessing differs between training and serving	Run eval suite through the SERVING pipeline, not research pipeline	Share preprocessing code between training and serving
Model loads but crashes on edge cases	Input validation missing in serving code	Fuzz testing with malformed inputs	Input schema validation in serving layer
Latency 5x slower than expected	Research used batch processing; prod needs single-request	Load test before promotion	Latency target in experiment ticket
"92% accuracy" was on contaminated eval	Train/eval overlap	Handoff validator catches it	Eval set is created and frozen before ANY training begins

FRONTIER: MLOps as Code

The handoff disappears when research and production share infrastructure. Tools like MLflow Model Registry, Vertex AI Pipelines, and Amazon SageMaker create a continuous path from experiment to deployment. The researcher promotes a model version. The CI/CD pipeline runs the eval suite, builds the container, and deploys with canary rollout. No human handoff needed. The Scrum Master monitors the pipeline instead of coordinating people.

A researcher says their model achieves 95% accuracy and is "ready for production." What is the FIRST thing you should verify?

That the model can handle production latency requirements
That the training and evaluation sets are verified disjoint (no data leakage), because the most common reason for surprisingly high accuracy is accidentally evaluating on training data
That the model has a Dockerfile
That the product manager has approved it

Chapter 7: Stakeholder Communication for AI

The CEO walks into your sprint review. The team just achieved 87% accuracy on the customer support chatbot. The CEO asks: "Is 87% good?" The researcher starts explaining F1 scores and confusion matrices. The CEO's eyes glaze over. You step in: "It means the chatbot gives the right answer 87 times out of 100. Our target is 92 — right now, 13 out of 100 customers would get a wrong answer, which would frustrate them. We need five more percentage points. Based on our experiments, we'll get there in about three weeks."

That's the job. Translating uncertainty into actionable information that non-technical people can make decisions with. It's possibly the single most valuable skill an AI Scrum Master has.

CONCEPT: The Three Lies of AI Timelines

Stakeholders ask three questions. Each one invites a lie:

Question	The lie	The truth	How to say it
"When will it be ready?"	"End of Q2"	We don't know. AI timelines are probabilistic.	"Based on our current trajectory, there's a 70% chance we hit the target by end of Q2. The 30% risk is data quality issues."
"How accurate is it?"	"87% accurate"	87% on our eval set — which might not represent real-world usage.	"87% on our test set of 500 questions. We expect 80-85% in production because real users ask harder questions."
"Can you just add [feature]?"	"Sure, next sprint"	Each new capability requires a new eval suite, new data, new experiments.	"We can prototype it, but validating it to production quality is 4-6 weeks of experiments."

"The model is 87% accurate" means nothing without context. Always provide: (1) Accuracy on WHAT data? (2) Compared to WHAT baseline? (3) What does a wrong answer LOOK LIKE? An 87% accurate spam filter is fine. An 87% accurate medical diagnosis system is dangerous. Context determines whether 87% is a celebration or a crisis.

DESIGN: The Stakeholder Communication Framework

markdown
# AI Project Status Report Template

## One-Line Summary
[Current metric] / [Target metric] — [Trajectory statement]
Example: "88% / 92% — On track for 92% in Sprint 7 (3 weeks)"

## Progress Since Last Report
- Experiments completed: 4 (2 killed, 1 promoted, 1 iterating)
- Key learning: [What we learned that changes our approach]
- Best metric improvement: [X% → Y%] via [what technique]

## Risks and Blockers
| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Data vendor delay | 1 sprint slip | Medium | Using synthetic data as bridge |
| GPU shortage | Can't run 2 experiments in parallel | Low | Reserved spot instances |

## What We Need from Leadership
- [ ] Approval for $3K additional labeling budget
- [ ] Decision: ship at 90% or wait for 92%?
- [ ] Legal review scheduled before launch date

## Next Milestone
[What we'll demonstrate at the next sprint review]

Managing Timeline Expectations

Use confidence cones instead of single-point estimates. A confidence cone shows the range of possible outcomes and how that range narrows as the project progresses: early estimates have wide ranges, and as experiments provide data, the range shrinks.
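
One rough way to produce a range instead of a single point is to project from the spread of past per-sprint gains. The sketch below uses the mean gain plus or minus one standard deviation as fast and slow scenarios; that choice, the function name, and the example numbers are all assumptions, and it is not a calibrated probability estimate.

```python
import statistics

def eta_range(best_by_sprint, target):
    """Return (optimistic, pessimistic) sprints-to-target from per-sprint best metrics.

    Fast and slow scenarios are the mean gain +/- one standard deviation.
    pessimistic is None when the slow scenario never reaches the target.
    """
    gains = [b - a for a, b in zip(best_by_sprint, best_by_sprint[1:])]
    if len(gains) < 2:
        raise ValueError("need at least three sprints of data")
    gap = target - best_by_sprint[-1]
    if gap <= 0:
        return 0.0, 0.0
    mean, spread = statistics.mean(gains), statistics.stdev(gains)
    fast, slow = mean + spread, mean - spread
    return gap / fast, (gap / slow if slow > 0 else None)

# Best accuracy (percent) at the end of four sprints, target 92
print(eta_range([70, 78, 84, 88], target=92))  # (0.5, 1.0)
```

With few sprints of history the gains vary widely and the range is broad; as more sprints accumulate, the spread typically stabilizes and the range tightens, which is the cone narrowing.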

CODE: Automated Status Report Generator

python
# status_report.py — Generate stakeholder-friendly status from experiment data
import math
from datetime import datetime

def generate_status_report(experiments: list, target_metric: float,
                            sprint_num: int, total_sprints: int) -> str:
    completed = [e for e in experiments if e["status"] != "active"]
    best = max((e["best_metric"] for e in completed), default=0)
    gap = target_metric - best
    killed = len([e for e in completed if e["status"] == "killed"])
    promoted = len([e for e in completed if e["status"] == "promoted"])

    # Estimate sprints to target based on improvement rate
    metrics_by_sprint = {}  # Group best metric per sprint
    for e in completed:
        s = e.get("sprint", sprint_num)
        if s not in metrics_by_sprint or e["best_metric"] > metrics_by_sprint[s]:
            metrics_by_sprint[s] = e["best_metric"]

    if gap <= 0:
        # Without this guard a met target would produce a negative ETA
        eta = "Target already reached"
    elif len(metrics_by_sprint) >= 2:
        sprints = sorted(metrics_by_sprint.keys())
        improvement_per_sprint = (
            metrics_by_sprint[sprints[-1]] - metrics_by_sprint[sprints[0]]
        ) / (sprints[-1] - sprints[0])
        if improvement_per_sprint > 0:
            # Round up: a partial sprint still costs a full sprint
            sprints_to_target = math.ceil(gap / improvement_per_sprint)
            sprints_left = total_sprints - sprint_num
            eta = (f"~{sprints_to_target} sprints at current rate "
                   f"({sprints_left} left in plan)")
        else:
            eta = "STALLED: no improvement across completed sprints"
    else:
        eta = "Insufficient data for estimate"

    return f"""
# Sprint {sprint_num} Status ({datetime.now().strftime('%B %d')})
Best metric: {best:.1%} / Target: {target_metric:.1%} (gap: {gap:.1%})
ETA to target: {eta}
Experiments: {len(completed)} completed ({killed} killed, {promoted} promoted)
"""

DEBUG: When Stakeholder Communication Breaks

Symptom	Root cause	Fix
Executives surprise-ask for features mid-sprint	They don't understand the experiment cycle	Educate: "Each new capability = 2-4 sprint experiment cycle"
PM overpromises to customers	You gave a best-case estimate without the range	Always give confidence ranges: "70% chance by June, 90% by July"
Team is demoralized by "failed" experiments	Success is framed as accuracy numbers, not learning	Reframe: every sprint review starts with "What we learned"
Board thinks AI is a waste of money	No connection between experiments and business value	Translate every metric to dollars or customer impact

FRONTIER: AI-Generated Stakeholder Reports

LLM-powered reporting: Your experiment tracking system feeds into an LLM that generates weekly stakeholder updates. It reads the experiment logs, identifies key learnings, calculates trajectory, and writes the report in executive-friendly language. You review and send. This saves 2-3 hours per week and ensures consistency.

Your AI model achieves 87% accuracy. The target is 92%. A stakeholder asks: "Can we ship at 87%?" What information do you need to provide for a good decision?

"No, we need 92%"
"Yes, 87% is close enough"
Show the concrete impact: "At 87%, 13 out of 100 users get wrong answers. Here are 5 examples of what wrong answers look like. The cost of these errors is approximately $X/week. Reaching 92% would take ~3 more sprints. The business decision is: is $X/week of errors acceptable for 6 weeks of earlier launch?"
"Let me check with the researcher"

Chapter 8: GenAI/LLM-Specific Scrum

Your team is building a customer support chatbot powered by an LLM. The "code" is a 500-word system prompt. The "testing" is running 200 customer questions and having three humans grade the answers. The "deployment" is changing the model name in an API call from GPT-3.5-turbo to GPT-4o. The "performance optimization" is rewriting a paragraph of the prompt. Nothing about this looks like traditional software development, and your sprint process needs to reflect that.

CONCEPT: Prompt Engineering as a Sprint Activity

Prompt engineering sprints are the new unit of work for GenAI teams. A single prompt change can shift model behavior more than weeks of fine-tuning. But prompt changes are also unpredictable — a change that improves one capability can degrade another.

Activity	Traditional ML equivalent	Sprint time
System prompt iteration	Architecture search	2-5 days per major revision
Few-shot example curation	Training data curation	1-3 days
RAG pipeline tuning	Feature engineering	1-2 weeks
Eval set creation	Test suite authoring	3-5 days (ongoing)
Model migration (GPT-4 → Claude)	Framework migration	2-4 weeks (prompt rewriting + re-eval)
Fine-tuning	Model training	1-2 weeks (data prep + training + eval)

Eval-driven development for LLMs. In traditional software, you write tests first, then code. In GenAI, you write evals first, then prompts. Every prompt change MUST be validated against the eval set before merging. This is non-negotiable. A "small prompt tweak" can catastrophically degrade one category of answers while improving another.
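
A merge gate for prompt changes can be as small as a per-category comparison against the current baseline. The sketch below fails a candidate if any category drops by more than a tolerance; the 1-point tolerance and category names are assumptions, and the scores mirror the refund scenario in this chapter's quiz.

```python
def regression_gate(baseline, candidate, tolerance=1.0):
    """Compare per-category eval scores (percent). Returns (ok, regressions)."""
    regressions = {
        cat: (baseline[cat], candidate.get(cat, 0.0))
        for cat in baseline
        if candidate.get(cat, 0.0) < baseline[cat] - tolerance
    }
    return (not regressions), regressions

baseline = {"refunds": 78.0, "general_qa": 89.0}
candidate = {"refunds": 91.0, "general_qa": 82.0}
ok, regressions = regression_gate(baseline, candidate)
print(ok, regressions)  # False {'general_qa': (89.0, 82.0)}
```

Gating on categories rather than one aggregate score is the point: the aggregate here improves while one category quietly degrades.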

DESIGN: The GenAI Sprint Structure

yaml
# genai_sprint_template.yaml

sprint_week_1:
  monday:
    - Review last sprint's eval results
    - Prioritize prompt improvements by impact
    - Assign RAG pipeline experiments

  tuesday_thursday:
    - Prompt engineering: iterate on system prompt
    - RAG experiments: test different chunking, retrieval, reranking
    - Run eval suite after EACH significant change
    - Daily eval check-in: "What moved? What regressed?"

  friday:
    - Eval freeze: run full eval suite on best candidates
    - Document prompt changelog (version control the prompt!)

sprint_week_2:
  monday_wednesday:
    - A/B test top 2 prompt versions with real traffic (5% canary)
    - Fine-tuning experiment (if applicable)
    - RAG pipeline: index new documents, test retrieval quality

  thursday:
    - Production promotion decision
    - Sprint review prep: compile eval results + business metrics

  friday:
    - Sprint review: show before/after on key scenarios
    - Retrospective: what eval gaps did we discover?
    - Plan next sprint's eval set improvements

CODE: Prompt Version Control

python
# prompt_registry.py — Version control for prompts
import difflib
import hashlib
import json
from datetime import datetime
from pathlib import Path
from typing import Optional

class PromptRegistry:
    def __init__(self, registry_path: str = "prompts/"):
        self.path = Path(registry_path)
        # parents=True so nested registry paths are created on first use
        self.path.mkdir(parents=True, exist_ok=True)

    def register(self, name: str, prompt: str,
                  metadata: Optional[dict] = None) -> str:
        """Register a prompt version with hash-based versioning."""
        version = hashlib.sha256(prompt.encode()).hexdigest()[:8]
        record = {
            "name": name,
            "version": version,
            "prompt": prompt,
            "created_at": datetime.now().isoformat(),
            "metadata": metadata or {},
            "char_count": len(prompt),
            "word_count": len(prompt.split()),
        }
        filepath = self.path / f"{name}_v{version}.json"
        filepath.write_text(json.dumps(record, indent=2))
        return version

    def compare(self, name: str, v1: str, v2: str) -> dict:
        """Diff two prompt versions: size change plus a line-level text diff."""
        p1 = json.loads((self.path / f"{name}_v{v1}.json").read_text())
        p2 = json.loads((self.path / f"{name}_v{v2}.json").read_text())
        text_diff = list(difflib.unified_diff(
            p1["prompt"].splitlines(), p2["prompt"].splitlines(),
            fromfile=f"{name}_v{v1}", tofile=f"{name}_v{v2}", lineterm=""
        ))
        return {
            "v1_words": p1["word_count"],
            "v2_words": p2["word_count"],
            "delta_words": p2["word_count"] - p1["word_count"],
            "v1_date": p1["created_at"],
            "v2_date": p2["created_at"],
            "diff": text_diff,
        }

DESIGN: RAG Pipeline Sprint Workflow

RAG (Retrieval-Augmented Generation) is the backbone of most production GenAI applications. Each component of the RAG pipeline is a separate experiment axis:

Chunking Strategy

How documents are split: fixed-size, semantic, recursive. Each choice affects retrieval quality.

↓

Embedding Model

Which model converts text to vectors. Trade-offs: speed vs. quality vs. cost. OpenAI, Cohere, open-source.

↓

Retrieval

Vector search, BM25, hybrid. Top-k selection. Reranking with cross-encoders.

↓

Synthesis

System prompt + retrieved context + user query = LLM response. Prompt template matters enormously.

Test one variable at a time. The biggest mistake in RAG development is changing the chunking strategy, embedding model, AND prompt template simultaneously. You can't learn anything because you don't know what caused the change. Run experiments that isolate one variable. This is slower but produces actionable results.
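
Isolating one variable is easy to enforce mechanically by generating experiment configs from a baseline. In this sketch each generated config differs from the baseline in exactly one factor; the factor names follow the pipeline stages above, and values such as "model-a" are placeholders.

```python
def one_factor_variants(baseline, alternatives):
    """Yield (changed_factor, config) pairs that each differ from baseline in one factor."""
    for factor, options in alternatives.items():
        for option in options:
            if option != baseline[factor]:
                # Copy the baseline and override only the factor under test
                yield factor, {**baseline, factor: option}

baseline = {"chunking": "fixed-size", "embedding": "model-a", "retrieval": "vector"}
alternatives = {
    "chunking": ["semantic", "recursive"],
    "retrieval": ["bm25", "hybrid"],
}
for factor, config in one_factor_variants(baseline, alternatives):
    print(factor, config)
```

Each printed config maps to one experiment ticket, so any eval movement can be attributed to the single factor named on that ticket.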

Model Migration Planning

Every 6-12 months, a new model generation launches (GPT-4 → GPT-4o → GPT-5, Claude 3 → Claude 4). Migration is a multi-sprint project:

yaml
# model_migration_plan.yaml — Claude 3.5 → Claude 4

sprint_1_eval:
  - Run FULL eval suite on new model with existing prompts
  - Identify regressions (new model is different, not just better)
  - Benchmark latency and cost differences
  - Decision: is the upgrade worth the migration effort?

sprint_2_prompt_adaptation:
  - Rewrite prompts for new model's capabilities/quirks
  - New model may need less hand-holding (remove workarounds)
  - New model may have different failure modes (add guardrails)
  - Run eval suite after each prompt revision

sprint_3_integration:
  - Update API integration (new endpoints, parameters)
  - Update token budgets (new model may have different context window)
  - Load test under production traffic patterns
  - Canary deployment: 5% traffic to new model

sprint_4_rollout:
  - Monitor canary for 1 week
  - Ramp to 50%, then 100%
  - Keep old model warm for 2 weeks (rollback safety)
  - Update documentation and model card

FRONTIER: Multi-Model Orchestration

Router models: The frontier isn't using one LLM for everything. It's using a cheap, fast model (GPT-3.5/Haiku) for simple queries and routing complex ones to expensive models (GPT-4/Opus). Sprint planning for multi-model systems involves optimizing the router as well as the models. This is a new experiment axis that didn't exist two years ago.
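
To show why the router is its own experiment axis, here is a deliberately simple heuristic router. Production routers are often learned classifiers; the keyword list, word limit, and model names below are placeholders chosen for illustration, not recommendations.

```python
def route(query, cheap_model="small-fast-model", strong_model="large-capable-model",
          max_simple_words=20):
    """Toy router: short queries without escalation keywords go to the cheap model."""
    hard_signals = ("refund", "legal", "complaint", "cancel")  # hypothetical keywords
    text = query.lower()
    is_hard = len(text.split()) > max_simple_words or any(sig in text for sig in hard_signals)
    return strong_model if is_hard else cheap_model

print(route("What are your opening hours?"))       # small-fast-model
print(route("I want a refund for my last order"))  # large-capable-model
```

Every constant in this function is tunable, which is why router changes need the same eval-before-merge discipline as prompt changes.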

Your team changes the system prompt to improve the chatbot's handling of refund questions. Refund accuracy goes from 78% to 91%. But you notice the general Q&A accuracy dropped from 89% to 82%. What is the correct sprint action?

Roll back the prompt change. A regression on general Q&A outweighs the refund improvement. Investigate why the change caused a regression (likely the new instructions conflict with general answering patterns), then iterate on a prompt that improves refunds WITHOUT degrading general Q&A.
Ship it: refund improvement is more important
Use two separate prompts for refund and general questions
Ask the researcher to investigate next sprint

Chapter 9: Agentic AI Project Management

Your team is building an AI agent that can research a topic, write a report, and email it to a customer. In testing, the agent works beautifully 85% of the time. The other 15%? It sends emails to the wrong person. It cites sources that don't exist. It writes a report about the wrong topic because it misinterpreted the request. And once, memorably, it entered an infinite loop and sent 47 emails before anyone noticed.

Agentic AI — systems where an LLM takes actions in the real world (calling APIs, executing code, modifying databases, sending communications) — is the most unpredictable type of AI project to manage. The failure modes aren't just "wrong answer." They're "wrong action with real-world consequences."

CONCEPT: Why Agents Are Harder to Manage

Dimension	Traditional ML	Chatbot/LLM	Agentic AI
Failure mode	Wrong prediction	Wrong answer	Wrong ACTION (sends email, deletes data, charges money)
Blast radius	One user sees wrong result	One user gets bad answer	Agent modifies external systems irreversibly
Testing	Eval set, accuracy metrics	Human grading, automated evals	End-to-end trajectory testing, sandbox environments
Debugging	Check model weights, features	Read the prompt, check context	Trace multi-step reasoning across tool calls
Sprint predictability	Low (experiments)	Medium (prompt iteration is fast)	Very low (emergent behavior from tool combinations)

Agent development is iterative and unpredictable. An agent's behavior emerges from the interaction between its prompt, its tools, and its reasoning. You can't predict what an agent will do just by reading its code. You discover behavior by testing — extensively, in sandbox environments, with real-world-like scenarios. Plan your sprints around testing, not building.

DESIGN: Agent Development Sprint Structure

yaml
# agent_sprint_structure.yaml

# Phase 1: Tool Integration (1-2 sprints per tool)
tool_sprints:
  each_tool:
    - Define tool's API contract (input/output schemas)
    - Implement tool with error handling and rate limiting
    - Write unit tests for the tool in isolation
    - Write integration test: agent calls tool correctly
    - Write adversarial test: agent handles tool failure gracefully
    - Safety review: what happens if agent misuses this tool?

# Phase 2: Behavior Testing (ongoing, every sprint)
behavior_testing:
  trajectory_tests:
    - Define 50+ test scenarios with expected action sequences
    - Run agent in sandbox, record full trajectory
    - Grade: correct actions? correct order? no harmful actions?
    - Regression test: adding tool B didn't break tool A behavior

  adversarial_tests:
    - Prompt injection: user tries to make agent do unauthorized actions
    - Edge cases: what if the tool returns an error?
    - Loops: does the agent ever enter infinite tool-calling loops?
    - Scope creep: does the agent stay within its defined capabilities?

# Phase 3: Safety Review Gates
safety_gates:
  before_sandbox: "Agent can only call mock tools"
  before_staging: "Agent calls real tools but in test environment"
  before_production: "Full safety review, rate limits, kill switch"
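
The "Grade" step in the trajectory tests can be sketched as three checks on a recorded tool sequence. This assumes a trajectory is reduced to a list of tool names and that extra steps between expected ones are acceptable; the function name and the forbidden-tool set are hypothetical:

```python
from typing import Dict, List, Set

def grade_trajectory(expected: List[str],
                     actual: List[str],
                     forbidden: Set[str]) -> Dict[str, bool]:
    """Grade a recorded tool sequence against an expected action sequence."""
    # Subsequence check: each expected step must appear after the previous one.
    remaining = iter(actual)
    in_order = all(step in remaining for step in expected)
    return {
        "all_expected_called": set(expected) <= set(actual),
        "correct_order": in_order,
        "no_harmful_actions": not (set(actual) & forbidden),
    }

expected = ["search", "write_report", "send_email"]
forbidden = {"delete_record"}  # hypothetical harmful tool

print(grade_trajectory(expected, ["search", "read_doc", "write_report", "send_email"], forbidden))
print(grade_trajectory(expected, ["write_report", "search", "send_email"], forbidden))
```

Run the same grader across all scenarios every sprint, so that adding a new tool shows up as a change in pass rate rather than as a surprise in staging.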

CODE: Agent Trajectory Logger

python
# agent_trajectory.py — Log and analyze agent action sequences
from dataclasses import dataclass, field
from typing import List, Optional
from datetime import datetime

@dataclass
class AgentStep:
    step_num: int
    thought: str              # Agent's reasoning
    tool_name: str            # Which tool it called
    tool_input: dict          # What it passed to the tool
    tool_output: str          # What the tool returned
    timestamp: datetime = field(default_factory=datetime.now)

@dataclass
class Trajectory:
    task: str
    steps: List[AgentStep] = field(default_factory=list)
    final_output: Optional[str] = None
    success: Optional[bool] = None

    @property
    def tool_sequence(self) -> List[str]:
        return [s.tool_name for s in self.steps]

    @property
    def has_loop(self) -> bool:
        """Detect three consecutive identical tool calls (same tool, same input)."""
        for i in range(len(self.steps) - 2):
            if (self.steps[i].tool_name == self.steps[i+1].tool_name
                    == self.steps[i+2].tool_name):
                if (self.steps[i].tool_input == self.steps[i+1].tool_input
                        == self.steps[i+2].tool_input):
                    return True
        return False

    @property
    def unauthorized_actions(self) -> List[AgentStep]:
        """Flag steps that used tools outside the allowed set."""
        allowed = {"search", "read_doc", "write_report", "send_email"}
        return [s for s in self.steps if s.tool_name not in allowed]
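
The has_loop check above only fires on back-to-back repeats. An agent that alternates between two calls (search, send_email, search, send_email, and so on) would slip past it. A complementary sketch, assuming each call is reduced to a (tool_name, tool_input) pair and that three identical calls anywhere in a trajectory is suspicious; the threshold and function name are assumptions:

```python
import json
from collections import Counter
from typing import List, Tuple

Call = Tuple[str, dict]  # (tool_name, tool_input)

def repeated_calls(calls: List[Call], threshold: int = 3) -> List[str]:
    """Return calls that appear `threshold` or more times, consecutive or not."""
    counts = Counter(
        # Serialize the input so dicts with the same content compare equal.
        (name, json.dumps(args, sort_keys=True, default=str))
        for name, args in calls
    )
    return [f"{name} {args}" for (name, args), n in counts.items() if n >= threshold]

alternating = [("search", {"q": "pricing"}), ("send_email", {"to": "a@example.com"})] * 3
print(repeated_calls(alternating))
```

Legitimate polling tools will trip this check, so treat a hit as a flag for review rather than an automatic failure.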

Multi-Agent Coordination

Some systems use multiple agents that collaborate: a planner agent, a researcher agent, a writer agent, and a reviewer agent. Coordinating multi-agent systems in sprints requires treating agent interactions as integration points:

Sprint activity	Single agent	Multi-agent
Testing	Test one agent's behavior	Test agent HANDOFFS: does agent A's output format match agent B's expected input?
Debugging	Read one trajectory	Trace across agent boundaries: which agent introduced the error?
Planning	One set of capabilities	Dependency graph: agent C can't be developed until agent A's API stabilizes
Safety	One agent's action space	Emergent behavior: agents A and B are safe alone, but together they escalate privileges
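
Handoff testing from the table above can start as a plain contract check on the payload one agent passes to the next. A minimal sketch, assuming handoffs are dictionaries; the contract fields for the writer agent are hypothetical:

```python
from typing import Any, Dict, List

# Hypothetical contract: what the writer agent expects from the researcher agent.
WRITER_INPUT_CONTRACT: Dict[str, type] = {
    "topic": str,
    "sources": list,
    "summary": str,
}

def check_handoff(payload: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """Return a list of contract violations; an empty list means the handoff is valid."""
    problems = []
    for key, expected_type in contract.items():
        if key not in payload:
            problems.append(f"missing field: {key}")
        elif not isinstance(payload[key], expected_type):
            problems.append(
                f"{key}: expected {expected_type.__name__}, "
                f"got {type(payload[key]).__name__}"
            )
    return problems

researcher_output = {"topic": "Q3 churn", "sources": "report.pdf, crm export"}
print(check_handoff(researcher_output, WRITER_INPUT_CONTRACT))
```

Logging these violations per agent pair also answers the debugging question in the table: the first failed handoff tells you which agent introduced the error.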

DEBUG: Agent Failure Analysis

Failure mode	How to detect	How to fix
Infinite loop	Step count exceeds max (e.g., 20 steps)	Hard step limit + loop detection in trajectory logger
Wrong tool selection	Trajectory shows agent used "delete" when it should have used "update"	Better tool descriptions, few-shot examples in prompt
Scope creep	Agent performs actions not requested by user	Explicit instruction: "Only perform actions the user specifically requested"
PII exposure	Agent passes customer data to external tool	PII filter on all tool inputs. Block tool calls containing PII patterns.
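
The PII filter in the last row can be sketched as a guard that runs before any external tool call. The two regex patterns below are illustrative assumptions, not a complete PII detector, and the tool names are hypothetical:

```python
import re
from typing import Any, Dict, List, Set

# Illustrative patterns only; real PII detection needs a broader, tested set.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(tool_input: Dict[str, Any]) -> List[str]:
    """Return the labels of PII patterns found in any input value."""
    text = " ".join(str(value) for value in tool_input.values())
    return [label for label, pattern in PII_PATTERNS.items() if pattern.search(text)]

def guard_tool_call(tool_name: str, tool_input: Dict[str, Any],
                    external_tools: Set[str]) -> None:
    """Raise before an external tool receives input that looks like PII."""
    if tool_name in external_tools:
        hits = find_pii(tool_input)
        if hits:
            raise PermissionError(f"blocked {tool_name}: input contains {hits}")

external = {"web_search"}  # hypothetical external tool
guard_tool_call("web_search", {"q": "refund policy for annual plans"}, external)
print(find_pii({"q": "look up jane.doe@example.com"}))
```

Blocked calls should land in the trajectory log so the team can see how often the agent attempts them.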

FRONTIER: Self-Improving Agents

Agents that learn from failures: The frontier is agents that review their own trajectories, identify failure patterns, and propose prompt improvements. After a sprint of agent testing, the agent itself writes a "retrospective" analyzing its failures and suggesting fixes. The Scrum Master reviews these suggestions alongside the team. This is meta-agility — the agent participates in its own improvement process.

Your AI agent passes all 200 test scenarios in the sandbox. The team wants to deploy to production. What is the critical step before production deployment?

Run adversarial testing: prompt injection attempts, malformed inputs, and scenarios designed to make the agent take unauthorized actions. Then deploy with rate limits, a kill switch, and human-in-the-loop for high-stakes actions (send email, modify data, charge money).
Deploy to 5% of users first
Get sign-off from the product manager
Write documentation
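
The three production controls named in the answer (rate limits, a kill switch, human approval for high-stakes actions) can live in one small wrapper around tool execution. A minimal sketch, assuming a single-process agent; the class name, the default limits, and the approver callback are hypothetical:

```python
import time
from typing import Callable, List, Optional, Set

class ActionGuard:
    """Gate every tool call behind a kill switch, a rate limit, and approval."""

    def __init__(self, high_stakes: Set[str], max_actions_per_minute: int = 10,
                 approver: Optional[Callable[[str, dict], bool]] = None,
                 clock: Callable[[], float] = time.monotonic):
        self.high_stakes = high_stakes
        self.max_per_minute = max_actions_per_minute
        self.approver = approver      # human-in-the-loop hook
        self.clock = clock            # injectable so tests can control time
        self.killed = False
        self._timestamps: List[float] = []

    def kill(self) -> None:
        """Kill switch: refuse all further actions."""
        self.killed = True

    def allow(self, tool_name: str, tool_input: dict) -> bool:
        if self.killed:
            return False
        now = self.clock()
        # Keep only actions from the last 60 seconds.
        self._timestamps = [t for t in self._timestamps if now - t < 60]
        if len(self._timestamps) >= self.max_per_minute:
            return False
        if tool_name in self.high_stakes:
            # High-stakes actions need an explicit yes; no approver means no.
            if self.approver is None or not self.approver(tool_name, tool_input):
                return False
        self._timestamps.append(now)
        return True

guard = ActionGuard(high_stakes={"send_email"}, max_actions_per_minute=5)
print(guard.allow("search", {"q": "pricing"}))
print(guard.allow("send_email", {"to": "customer@example.com"}))
```

With a guard like this in place, the 47-email loop from the start of the chapter stops at the rate limit, or at the first unapproved send.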

Chapter 10: Risk Management for AI

It's 2 AM. PagerDuty fires. Your production model's accuracy has dropped from 91% to 67% over the last 6 hours. Customer complaints are flooding in. The support team is escalating. You check the monitoring dashboard: the model itself hasn't changed, but the input distribution has. A viral social media post is driving a new type of question your model was never trained to handle. This is data drift, and it's one of the most common production failures in AI systems.

CONCEPT: The AI Risk Taxonomy

AI systems face risks that traditional software doesn't. Your sprint process must include explicit checkpoints for each category:

Risk category	Examples	Sprint checkpoint
Model regression	New model version performs worse on a subpopulation	Full eval suite before every model promotion
Data drift	Input distribution changes in production vs. training	Weekly distribution monitoring, alerting on drift metrics
Safety incidents	Toxic output, hallucinated facts, PII leakage	Safety eval gate before deployment + continuous monitoring
Bias detection	Model performs worse for certain demographics	Fairness eval: slice metrics by demographic category
Compliance	EU AI Act, GDPR data usage, industry regulations	Legal review gate before launch, quarterly compliance audit
Operational	GPU shortage, training failure, cost overrun	Compute budget tracking, cost alerts

Responsible AI is a sprint activity, not a one-time review. Don't save the safety review for the end. Embed responsible AI checkpoints throughout the sprint: bias checks during data preparation, safety evals during experimentation, fairness audits during promotion, and monitoring after deployment. Every sprint review should include a "responsible AI update."
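
The fairness audit mentioned above can be as simple as slicing one accuracy number by group. A minimal sketch, assuming each eval record is a (group, is_correct) pair and borrowing the 5% maximum gap that RISK-003 in the risk register uses; the function names are hypothetical:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def accuracy_by_slice(records: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy separately for each group label."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}

def fairness_gate(records: List[Tuple[str, bool]],
                  max_gap: float = 0.05) -> Dict[str, object]:
    """Block deployment when the best and worst slices differ by more than max_gap."""
    scores = accuracy_by_slice(records)
    gap = max(scores.values()) - min(scores.values())
    return {"scores": scores, "gap": gap, "block_deployment": gap > max_gap}

# Toy data: 10 English and 10 Spanish eval examples.
records = ([("en", True)] * 9 + [("en", False)]
           + [("es", True)] * 7 + [("es", False)] * 3)
print(fairness_gate(records))
```

Small slices produce noisy accuracy numbers, so report the sample size next to each slice before acting on a gap.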

DESIGN: The AI Risk Register

yaml
# ai_risk_register.yaml — Maintained by AI Scrum Master

risks:
  - id: RISK-001
    category: data_drift
    description: "Input distribution shifts as product usage changes"
    likelihood: high
    impact: high
    current_status: mitigated
    mitigation:
      - "Weekly drift detection (PSI on top 20 features)"
      - "Alert when PSI > 0.2 on any feature"
      - "Retrain pipeline: triggered manually, evaluated automatically"
    sprint_checkpoint: "Weekly drift report in standup (Monday)"
    owner: "ML Engineer (Sarah)"

  - id: RISK-002
    category: safety
    description: "LLM generates harmful content to vulnerable users"
    likelihood: medium
    impact: critical
    current_status: mitigated
    mitigation:
      - "Content safety classifier on all outputs (Llama Guard)"
      - "Block + log any output classified as harmful"
      - "Monthly adversarial red-team testing"
    sprint_checkpoint: "Safety metrics in every sprint review"
    owner: "AI Safety Lead (Marcus)"

  - id: RISK-003
    category: bias
    description: "Model performs worse for non-English speakers"
    likelihood: high
    impact: high
    current_status: monitoring
    mitigation:
      - "Eval suite includes multi-language test set"
      - "Accuracy sliced by language in every eval run"
      - "If gap > 5% between languages, block deployment"
    sprint_checkpoint: "Fairness metrics in eval dashboard"
    owner: "Data Scientist (Priya)"

  - id: RISK-004
    category: compliance
    description: "EU AI Act requires transparency for high-risk AI"
    likelihood: certain
    impact: medium
    current_status: in_progress
    mitigation:
      - "Model card documenting capabilities and limitations"
      - "Human-in-the-loop for high-stakes decisions"
      - "Audit trail: log all model inputs, outputs, and decisions"
    sprint_checkpoint: "Quarterly compliance review with legal"
    owner: "AI Scrum Master (You)"
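
To decide which register entries get sprint time first, you can turn the likelihood and impact labels into a score. A minimal sketch, assuming severity is likelihood times impact on 1-to-4 ordinal scales; the numeric mapping is an assumption, not part of the register itself:

```python
from typing import Dict, List

# Assumed ordinal scales for the labels used in the register.
LIKELIHOOD = {"low": 1, "medium": 2, "high": 3, "certain": 4}
IMPACT = {"low": 1, "medium": 2, "high": 3, "critical": 4}

def rank_risks(risks: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Attach a severity score to each risk and sort highest first."""
    scored = [
        {**risk, "severity": LIKELIHOOD[risk["likelihood"]] * IMPACT[risk["impact"]]}
        for risk in risks
    ]
    return sorted(scored, key=lambda risk: risk["severity"], reverse=True)

# Likelihood and impact values copied from the register above.
register = [
    {"id": "RISK-001", "likelihood": "high", "impact": "high"},
    {"id": "RISK-002", "likelihood": "medium", "impact": "critical"},
    {"id": "RISK-003", "likelihood": "high", "impact": "high"},
    {"id": "RISK-004", "likelihood": "certain", "impact": "medium"},
]
for risk in rank_risks(register):
    print(risk["id"], risk["severity"])
```

A multiplied score hides the difference between "likely but mild" and "rare but catastrophic", so keep the two labels visible next to it.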

CODE: Drift Detection Script

python
# drift_detector.py — Monitor input distribution changes
import numpy as np
from typing import Dict, List

def population_stability_index(expected: np.ndarray,
                                actual: np.ndarray,
                                bins: int = 10) -> float:
    """Calculate PSI between training and production distributions.
    PSI < 0.1: no significant change
    PSI 0.1-0.2: moderate change, investigate
    PSI > 0.2: significant change, retrain needed"""
    # Bin the distributions
    breakpoints = np.linspace(
        min(expected.min(), actual.min()),
        max(expected.max(), actual.max()),
        bins + 1
    )
    expected_pcts = np.histogram(expected, breakpoints)[0] / len(expected)
    actual_pcts = np.histogram(actual, breakpoints)[0] / len(actual)
    # Avoid division by zero
    expected_pcts = np.clip(expected_pcts, 0.001, None)
    actual_pcts = np.clip(actual_pcts, 0.001, None)
    # PSI formula
    psi = np.sum((actual_pcts - expected_pcts) *
                 np.log(actual_pcts / expected_pcts))
    return float(psi)

def check_drift(training_data: Dict[str, np.ndarray],
                production_data: Dict[str, np.ndarray],
                threshold: float = 0.2) -> Dict:
    """Check all features for drift. Returns alert if any exceed threshold."""
    results = {}
    alerts = []
    for feature in training_data:
        if feature not in production_data:
            continue
        psi = population_stability_index(
            training_data[feature], production_data[feature]
        )
        status = "OK" if psi < 0.1 else "WARN" if psi < threshold else "ALERT"
        results[feature] = {"psi": round(psi, 4), "status": status}
        if status == "ALERT":
            alerts.append(f"{feature}: PSI={psi:.3f}")
    return {"features": results, "alerts": alerts,
            "needs_retrain": len(alerts) > 0}

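To see how the PSI thresholds behave, it helps to work one example by hand. The sketch below applies the same formula to bin shares directly, without numpy; the four-bin shares are made-up illustration data, not measurements:

```python
import math
from typing import List

def psi_from_shares(expected: List[float], actual: List[float],
                    floor: float = 0.001) -> float:
    """PSI from per-bin shares; each list should sum to roughly 1."""
    total = 0.0
    for e, a in zip(expected, actual):
        # Floor tiny shares, mirroring the clip step in the script above.
        e, a = max(e, floor), max(a, floor)
        total += (a - e) * math.log(a / e)
    return total

# Made-up example: training is uniform over four bins, production skews high.
training_shares = [0.25, 0.25, 0.25, 0.25]
production_shares = [0.10, 0.20, 0.30, 0.40]

psi = psi_from_shares(training_shares, production_shares)
print(f"PSI = {psi:.3f}")
```

This shift comes out at a PSI of about 0.23, just over the 0.2 alert line. Most of the score comes from the first and last bins, where the share moved by 15 points.
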
DEBUG: Incident Response Playbook

Incident type	Detection	Immediate action	Root cause
Accuracy drop > 10%	Monitoring alert	Rollback to previous model	Check for data drift, eval set contamination, infra issue
Toxic output reported	Content filter log, customer report	Add input to block list, escalate to safety team	Adversarial input, training data contamination, filter gap
PII in model output	PII scanner on outputs	Kill the response, notify affected user, log for compliance	PII in training data (data pipeline failure)
Agent performs unauthorized action	Trajectory logger, audit trail	Disable agent, review all recent actions	Prompt injection, missing guardrails, tool permission error

FRONTIER: Automated Red-Teaming

LLM-powered adversarial testing: Use one LLM to attack another. The "red team" LLM generates adversarial prompts designed to make your production model fail (produce toxic content, hallucinate, leak data). Run this automatically every sprint. Tools: Giskard and Microsoft Counterfit for running the attacks; NVIDIA NeMo Guardrails for the defensive guardrails those attacks should try to break. This is becoming a standard sprint activity for responsible AI teams.

Your production model's accuracy dropped from 91% to 78% overnight. The model weights haven't changed. What is the most likely cause and first diagnostic step?

The eval suite is broken
Data drift: the input distribution has changed. The first diagnostic step is to compare today's input distribution against the training distribution using PSI or a similar metric. Check for new types of inputs the model wasn't trained on.
Someone deployed a new model without telling you
The GPU is overloaded

Chapter 11: Interactive AI Sprint Board

Everything we've discussed comes together here. This is a living Kanban board designed for AI teams. It has the columns we defined in Chapter 2: Hypothesis, Experiment, Evaluation, Production, and Killed. But it also simulates the chaos of real AI sprints — blockers appear, experiments fail, stakeholders change scope, and GPUs run out.

This simulation teaches the core skill of an AI Scrum Master: reading the board, identifying bottlenecks, and making decisions under uncertainty. Watch where cards pile up. That's your bottleneck. Watch what happens when you inject a blocker. That's your rehearsal for the real thing.

How to Use the Simulation

Control	What it does	What to watch
Advance Sprint	Moves time forward. Cards progress through columns based on probability.	Watch how experiments flow. Most should end up in "Killed" — that's healthy.
Add Experiment	Adds a new hypothesis card to the board.	Watch if the board gets overloaded. Too many active experiments = WIP limit exceeded.
Data Quality Issue	Injects a data blocker. Experiments in "Experiment" stage stall.	Watch the cascading effect: blocked experiments push back the entire sprint.
GPU Shortage	Injects a compute blocker. Only 1 experiment can run at a time.	Watch how the queue backs up. This is why compute planning matters.
Eval Regression	A promoted model fails regression testing. Bounces back to "Evaluation."	Watch the cost of late-stage failure. All the downstream work is wasted.
Scope Change	Stakeholder adds new requirements mid-sprint.	Watch how scope creep disrupts the experiment pipeline.

AI Sprint Board Simulation

A Kanban board for AI teams. Cards are experiments flowing through stages. Inject blockers to simulate real-world disruptions. Watch how the sprint adapts.

The Architecture Behind the Simulation

Every column in the simulation maps to a real workflow stage. Here's the production sprint board configuration:

yaml
# ai_kanban_config.yaml

columns:
  hypothesis:
    wip_limit: 5
    card_fields: [hypothesis, success_criteria, compute_budget]
    exit_gate: "Dataset ready, baseline measured"

  experiment:
    wip_limit: 3    # Limited by GPU availability
    card_fields: [training_status, current_metric, compute_used]
    exit_gate: "Training complete, results logged"
    blockers: [data_quality, gpu_shortage, training_divergence]

  evaluation:
    wip_limit: 4
    card_fields: [eval_results, regression_check, safety_check]
    exit_gate: "Kill/iterate/promote decision made + documented"

  production:
    wip_limit: 2    # Don't deploy too many models at once
    card_fields: [deploy_status, monitoring_status, rollback_tested]
    exit_gate: "Model serving in production, monitoring active"

  killed:
    wip_limit: null    # No limit (a bare "none" would parse as a string)
    card_fields: [final_metric, kill_reason, learnings]
    exit_gate: null    # Terminal state

# Sprint metrics derived from board state:
metrics:
  throughput: "Experiments completed (killed + promoted) per sprint"
  cycle_time: "Average days from Hypothesis to Decision"
  kill_rate: "% of experiments killed (healthy: 60-80%)"
  promotion_rate: "% of experiments promoted (healthy: 10-30%)"
  blocker_frequency: "Blockers injected per sprint (track over time)"
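
The metrics in this config can be computed straight from the board state. A minimal sketch, assuming each card records its current column and, once decided, the days it took to reach a decision. The Card class is hypothetical, and kill and promotion rates are computed over decided experiments only, which is one possible reading of the definitions above:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Card:
    name: str
    column: str                       # hypothesis | experiment | evaluation | production | killed
    days_to_decision: Optional[float] = None

# WIP limits copied from the config above ("killed" has no limit).
WIP_LIMITS = {"hypothesis": 5, "experiment": 3, "evaluation": 4, "production": 2}

def sprint_metrics(cards: List[Card]) -> Dict[str, object]:
    """Derive throughput, kill rate, promotion rate, and cycle time."""
    decided = [c for c in cards if c.column in ("killed", "production")]
    killed = sum(c.column == "killed" for c in decided)
    durations = [c.days_to_decision for c in decided if c.days_to_decision is not None]
    return {
        "throughput": len(decided),
        "kill_rate": killed / len(decided) if decided else 0.0,
        "promotion_rate": (len(decided) - killed) / len(decided) if decided else 0.0,
        "cycle_time": sum(durations) / len(durations) if durations else None,
    }

def wip_violations(cards: List[Card]) -> List[str]:
    """List the columns holding more cards than their WIP limit."""
    counts: Dict[str, int] = {}
    for card in cards:
        counts[card.column] = counts.get(card.column, 0) + 1
    return [col for col, limit in WIP_LIMITS.items() if counts.get(col, 0) > limit]

board = (
    [Card(f"exp-k{i}", "killed", days) for i, days in enumerate((4, 6, 8))]
    + [Card("exp-p0", "production", 10)]
    + [Card(f"exp-r{i}", "experiment") for i in range(4)]
)
print(sprint_metrics(board))
print(wip_violations(board))
```

On this toy board the kill rate is 75%, inside the healthy band, while the experiment column is one card over its limit, which is the bottleneck to raise in standup.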

Interview Whiteboard Version

In an interview, you have 5 minutes to draw this board on a whiteboard. Here are the key talking points:

Hypothesis → Experiment

Gate: data ready + baseline measured. WIP limit: 3 active experiments (GPU constraint). Key metric: cycle time.

↓

Experiment → Evaluation

Gate: training complete + results logged. Automated eval suite runs on completion. Blockers: data quality, GPU shortage.

↓

Evaluation → Kill / Iterate / Promote

Decision gate: meets criteria? Healthy kill rate: 60-80%. Every kill is documented with learnings.

↓

Production

Deployment with canary rollout. Monitoring active. Rollback tested. Model card written.

Whiteboard tips: (1) Draw the columns left-to-right with WIP limits. (2) Show the feedback loop: an "iterate" decision sends a card from Evaluation back to Experiment. (3) Emphasize the "Killed" column: it's a feature, not a failure. (4) Talk about blockers: where they hit, how you detect them, how you unblock. (5) End with metrics: throughput, cycle time, kill rate.

Chapter 12: Interview Arsenal

This chapter distills everything into a cheat sheet you can review in the 30 minutes before your interview. Every section maps to a common interview question type for AI Scrum Master / Technical Program Manager roles.

Scenario Questions

Scenario	Key points to cover	Chapter
"Your AI team hasn't shipped anything in 3 sprints"	Diagnose: Are experiments running but failing? (Healthy.) Are experiments not starting? (Blocker.) Is the team afraid to kill experiments? (Process.) Switch from outcome-based to learning-based velocity.	1, 2
"The researcher says it will work, the engineer says it won't scale"	This is the research-to-prod gap. Run a production readiness checklist. Time-box the scalability investigation to 1 sprint. If it can't scale, it can't ship.	6
"Stakeholders want to launch the chatbot but accuracy is only 85%"	Frame the decision: what does 15% wrong look like? Show concrete failure examples. Quantify the business cost of errors vs. the cost of delay. Propose: launch with human-in-the-loop for low-confidence answers.	3, 7
"Design the sprint process for a new GenAI feature"	Hypothesis-driven tickets, eval-driven acceptance, prompt version control, RAG pipeline as experiment axis, safety review gate before deployment.	2, 8
"How do you manage risk for an AI agent in production?"	Trajectory logging, adversarial testing, rate limits, kill switch, human-in-the-loop for high-stakes actions, continuous monitoring of agent behavior.	9, 10

Common Interview Formats

Format	Duration	What they test	How to prepare
Case study	45-60 min	Given a scenario, design the sprint process	Practice with the scenarios above. Draw boards.
Behavioral	30-45 min	"Tell me about a time you managed an AI project"	STAR method: Situation, Task, Action, Result. Quantify results.
Technical	30-45 min	"Explain MLOps, eval pipelines, data versioning"	Review Chapters 3-6. Know the tools: W&B, MLflow, DVC.
Stakeholder sim	30 min	Interviewer plays a frustrated PM or confused exec	Practice the translation framework from Chapter 7.
Whiteboard	30-45 min	Draw the AI sprint board, experiment lifecycle	Practice drawing Chapter 11's board in 5 minutes.

Resource	Type	Why it matters
PSM I / PSM II (Scrum.org)	Certification	Baseline Scrum knowledge. Employers expect it.
SAFe Agilist	Certification	Enterprise-scale agile. Useful for large AI organizations.
"Accelerate" (Forsgren et al.)	Book	DORA metrics, deployment frequency, lead time. Apply to ML.
"Designing Machine Learning Systems" (Huyen)	Book	The best MLOps book. Covers the full production lifecycle.
"AI Engineering" (Huyen)	Book	Practical guide to GenAI systems. Eval, RAG, prompt engineering.
MLOps Community	Community	Slack + meetups. Stay current on tooling and practices.
Google "Rules of ML"	Guide	Martin Zinkevich's 43 rules. Timeless wisdom for ML projects.

Cheat Sheet: The Five Dimensions

Interview Dimension Explorer

Click each dimension to see the key topics you should be able to discuss in an interview. This is your study guide.

Your 30-Second Elevator Pitch

When asked "Why should we hire you as an AI Scrum Master?", here's the structure:

text
"I've managed AI teams where 80% of experiments fail — and that's healthy.
My approach differs from traditional Scrum in three ways:

1. HYPOTHESIS-DRIVEN tickets instead of user stories. Each sprint,
   we commit to running N experiments, not shipping N features.
   Success is measured in learning velocity, not story points.

2. EVAL-DRIVEN acceptance. Every model change runs through an
   automated eval suite with statistical significance testing.
   We don't ship on vibes — we ship on data.

3. STAKEHOLDER TRANSLATION. I convert 'the model is 87% accurate'
   into '13 out of 100 customers get wrong answers, costing us
   $X per week.' Leadership makes decisions on business impact,
   not ML metrics.

I also embed responsible AI throughout the sprint — bias checks,
safety evals, drift monitoring — not as an afterthought but as
standard sprint activities."
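
If an interviewer asks what "statistical significance testing" means in point 2 of the pitch, a two-proportion z-test is a reasonable first answer. A minimal sketch, assuming accuracy is measured as correct answers out of independent eval examples; the sample sizes below are made up for illustration:

```python
import math
from typing import Dict

def two_proportion_z_test(correct_a: int, total_a: int,
                          correct_b: int, total_b: int) -> Dict[str, float]:
    """Two-sided z-test for a difference in accuracy between two eval runs."""
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    pooled = (correct_a + correct_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return {"z": 0.0, "p_value": 1.0}
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"z": z, "p_value": p_value}

# Made-up sample: baseline gets 78 of 100 right, the candidate 91 of 100.
result = two_proportion_z_test(78, 100, 91, 100)
print(f"z = {result['z']:.2f}, p = {result['p_value']:.3f}")
```

With only 100 examples per run, a 13-point gain clears the usual 0.05 threshold, but a 3-point gain would not. That is the practical argument for investing in a larger eval set.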

In an interview, you're asked: "What metric do you use to measure AI team velocity?" What is the best answer?

Story points completed per sprint
Experiments completed per sprint (with kill/promote decisions documented), because this measures the team's learning velocity and decision-making quality, not just output volume. A sprint that kills 4 experiments with clear learnings is more valuable than one that finishes 0 experiments with no decisions.
Lines of code per developer
Number of models deployed