
AI/GenAI Scrum Master

Interview prep for leading AI teams: experiment-driven sprints, eval pipelines, stakeholder communication, and shipping models to production.

Prerequisites: Scrum/Agile experience + Basic ML understanding. That's it.

13 Chapters | 13+ Simulations | 5 Interview Dimensions

Chapter 0: What an AI Scrum Master IS

It is Monday morning. Your ML team's two-week sprint just ended. The results: one experiment hit target accuracy but needs 4x the GPU budget to serve in production. Another experiment failed completely — the model overfit to training data and generalizes poorly. A third experiment is "promising" but the researcher wants two more weeks to try a different architecture. The product manager is asking why the chatbot feature isn't ready for the demo on Friday. The VP wants a "timeline to ship."

You are the person who makes sense of this chaos. Not by writing code. Not by training models. But by creating the environment where a team of brilliant, sometimes chaotic researchers and engineers can do their best work — and ship it.

This is not traditional Scrum. Traditional Scrum assumes you can break work into user stories, estimate them in story points, and deliver a predictable increment every two weeks. AI projects violate every one of those assumptions. Experiments fail 80% of the time. "Done" is a spectrum of accuracy, not a checkbox. Training runs take days, not hours. And the most important work often looks like someone staring at a loss curve for three hours.

The AI Scrum Master is a new kind of role that sits at the intersection of technical program management, research facilitation, and organizational translation. You don't need to be able to write a transformer from scratch. But you need to understand enough to know when an experiment is stuck, when a researcher is chasing a dead end, and when a "90% accurate" model is actually terrible for the use case.

| Responsibility | What you own | Daily intersection |
|---|---|---|
| Sprint Facilitation | Experiment-based planning, hypothesis-driven tickets, adaptive velocity | Every standup, every planning session, every retro |
| Experiment Governance | Kill/continue decisions, resource allocation, experiment tracking visibility | Researchers need GPUs, stakeholders need progress reports |
| Data Pipeline Coordination | Data readiness, annotation quality, labeling vendor management | Data is the #1 blocker, and you unblock it |
| Research-to-Prod Bridge | Handoff checklists, technical debt tracking, ML engineering coordination | The notebook-to-production gap kills more AI projects than bad models |
| Stakeholder Translation | Uncertainty communication, timeline management, expectation setting | Executives don't understand why "the model isn't ready yet" |

The experiment-driven vs feature-driven tension. In traditional software, you plan features and ship them. In AI, you plan experiments and learn from them. An experiment that proves an approach doesn't work is a SUCCESS: it saved you months of building the wrong thing. Your job is to create a system where the team can fail fast, learn fast, and converge on what works. The moment you treat AI work like feature work, you lose.

A Day in the Life

Here's what a typical Wednesday looks like for an AI Scrum Master at a GenAI startup:

| Time | Activity | Skills used |
|---|---|---|
| 9:00 AM | Check overnight training runs: one converged, one diverged (NaN loss at epoch 47) | Experiment tracking, debugging |
| 9:30 AM | Standup: researcher reports eval regression after data refresh, ML engineer blocked on GPU quota | Sprint facilitation, resource management |
| 10:00 AM | Triage the eval regression: new labeling batch has quality issues, escalate to vendor | Data pipeline management |
| 11:00 AM | Sprint planning prep: write hypothesis-driven tickets for next sprint's RAG experiments | Experiment-based planning |
| 12:00 PM | Stakeholder sync: explain why the chatbot needs another sprint (it hallucinates on edge cases) | Stakeholder communication |
| 2:00 PM | Review handoff checklist: model is ready for production but needs latency optimization | Research-to-prod bridge |
| 3:00 PM | Safety review: new agent can now execute code and needs guardrails before deployment | Risk management, responsible AI |
| 4:00 PM | Retro: discuss why the last sprint's velocity was half the forecast (GPU shortage + scope creep) | Process improvement |

The Experiment Funnel

The steps below show the flow of work in an AI team. Unlike a traditional software pipeline, where features flow left to right with predictable completion, experiments flow through a funnel where most are killed. This is healthy.

1. Hypothesis Formation
Researcher proposes: "Fine-tuning Llama-3 on our domain data will improve accuracy from 72% to 85% on our eval set." This is the ticket. Not "build chatbot."
2. Experiment Design
Define: dataset, model, hyperparameters, eval metrics, success threshold, compute budget, time box. All BEFORE training starts.
3. Execution
Training runs, prompt engineering iterations, RAG pipeline experiments. Daily check-ins on loss curves and early metrics.
4. Evaluation
Run eval suite. Compare to baseline. Check for regressions on existing capabilities. Document results in experiment log.
5. Decision Gate
KILL (didn't meet threshold), ITERATE (promising, needs refinement), or PROMOTE (ready for production pipeline).
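
The decision gate can be sketched as a small function. This is a minimal illustration rather than a prescribed policy: the `GateInput` fields, the separate `kill_threshold`, and the rule that an exceeded time box forces a kill are all assumptions made for the example. The 0.85 and 0.78 values echo the sample experiment ticket later in this lesson.

```python
from dataclasses import dataclass


@dataclass
class GateInput:
    metric: float              # best eval metric achieved, e.g. 0.81
    success_threshold: float   # promote at or above this value
    kill_threshold: float      # kill below this value (assumed to be set per ticket)
    time_box_exceeded: bool = False


def decision_gate(g: GateInput) -> str:
    """Return 'PROMOTE', 'ITERATE', or 'KILL' for one experiment."""
    if g.metric >= g.success_threshold:
        return "PROMOTE"
    # Assumption: a blown time box forces a kill even if the result looks promising.
    if g.metric < g.kill_threshold or g.time_box_exceeded:
        return "KILL"
    return "ITERATE"


print(decision_gate(GateInput(metric=0.80, success_threshold=0.85, kill_threshold=0.78)))
```

Writing the thresholds down before training starts is the point: the gate then applies numbers the team already agreed to, instead of negotiating them after the results are in.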

Interview Dimensions

Staff-level interviews test you across five dimensions. Each chapter in this lesson maps to one or more:

| Dimension | What they ask | Chapters |
|---|---|---|
| CONCEPT | "Explain why story points don't work for ML research" | All |
| DESIGN | "Design a sprint process for a team building an LLM-powered product" | 0, 1, 2, 8, 11 |
| CODE | "Show me the Jira board / experiment tracker / eval dashboard you'd set up" | 2, 4, 5, 6, 8 |
| DEBUG | "Your team hasn't shipped anything in 3 sprints. Diagnose the problem." | 3, 7, 10 |
| FRONTIER | "How will AI-assisted project management change this role?" | All |

Your AI team just completed a two-week sprint. Two out of three experiments failed to meet their accuracy thresholds. The product manager is frustrated because "nothing shipped." What is the correct framing of this sprint's outcome?

Chapter 1: AI Project Lifecycle

A product manager shows you a roadmap: "Q1: build the model. Q2: integrate it. Q3: launch." You know immediately this will fail. AI projects don't work in sequential phases. The lifecycle is a loop, not a line — and the most important skill is knowing when to exit the loop.

The AI project lifecycle looks like this: research, prototype, evaluate, iterate, and — only when evaluation meets production criteria — ship. But here's what makes it different from traditional software: you might loop through research-prototype-evaluate fifteen times before anything is production-ready. And three of those loops might end in "this approach fundamentally doesn't work, start over."

The Explore/Exploit Tradeoff

Every sprint, you face a decision borrowed from reinforcement learning: explore (try new approaches, architectures, datasets) or exploit (optimize what's already working). Early in a project, you should spend most of your capacity exploring (roughly 80-90% explore). As you approach a deadline, the ratio flips (roughly 5-20% explore).

The Scrum Master's job is to manage this ratio explicitly. If the team is exploring too late, they'll never ship. If they're exploiting too early, they'll ship a mediocre model because they never found the right approach.

| Phase | Explore:Exploit | Sprint focus | Your role |
|---|---|---|---|
| Discovery (0-4 weeks) | 90:10 | Wide search: try 5 approaches, kill 4 | Protect research time, resist pressure to "pick one and go" |
| Convergence (4-8 weeks) | 50:50 | 2-3 approaches competing, deeper experiments | Set decision gates, track eval metrics, prepare kill criteria |
| Optimization (8-12 weeks) | 10:90 | One approach, hyperparameter tuning, edge cases | Track diminishing returns, push for production readiness |
| Productionization (12-16 weeks) | 5:95 | Latency, reliability, monitoring, deployment | Coordinate ML eng + infra, handoff checklists, launch planning |

When to kill an experiment. This is the hardest decision in AI project management. Kill too early and you miss breakthroughs. Kill too late and you waste months. Use three criteria: (1) Has it consumed more than 2x its budgeted compute? Kill it. (2) Has it been stuck at the same accuracy for 3+ training runs with different hyperparameters? Kill it. (3) Is the gap between current and target accuracy larger than what any known technique could close? Kill it. Document every kill decision: it's organizational learning.
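
The explore/exploit schedule in the phase table above can be expressed as a simple lookup, which makes the expected ratio for any given week explicit. This is a sketch under assumptions: `PHASES` and `recommended_split` are illustrative names, and the week boundaries are treated as half-open ranges (week 4 counts as Convergence) because the table's ranges overlap at the edges.

```python
from typing import List, Tuple

# (phase name, start week, end week, explore %) copied from the phase table.
PHASES: List[Tuple[str, int, int, int]] = [
    ("Discovery", 0, 4, 90),
    ("Convergence", 4, 8, 50),
    ("Optimization", 8, 12, 10),
    ("Productionization", 12, 16, 5),
]


def recommended_split(week: float) -> Tuple[str, int, int]:
    """Return (phase, explore %, exploit %) for a project week."""
    if week < 0:
        raise ValueError("week must be non-negative")
    for name, start, end, explore in PHASES:
        if start <= week < end:
            return name, explore, 100 - explore
    # Assumption: past week 16 the final phase's split still applies.
    name, _, _, explore = PHASES[-1]
    return name, explore, 100 - explore


print(recommended_split(10))  # week 10 falls in the Optimization phase
```

Comparing this recommended split with how the team actually spent the sprint gives a concrete talking point for the retro.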

CONCEPT: Why Waterfall Fails for AI

Traditional waterfall assumes you can specify requirements upfront, build to those requirements, and test against them. AI violates all three assumptions:

yaml
# Traditional software: requirements are deterministic
requirement: "When user clicks 'Submit', save form data to database"
test: "Assert DB contains the submitted data after click"
estimate: "3 story points (1-2 days)"

# AI project: requirements are probabilistic
requirement: "Chatbot answers customer questions accurately"
test: "What does 'accurately' mean? 90%? 95%? On which questions?
       Measured how? By whom? What's the baseline?"
estimate: "Unknown. Could be 2 weeks or 6 months depending on
           data quality, model choice, and what 'accurate' means."

DESIGN: The AI Project Canvas

Before any sprint planning, create an AI Project Canvas — a one-page document that aligns the team on what success looks like:

markdown
# AI Project Canvas: Customer Support Chatbot

## Business Objective
Reduce ticket volume by 30% by deflecting common questions to AI.

## Success Metrics (ordered by priority)
1. Deflection rate: % of conversations resolved without human handoff
2. Customer satisfaction: CSAT score >= 4.2/5 on AI-handled conversations
3. Accuracy: Correct answer rate >= 92% on eval set (500 questions)
4. Latency: p95 response time < 2 seconds

## Data Availability
- 50K historical support tickets (labeled by category, resolution)
- 2K manually curated Q&A pairs for eval
- Real-time access to product docs (RAG source)
- GAP: No labeled "bad answer" examples for safety eval

## Technical Constraints
- Budget: $5K/month inference cost (rules out GPT-4 at scale)
- Latency: Must respond in < 2s (rules out chain-of-thought with 5 LLM calls)
- Privacy: No customer PII sent to external APIs (rules out OpenAI for EU customers)
- Infra: Kubernetes cluster with 4x A100 GPUs available

## Known Risks
- Hallucination on product-specific questions (mitigation: RAG + citations)
- Data drift as product changes (mitigation: weekly eval re-runs)
- Regulatory review needed before launch (blocker: 3-week legal review)

## Time Box
- Discovery: 3 sprints (6 weeks)
- Ship MVP: Sprint 5 (week 10)
- Kill criteria: If accuracy < 80% after sprint 3, pivot approach

CODE: Project Lifecycle Tracker

python
# ai_lifecycle_tracker.py — Track project phase and explore/exploit ratio
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional

class Phase(Enum):
    DISCOVERY = "discovery"
    CONVERGENCE = "convergence"
    OPTIMIZATION = "optimization"
    PRODUCTION = "production"

@dataclass
class Experiment:
    hypothesis: str
    status: str = "active"        # active | killed | promoted
    start_date: Optional[datetime] = None  # unset until the experiment starts
    compute_budget_hrs: float = 0
    compute_used_hrs: float = 0
    best_metric: float = 0.0
    metric_history: List[float] = field(default_factory=list)

    def should_kill(self, target_metric: float) -> tuple[bool, str]:
        # Rule 1: Over budget
        if self.compute_used_hrs > 2 * self.compute_budget_hrs:
            return True, "2x compute budget exceeded"
        # Rule 2: Stuck — last 3 runs within 1% of each other
        if len(self.metric_history) >= 3:
            recent = self.metric_history[-3:]
            if max(recent) - min(recent) < 0.01:
                return True, "Metric plateau (3 runs within 1%)"
        # Rule 3: Gap too large
        if self.best_metric > 0 and (target_metric - self.best_metric) > 0.15:
            return True, f"Gap to target ({target_metric - self.best_metric:.1%}) too large"
        return False, "Continue"

DEBUG: When the Lifecycle Stalls

Signs your project is stuck in the wrong phase:

| Symptom | Likely cause | Fix |
|---|---|---|
| Still exploring after 8 weeks | No kill criteria defined, so the team can't commit | Set a hard decision gate: "By sprint 5, we pick one approach" |
| Optimizing a 75% model to 77% | Wrong phase: should still be exploring | Step back. Is 77% good enough? If not, a different architecture might get 90% |
| Never shipping "because we can improve it" | Perfectionism / fear of production failures | Ship with guardrails. A 90% model with good fallbacks beats a 95% model that never ships |
| Researcher working alone for 3 weeks | No visibility into experiment progress | Daily experiment stand-ups. "What did you try? What did you learn? What's next?" |

FRONTIER: AI-Assisted Lifecycle Management

The frontier is using AI to manage AI projects. Tools emerging now:

Auto-experiment scheduling: Platforms such as Determined AI and Google's Vertex AI can run hyperparameter searches and track the results, which makes it easier to surface the runs with the best trade-off between accuracy and compute cost. The Scrum Master reviews the dashboard instead of asking researchers for updates.
LLM-powered retrospectives: Feed your sprint's experiment logs, Slack discussions, and PR descriptions into an LLM. Ask it: "What patterns do you see in our failed experiments? What should we try next?" The LLM becomes a research advisor that remembers everything.
Your team is in week 10 of a 16-week AI project. The best model so far achieves 84% accuracy against a 90% target. The researcher proposes trying a completely different architecture (a vision transformer instead of a CNN). What should you do?

Chapter 2: Sprint Planning for AI Teams

The team lead says: "This story is 5 points." You ask what the story is. "Fine-tune the model on the new dataset." You ask: how long will training take? "Depends on the data." What accuracy do you expect? "We'll know when we try." Will it work? "Maybe." This is not estimable in story points. And pretending it is creates false confidence that wrecks your sprint velocity and demoralizes the team.

Story points don't work for research. Story points assume you can estimate relative complexity of tasks that you've done before. But in AI, most experiments are novel. You've never fine-tuned this model on this data with these hyperparameters before. The uncertainty is fundamental, not an estimation failure.

CONCEPT: Hypothesis-Driven Tickets

Replace user stories with experiment tickets. Each ticket is a testable hypothesis with a clear accept/reject criterion:

markdown
# Traditional User Story (DOESN'T WORK for AI)
## Story: Improve chatbot accuracy
As a customer, I want the chatbot to answer my questions correctly
so that I don't have to wait for a human agent.
Acceptance: Chatbot answers questions correctly.
Points: 8

# Hypothesis-Driven Experiment Ticket (USE THIS)
## EXP-042: Fine-tune Llama-3-8B on support corpus
**Hypothesis:** Fine-tuning Llama-3-8B on our 50K support ticket
corpus will improve answer accuracy from 72% (baseline: zero-shot)
to >= 85% on the 500-question eval set.

**Method:**
- Dataset: 50K tickets, 80/10/10 train/val/test split
- Model: Llama-3-8B, QLoRA (rank 16, alpha 32)
- Training: 3 epochs, lr=2e-4, batch_size=4, gradient_accumulation=8
- Eval: accuracy, F1, latency on test set

**Compute budget:** 24 GPU-hours (1x A100, ~1 day)
**Time box:** 1 sprint (2 weeks including eval and documentation)

**Success criteria:**
- >= 85% accuracy on eval set (primary)
- Latency < 500ms per response (secondary)
- No regression on safety eval (blocking)

**Kill criteria:**
- Accuracy < 78% after full training = KILL
- Training loss doesn't decrease after epoch 1 = KILL (data issue)

**Outcome:** [TO BE FILLED AFTER EXPERIMENT]
"We don't know if this will work" is a valid estimate. The honesty is the value. When a researcher says "I think there's a 40% chance this works," that's actionable information. You can plan around it: run two experiments in parallel, have a fallback, or adjust stakeholder expectations. The worst thing you can do is force a researcher to commit to an outcome they can't guarantee.

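A researcher's probability estimate turns directly into planning arithmetic. The sketch below shows how running experiments in parallel changes the odds that at least one succeeds. It assumes the experiments succeed or fail independently, which is optimistic when they share a dataset or a baseline; the function name is illustrative.

```python
def p_at_least_one_success(p_success: float, n_parallel: int) -> float:
    """P(at least one of n independent experiments succeeds) = 1 - (1 - p)^n."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be between 0 and 1")
    if n_parallel < 0:
        raise ValueError("n_parallel must be non-negative")
    return 1.0 - (1.0 - p_success) ** n_parallel


# The researcher's "40% chance this works", with one, two, and three parallel bets.
for n in (1, 2, 3):
    print(n, round(p_at_least_one_success(0.4, n), 3))
```

Under the independence assumption, two parallel 40% experiments give a 64% chance that at least one works, and three give about 78%. That is the kind of number a stakeholder can plan around.
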
DESIGN: The AI Sprint Board

A traditional sprint board has: To Do, In Progress, Done. For AI teams, you need more columns that reflect the experiment lifecycle:

| Column | What lives here | Exit criteria |
|---|---|---|
| Backlog | Hypotheses not yet prioritized | Team agrees to run it this sprint |
| Hypothesis | Experiment designed, compute budget approved | Dataset ready, baseline measured |
| Experiment | Training/running in progress | Run complete, results logged |
| Evaluation | Results being analyzed against success criteria | Kill/iterate/promote decision made |
| Iteration | Promising experiment being refined | Meets success criteria or killed |
| Production | Model promoted to production pipeline | Deployed, monitored, signed off |
| Killed | Experiments that didn't meet threshold | Documented with learnings |

CODE: Jira / Linear Configuration

yaml
# jira_ai_sprint_config.yaml
# Custom issue types for AI teams

issue_types:
  - name: Experiment
    icon: flask
    fields:
      - hypothesis: text          # What we're testing
      - method: text              # How we'll test it
      - baseline_metric: number   # Current best
      - target_metric: number     # What we need
      - compute_budget: text      # "24 GPU-hours"
      - time_box: text            # "1 sprint"
      - kill_criteria: text       # When to stop
      - outcome: select           # killed | promoted | iterating
      - final_metric: number      # What we achieved
      - learnings: text           # What we learned (REQUIRED on close)
      - wandb_link: url           # Link to experiment tracking

  - name: Data Task
    icon: database
    fields:
      - data_source: text
      - volume: text              # "10K examples"
      - quality_gate: text        # "Inter-annotator agreement > 0.8"
      - blocking_experiments: link # Which experiments need this data

  - name: ML Engineering
    icon: gear
    fields:
      - type: select              # infra | optimization | deployment
      - model_artifact: text      # Which model version
      - latency_target: text
      - cost_target: text

workflow:
  Experiment:
    - Backlog -> Hypothesis       # Prioritized in sprint planning
    - Hypothesis -> Experiment    # Dataset ready, baseline set
    - Experiment -> Evaluation    # Training complete
    - Evaluation -> Killed        # Didn't meet criteria
    - Evaluation -> Iteration     # Promising, needs refinement
    - Evaluation -> Production    # Meets all criteria
    - Iteration -> Evaluation     # Re-evaluate after refinement

# Sprint velocity is measured in EXPERIMENTS COMPLETED, not story points.
# A "completed" experiment is one with a kill/promote decision + documented learnings.
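
Experiment throughput, as defined in the closing comments of the config above, is simple to compute from exported tickets. The sketch below assumes each ticket is a dictionary with `outcome` and `learnings` keys that mirror the custom fields; the export format itself is hypothetical.

```python
from typing import Dict, List


def experiment_throughput(experiments: List[Dict]) -> int:
    """Count experiments with a kill/promote decision AND documented learnings."""
    decided = {"killed", "promoted"}
    return sum(
        1
        for e in experiments
        if e.get("outcome") in decided and (e.get("learnings") or "").strip()
    )


sprint = [
    {"outcome": "killed", "learnings": "Loss diverged: learning rate too high"},
    {"outcome": "promoted", "learnings": "QLoRA run met the accuracy target"},
    {"outcome": "iterating", "learnings": "Promising, needs more data"},
    {"outcome": "killed", "learnings": ""},  # no write-up, so it does not count
]
print(experiment_throughput(sprint))
```

An experiment that was killed without a write-up is deliberately excluded, which nudges the team to document every decision.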

DESIGN: Sprint Ceremonies Adapted for AI

| Ceremony | Traditional | AI Adaptation |
|---|---|---|
| Sprint Planning | Estimate stories, commit to scope | Prioritize experiments by expected information gain. Commit to running N experiments, not to outcomes. |
| Daily Standup | "What did you do? What will you do? Blockers?" | "What did you learn? What's your next experiment? Are you stuck on data/compute/clarity?" |
| Sprint Review | Demo features to stakeholders | Share experiment results. Show eval dashboards. Explain what we learned, not just what we built. |
| Retrospective | Process improvement | Process improvement + EXPERIMENT REVIEW: Which kills were good calls? Which should we have killed earlier? Are our hypotheses getting better? |

DEBUG: When Sprint Planning Breaks

| Symptom | Root cause | Fix |
|---|---|---|
| Velocity is wildly inconsistent | Measuring in story points on uncertain work | Switch to experiment throughput: completed experiments per sprint |
| Team always "almost done" | No time boxes on experiments | Every experiment gets a hard time box. At the deadline: kill, iterate, or promote. |
| Planning takes 4 hours | Trying to fully specify experiments upfront | Specify hypothesis + success criteria only. Method details emerge during execution. |
| Researchers skip planning | They see it as bureaucracy that slows research | Make planning about THEIR priorities. "What do YOU want to try? What do you need from the team?" |

FRONTIER: AI-Assisted Sprint Planning

LLM-powered ticket generation: Feed your experiment log to an LLM: "Given these 20 completed experiments and their outcomes, suggest the top 5 most promising next experiments ranked by expected information gain." The LLM can spot patterns humans miss — like the fact that all successful experiments used a learning rate below 1e-4.
A researcher estimates an experiment at "3 story points." When you ask what could go wrong, they say: "The training might diverge, the dataset might have label noise, or the model might overfit. Any of those would require starting over with a different approach." What's the right response?

Chapter 3: Managing Non-Determinism

Your QA engineer runs the chatbot test suite. Monday: 91% pass rate. Tuesday, same code, same model, same test suite: 88% pass rate. Wednesday: 93%. Nothing changed. The model is non-deterministic — given the same input, it can produce different outputs depending on sampling temperature, random seeds, and floating-point arithmetic. Welcome to the world where "it works on my machine" is literally true only that one time.

Non-determinism is the defining challenge that separates AI project management from traditional software. In traditional software, a test either passes or fails. In AI, a test passes 91% of the time, and your job is to decide if that's good enough.
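
A rough calculation shows why swings of a few points are expected. The sketch below uses the normal approximation to a binomial proportion and assumes every test case passes independently with the same probability. The suite size of 500 is an assumption borrowed from the eval-set sizes used elsewhere in this lesson, not a property of the scenario above.

```python
import math
from typing import Tuple


def pass_rate_interval(pass_rate: float, n_tests: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% interval for a pass rate measured on n_tests cases."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    standard_error = math.sqrt(pass_rate * (1.0 - pass_rate) / n_tests)
    return pass_rate - z * standard_error, pass_rate + z * standard_error


low, high = pass_rate_interval(0.91, 500)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions a 91% pass rate on 500 cases carries an interval of roughly 88.5% to 93.5%, so day-to-day readings a couple of points apart do not, on their own, show that anything changed.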

CONCEPT: Defining "Done" on a Spectrum

In traditional Scrum, "done" is binary: the feature works or it doesn't. In AI, "done" is a set of thresholds across multiple dimensions:

yaml
# Definition of Done for AI features

model_quality:
  accuracy: ">= 90% on eval set (500 examples)"
  precision: ">= 88% (false positives are costly)"
  recall: ">= 85% (false negatives are acceptable)"
  latency_p95: "< 500ms"
  consistency: "Variance < 3% across 5 eval runs with different seeds"

safety:
  toxicity: "< 0.1% toxic outputs on safety eval (1000 adversarial prompts)"
  hallucination: "< 5% hallucinated facts on factual eval"
  pii_leakage: "0% PII in outputs"

regression:
  no_regression: "All metrics within 2% of previous production model"
  backward_compat: "Existing integrations produce equivalent outputs"

operational:
  monitoring: "Eval metrics dashboarded and alerting configured"
  rollback: "Can revert to previous model version in < 5 minutes"
  documentation: "Model card completed with known limitations"
Accuracy is a spectrum, not a checkbox. "The model is 87% accurate" means nothing without context. 87% accuracy on what dataset? Measured how? With what confidence interval? Compared to what baseline? Your job is to make these questions automatic. Every experiment ticket specifies exactly what "good enough" means BEFORE the experiment starts.
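
A Definition of Done like the one above becomes checkable once every metric carries a direction, since accuracy must stay above its threshold while latency must stay below it. The sketch below covers a subset of the thresholds. The dictionary layout, the metric keys, and the units (fractions for rates, milliseconds for latency) are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# metric -> (direction, threshold). "min" means value >= threshold,
# "max" means value < threshold, matching the ">=" and "<" in the YAML.
DEFINITION_OF_DONE: Dict[str, Tuple[str, float]] = {
    "accuracy": ("min", 0.90),
    "precision": ("min", 0.88),
    "recall": ("min", 0.85),
    "latency_p95_ms": ("max", 500.0),
    "hallucination_rate": ("max", 0.05),
}


def check_done(measured: Dict[str, float]) -> List[str]:
    """Return a list of failures; an empty list means every threshold is met."""
    failures = []
    for name, (direction, threshold) in DEFINITION_OF_DONE.items():
        if name not in measured:
            failures.append(f"{name}: not measured")
            continue
        value = measured[name]
        if direction == "min" and value < threshold:
            failures.append(f"{name}: {value} (need >= {threshold})")
        elif direction == "max" and value >= threshold:
            failures.append(f"{name}: {value} (need < {threshold})")
    return failures


candidate = {"accuracy": 0.91, "precision": 0.89, "recall": 0.86,
             "latency_p95_ms": 620.0, "hallucination_rate": 0.03}
print(check_done(candidate))
```

Treating a missing metric as a failure is a deliberate choice here: a model that was never measured on a dimension has not met the bar for it.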

DESIGN: Eval-Driven Acceptance Criteria

Replace the traditional "acceptance criteria" (manual QA checkboxes) with eval-driven acceptance:

1. Define Eval Suite
Before the sprint starts, create a frozen eval set. Never train on it. This is your ground truth.
2. Set Thresholds
For each metric, define pass/fail thresholds. These go in the experiment ticket as success criteria.
3. Automate Eval
Run eval suite automatically after every training run. Results go to dashboard. No manual testing.
4. Statistical Significance
Run eval 5x with different seeds. Improvement must be statistically significant (p < 0.05), not just "higher this one time."
5. Regression Check
Compare against production model on the SAME eval set. Flag any regression > 2% on any metric.

CODE: Eval Pipeline Script

python
# eval_pipeline.py — Automated eval with statistical testing
import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class EvalResult:
    metric_name: str
    scores: List[float]       # Multiple runs with different seeds
    threshold: float
    baseline_scores: List[float]

    @property
    def mean(self): return np.mean(self.scores)

    @property
    def std(self): return np.std(self.scores, ddof=1)  # sample std: only a handful of seeded runs

    @property
    def passes_threshold(self): return self.mean >= self.threshold

    @property
    def is_significant_improvement(self):
        # Welch's t-test: is the new model significantly better?
        t_stat, p_val = stats.ttest_ind(self.scores, self.baseline_scores,
                                         equal_var=False)
        return p_val < 0.05 and t_stat > 0

    @property
    def has_regression(self):
        # Is the new model significantly WORSE?
        t_stat, p_val = stats.ttest_ind(self.scores, self.baseline_scores,
                                         equal_var=False)
        return p_val < 0.05 and t_stat < 0

def run_eval_gate(results: List[EvalResult]) -> Dict:
    """Returns go/no-go decision for production promotion."""
    report = {"pass": True, "details": []}
    for r in results:
        detail = {
            "metric": r.metric_name,
            "mean": f"{r.mean:.3f} +/- {r.std:.3f}",
            "threshold": r.threshold,
            "passes": r.passes_threshold,
            "significant": r.is_significant_improvement,
            "regression": r.has_regression
        }
        if not r.passes_threshold or r.has_regression:
            report["pass"] = False
        report["details"].append(detail)
    return report

DEBUG: Regression Detection

The scariest moment in AI development: the new model is better on your target metric but worse on something you didn't check. This is silent regression.

| Regression type | How to detect | Prevention |
|---|---|---|
| Metric regression | Eval suite catches it | Run FULL eval suite, not just target metric |
| Distribution shift | Model degrades on a subpopulation | Slice eval by category (e.g., test per language, per topic) |
| Latency regression | Bigger model = slower inference | Include latency in eval gate criteria |
| Safety regression | New model generates toxic content | Safety eval is a BLOCKING gate, not optional |
| Behavioral regression | Model answers differently but "correctly", which breaks downstream consumers | Golden test set: 50 hand-picked examples that must match exactly |
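
Slicing the eval by category, as the distribution-shift row suggests, can be sketched in a few lines. The example assumes eval results are available as `(slice_name, is_correct)` pairs and reuses the 2% regression tolerance from earlier in this chapter; the function names and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy_by_slice(rows: List[Tuple[str, bool]]) -> Dict[str, float]:
    """rows are (slice_name, is_correct) pairs; returns accuracy per slice."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for slice_name, is_correct in rows:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}


def regressed_slices(new: Dict[str, float], prod: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """Slices where the new model is worse than production by more than tolerance."""
    return sorted(s for s in prod if s in new and prod[s] - new[s] > tolerance)


# Made-up data: the new model is better overall but worse on German.
prod_rows = [("en", True)] * 16 + [("en", False)] * 4 + [("de", True)] * 9 + [("de", False)] * 1
new_rows = [("en", True)] * 20 + [("de", True)] * 7 + [("de", False)] * 3
print(regressed_slices(accuracy_by_slice(new_rows), accuracy_by_slice(prod_rows)))
```

In this made-up data the new model's overall accuracy rises from about 83% to 90%, yet the `de` slice drops from 90% to 70%. A single headline metric would hide exactly the regression the table warns about.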

FRONTIER: Continuous Eval

Eval-as-CI: The frontier is treating model evaluation like continuous integration. Every commit to the training pipeline triggers an eval run. Results are reported as PR checks. A model can't be merged to production if eval scores drop. Tools like Evidently AI and Giskard are building this into standard MLOps workflows.
Your model scores 91% accuracy on Monday and 88% on Tuesday, with no code changes. The pass threshold is 90%. A stakeholder says "it passed Monday, ship it." What is the correct response?

Chapter 4: Data Pipeline Management

It is sprint planning. The ML engineer says: "I can start the experiment as soon as the labeled data is ready." The data lead says: "We sent 5,000 examples to the labeling vendor last week. They said 7-10 business days." The sprint is 10 business days. The experiment needs labeled data by day 3 to have time for training and evaluation. The math doesn't work. And nobody realized it until just now.

Data is the #1 blocker for AI teams. Not compute, not model architecture, not engineering talent. Data. Specifically: getting enough high-quality, correctly labeled data to the right team at the right time. Your job as AI Scrum Master is to make data readiness visible and plan around it.

CONCEPT: The Data Supply Chain

Think of data like a supply chain in manufacturing. You need raw materials (unlabeled data), processing (annotation), quality control (validation), and delivery (versioned datasets). Any disruption at any stage blocks everything downstream.

| Stage | Lead time | Common blockers | Your mitigation |
|---|---|---|---|
| Collection | 1-4 weeks | Legal approval for scraping, API rate limits, privacy review | Start collection 2 sprints before experiments need it |
| Annotation | 1-3 weeks | Vendor capacity, unclear guidelines, low agreement | Write annotation guides BEFORE sending to vendors |
| Validation | 2-5 days | Quality issues requiring re-labeling | Spot-check first 100 labels before approving full batch |
| Versioning | 1 day | No versioning = "which dataset did you train on?" | DVC or similar tool, version every dataset change |

Plan data 2 sprints ahead. If your experiments need new labeled data, you must initiate the data pipeline at least 2 sprints before the experiment sprint. This means your sprint planning needs a "data lookahead" section: "What data will sprint N+2 need, and what do we need to start NOW to have it ready?"
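
The lookahead is basic date arithmetic, and writing it down avoids the surprise in the opening scenario. The sketch below counts business days relative to sprint start (day 0), so negative numbers fall before the sprint. Reading "last week" as five business days before sprint start is an assumption, and the function names are illustrative.

```python
def data_arrives_in_time(ordered_day: int, vendor_lead_days: int, needed_by_day: int) -> bool:
    """True if data ordered on ordered_day arrives on or before needed_by_day."""
    return ordered_day + vendor_lead_days <= needed_by_day


def latest_order_day(vendor_lead_days: int, needed_by_day: int) -> int:
    """Latest day the order can be placed and still arrive on time."""
    return needed_by_day - vendor_lead_days


# Opening scenario: vendor quotes 7-10 business days, data is needed by day 3.
print(data_arrives_in_time(ordered_day=-5, vendor_lead_days=7, needed_by_day=3))   # best case
print(data_arrives_in_time(ordered_day=-5, vendor_lead_days=10, needed_by_day=3))  # worst case
print(latest_order_day(vendor_lead_days=10, needed_by_day=3))
```

Planning against the worst-case lead time is what pushes the order date back before the sprint starts, which is the reasoning behind the lookahead rule.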

DESIGN: Data Readiness Board

Add a parallel track to your sprint board specifically for data:

yaml
# data_readiness_board.yaml

columns:
  - name: "Data Needed"
    description: "Experiment requires data that doesn't exist yet"
    cards_include: experiment_id, data_type, volume, deadline

  - name: "Collection"
    description: "Raw data being gathered"
    cards_include: source, method, legal_approval, eta

  - name: "Annotation"
    description: "Data sent to labeling vendor/team"
    cards_include: vendor, volume, guidelines_link, eta, cost

  - name: "QA"
    description: "Labeled data being validated"
    cards_include: sample_checked, agreement_score, issues_found

  - name: "Ready"
    description: "Versioned, validated, available in data store"
    cards_include: version, location, row_count, quality_score

# Key metric: Data Readiness Rate
# = (experiments with data ready on time) / (total experiments planned)
# Target: >= 80%. Below 60% = systemic planning failure.
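
The Data Readiness Rate defined above takes only a few lines to compute. The sketch below assumes each planned experiment is a dictionary with a boolean `data_ready_on_time` field (a hypothetical export format). The status bands follow the 80% and 60% cut-offs in the comments, and the label for the range in between is an assumption.

```python
from typing import Dict, List


def data_readiness_rate(planned: List[Dict]) -> float:
    """(experiments with data ready on time) / (total experiments planned)."""
    if not planned:
        raise ValueError("no experiments planned")
    ready = sum(1 for e in planned if e["data_ready_on_time"])
    return ready / len(planned)


def readiness_status(rate: float) -> str:
    if rate >= 0.8:
        return "on target"
    if rate < 0.6:
        return "systemic planning failure"
    return "below target"  # assumed label for the 60-80% band


sprint_plan = [
    {"id": "EXP-042", "data_ready_on_time": True},
    {"id": "EXP-043", "data_ready_on_time": True},
    {"id": "EXP-044", "data_ready_on_time": False},
    {"id": "EXP-045", "data_ready_on_time": True},
]
rate = data_readiness_rate(sprint_plan)
print(rate, readiness_status(rate))
```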

CODE: Data Quality Monitoring

python
# data_quality_monitor.py — Track annotation quality in real time
import numpy as np
from collections import Counter
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class AnnotationBatch:
    batch_id: str
    vendor: str
    total_examples: int
    labels: List[Dict]         # [{"id": "ex_001", "label": "positive", "annotator": "a1"}, ...]
    double_labels: List[Dict]  # [{"label_a": "positive", "label_b": "negative"}, ...]

    @property
    def inter_annotator_agreement(self) -> float:
        """Cohen's kappa between two annotators on double-labeled examples."""
        if not self.double_labels:
            return 0.0  # No double-labeled examples, so agreement cannot be measured
        total = len(self.double_labels)
        agreements = sum(
            1 for d in self.double_labels
            if d["label_a"] == d["label_b"]
        )
        p_observed = agreements / total
        # Chance agreement: for each label, multiply the two annotators' marginal rates
        counts_a = Counter(d["label_a"] for d in self.double_labels)
        counts_b = Counter(d["label_b"] for d in self.double_labels)
        p_chance = sum(
            (counts_a[label] / total) * (counts_b[label] / total)
            for label in counts_a.keys() | counts_b.keys()
        )
        if p_chance == 1.0:
            return 1.0
        return (p_observed - p_chance) / (1 - p_chance)

    @property
    def label_distribution(self) -> Dict[str, float]:
        """Check for label imbalance."""
        counts = Counter(l["label"] for l in self.labels)
        total = sum(counts.values())
        return {k: v/total for k, v in counts.items()}

    def quality_report(self) -> Dict:
        iaa = self.inter_annotator_agreement
        dist = self.label_distribution
        issues = []
        if iaa < 0.6:
            issues.append(f"LOW AGREEMENT: kappa={iaa:.2f} (need >= 0.6)")
        for label, pct in dist.items():
            if pct > 0.9 or pct < 0.05:
                issues.append(f"IMBALANCE: '{label}' is {pct:.0%} of labels")
        return {
            "batch_id": self.batch_id,
            "agreement": iaa,
            "distribution": dist,
            "issues": issues,
            "status": "PASS" if not issues else "FAIL"
        }
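To see why the report uses kappa rather than raw agreement, it can help to work one small case by hand. The sketch below is a standalone version of Cohen's kappa for two annotators; the function name and the ten example labels are made up for illustration.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same examples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Probability that both annotators pick the same label by chance
    p_chance = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_chance == 1.0:
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


# Illustrative labels only: the annotators disagree on 2 of 10 examples
a = ["pos", "pos", "pos", "neg", "neg", "neg", "pos", "neg", "pos", "neg"]
b = ["pos", "pos", "neg", "neg", "neg", "pos", "pos", "neg", "pos", "neg"]
raw = sum(x == y for x, y in zip(a, b)) / len(a)
print(f"raw agreement: {raw:.0%}, kappa: {cohens_kappa(a, b):.2f}")
```

With balanced labels, 80% raw agreement works out to a kappa of about 0.6, because half of the agreement would be expected by chance. That gap is the reason to quote kappa to a vendor instead of "percent matching."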

DEBUG: When Data Blocks the Sprint

Problem | Diagnosis | Emergency fix | Systemic fix
Vendor missed delivery date | Scope was unclear, vendor underestimated | Use partial batch, reduce experiment scope | Send 10% pilot batch first, validate quality and timeline
Labels are low quality (kappa < 0.6) | Annotation guidelines are ambiguous | Re-label a subset with your own team | Create detailed guidelines with examples, run calibration sessions
Data has PII that blocks legal review | Nobody checked before sending to vendor | Apply PII scrubbing pipeline, re-annotate | PII check is a gate BEFORE annotation starts
"Which dataset did we train on?" | No data versioning | Hash the dataset file, log it in experiment tracker | DVC + data registry + version in every experiment config

FRONTIER: Synthetic Data and Active Learning

Synthetic data generation: Use LLMs to generate training data. A large model such as GPT-4 can produce thousands of labeled examples at a small fraction of the cost of human annotation. The catch: synthetic data is biased toward the generating model's worldview. Always validate synthetic data against real-world eval sets. A reasonable starting mix is 80% real data plus 20% synthetic data aimed at edge cases.
Active learning: Instead of labeling 10K random examples, train a model on 1K, then have it flag the examples it is most uncertain about. Label those first. Active learning can reach the same accuracy with up to 3x less labeled data, saving weeks of annotation time.
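The "flag the examples it's most uncertain about" step is often implemented as uncertainty sampling. The sketch below ranks unlabeled examples by the entropy of the model's predicted class probabilities; the example ids and probabilities are invented, and entropy is only one of several reasonable uncertainty measures.

```python
import math
from typing import Dict, List


def entropy(probs: List[float]) -> float:
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions: Dict[str, List[float]], budget: int) -> List[str]:
    """Return the ids of the `budget` most uncertain unlabeled examples."""
    ranked = sorted(
        predictions,
        key=lambda ex_id: entropy(predictions[ex_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs: example id -> class probabilities
predictions = {
    "ex_001": [0.98, 0.02],  # confident
    "ex_002": [0.55, 0.45],  # uncertain
    "ex_003": [0.70, 0.30],
    "ex_004": [0.50, 0.50],  # most uncertain
}
print(select_for_labeling(predictions, budget=2))
```

For sprint planning, the practical consequence is that annotation becomes a loop (train, select, label, retrain) rather than one large vendor order, so the data Kanban needs several small batches instead of one big one.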
Data Pipeline Timeline

Visualize the data pipeline stages and their lead times. Red sections indicate blockers. Green = on track. Click Add Blocker to simulate common disruptions.

Your team needs 10,000 labeled examples for next sprint's experiment. The labeling vendor says delivery will take 12 business days. Your sprint is 10 days. What is the BEST approach?

Chapter 5: Experiment Tracking & Visibility

The VP asks: "How close are we to shipping the model?" You pull up the researcher's Jupyter notebook. There are 47 cells with names like "test_v3_final_FINAL_2." The loss curve is in a matplotlib plot embedded in cell 23. The best hyperparameters are in a comment on cell 31. The eval results are in a Slack message from last Thursday. This is not experiment tracking. This is chaos.

Experiment tracking is the practice of systematically recording every experiment's configuration, results, and artifacts so that anyone on the team (including future-you) can reproduce, compare, and build on past work. For the Scrum Master, it's also the source of truth for sprint progress. You don't ask "how's the experiment going?" — you look at the dashboard.

CONCEPT: The Experiment Log as Source of Truth

Every experiment produces three types of artifacts that must be tracked:

Artifact type | Examples | Why it matters
Configuration | Model, hyperparameters, dataset version, code commit | Reproducibility: can you rerun this exact experiment?
Metrics | Loss curves, accuracy, F1, latency, cost | Comparison: is this better than the last experiment?
Artifacts | Model weights, eval predictions, error analysis | Promotion: which model file goes to production?
The experiment log replaces the burndown chart. In traditional Scrum, the burndown chart shows progress. In AI Scrum, the experiment log is your burndown. Each row is a completed experiment. The metric column shows whether you're converging on your target. The decision column shows whether the team is making good kill/continue calls. This is what you show in sprint review.

DESIGN: Dashboards for Different Audiences

yaml
# experiment_dashboards.yaml — Three views, one data source

# 1. RESEARCHER VIEW (W&B / MLflow)
researcher_dashboard:
  charts:
    - loss_curves: "Training + validation loss over epochs, per experiment"
    - hyperparameter_sweep: "Parallel coordinates plot of lr, batch_size, etc."
    - confusion_matrix: "Per-class performance on eval set"
    - error_examples: "Top 20 hardest examples the model gets wrong"
  filters: [model_type, dataset_version, date_range]
  refresh: real-time

# 2. SCRUM MASTER VIEW (synthesized from W&B data)
scrum_dashboard:
  cards:
    - experiments_this_sprint: {total: 5, completed: 3, killed: 1, active: 1}
    - best_accuracy: {current: "88.3%", target: "90%", gap: "1.7%"}
    - compute_budget: {used: "72 GPU-hrs", total: "120 GPU-hrs", pct: "60%"}
    - data_readiness: {ready: 3, in_progress: 1, blocked: 1}
  table:
    columns: [experiment_id, hypothesis, status, best_metric, decision]
  refresh: hourly

# 3. STAKEHOLDER VIEW (executive summary)
stakeholder_dashboard:
  cards:
    - project_phase: "Convergence (Week 6 of 16)"
    - headline_metric: "Best model: 88.3% accuracy (target: 90%)"
    - confidence: "High — on track to hit 90% by week 10"
    - next_milestone: "Sprint 4 Review — May 30"
    - risks: "Data vendor delay on safety eval set"
  chart:
    - accuracy_over_time: "Weekly best accuracy with trend line"
  refresh: weekly

CODE: Experiment Logger Integration

python
# experiment_logger.py — Wraps W&B/MLflow for sprint tracking
import wandb
from datetime import datetime
from typing import Optional

class SprintExperimentLogger:
    def __init__(self, project: str, sprint_id: str):
        self.sprint_id = sprint_id
        self.project = project

    def start_experiment(self, config: dict) -> str:
        """Start a tracked experiment with sprint metadata."""
        run = wandb.init(
            project=self.project,
            config={
                **config,
                "sprint_id": self.sprint_id,
                "started_at": datetime.now().isoformat(),
                "hypothesis": config.get("hypothesis", "Not specified"),
                "compute_budget_hrs": config.get("compute_budget_hrs", 0),
                "kill_criteria": config.get("kill_criteria", "Not specified"),
            },
            tags=[f"sprint-{self.sprint_id}", config.get("model_type", "unknown")]
        )
        return run.id

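    def log_decision(self, experiment_id: str, decision: str,
                     final_metric: float, learnings: str):
        """Log the kill/promote/iterate decision on the currently active run."""
        # Write to the run summary so sprint_summary() can read the values back
        wandb.run.summary.update({
            "decision": decision,           # killed | promoted | iterating
            "final_metric": final_metric,
            "learnings": learnings,
            "decided_at": datetime.now().isoformat()
        })
        # Also update the Jira ticket via API
        self._update_jira_ticket(experiment_id, decision, final_metric)

    def _update_jira_ticket(self, experiment_id: str, decision: str,
                            final_metric: float) -> None:
        """Placeholder: connect this to your issue tracker's API."""
        pass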

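    def sprint_summary(self) -> dict:
        """Generate sprint review summary from experiment data."""
        api = wandb.Api()
        runs = api.runs(self.project,
                       filters={"config.sprint_id": self.sprint_id})
        summary = {
            "total": 0, "killed": 0, "promoted": 0,
            "iterating": 0, "active": 0,
            "best_metric": 0, "learnings": []
        }
        for run in runs:
            summary["total"] += 1
            decision = run.summary.get("decision", "active")
            # Count unrecognized decision values as active instead of raising KeyError
            if decision not in ("killed", "promoted", "iterating"):
                decision = "active"
            summary[decision] += 1
            metric = run.summary.get("final_metric", 0)
            if metric > summary["best_metric"]:
                summary["best_metric"] = metric
            # Collect learnings so the sprint review has them in one place
            learning = run.summary.get("learnings")
            if learning:
                summary["learnings"].append(learning)
        return summary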

Translating Experiments to Business Metrics

Stakeholders don't care about accuracy. They care about outcomes. Your job is to translate:

ML metric | Business translation | How to present
Accuracy: 72% → 88% | "We went from deflecting 72% of support tickets to 88%, which is 1,600 fewer human-handled tickets per week" | Show the dollar savings: 1,600 tickets × $5 avg cost = $8K/week saved
Latency: 2s → 400ms | "Customer wait time dropped from 2 seconds to under half a second" | Show before/after UX recording
Hallucination rate: 8% → 2% | "The chatbot now gives wrong answers 1 in 50 times instead of 1 in 12" | Show specific examples of prevented hallucinations
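The arithmetic behind the first row can be made explicit so the same translation is reusable for other metrics. This is a sketch under one stated assumption: the 1,600-ticket figure implies a volume of about 10,000 tickets per week (16 percentage points of 10,000), which the table does not state directly. The function and field names are hypothetical.

```python
def weekly_savings(tickets_per_week: int, rate_before: float,
                   rate_after: float, cost_per_ticket: float) -> dict:
    """Translate a deflection-rate improvement into tickets and dollars per week."""
    extra_deflected = round(tickets_per_week * (rate_after - rate_before))
    return {
        "extra_tickets_deflected": extra_deflected,
        "weekly_savings_usd": extra_deflected * cost_per_ticket,
    }


# Assumed volume: 10,000 tickets/week (inferred from the table, not stated in it)
result = weekly_savings(
    tickets_per_week=10_000, rate_before=0.72, rate_after=0.88, cost_per_ticket=5.0
)
print(result)
```

Keeping the volume and unit cost as explicit inputs also makes it obvious which numbers the finance or support team needs to confirm before the slide goes to leadership.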

FRONTIER: Automated Experiment Summarization

LLM-powered experiment summaries: Feed your W&B experiment logs to an LLM at the end of each sprint. Generate: (1) Plain-English summary of what was tried and learned. (2) Recommendations for next sprint's experiments. (3) Risk assessment: "At current trajectory, we'll hit target in 3 sprints." This becomes your sprint review presentation.
Experiment Dashboard

A simulated experiment tracking dashboard. Each bar represents an experiment's best metric. Green = promoted, red = killed, yellow = active. Click Run Experiment to add results.

The VP asks "How's the AI project going?" Your best model accuracy is 88.3% against a 90% target. What is the most useful response?

Chapter 6: Research-to-Production Handoffs

The researcher posts in Slack: "Model is ready! 92% accuracy! Here's the notebook." The ML engineer opens the notebook. It imports from a local path that doesn't exist on the production server. The data preprocessing uses a different tokenizer than the inference pipeline. The model was trained on Python 3.11 with PyTorch 2.1, but production runs Python 3.9 with PyTorch 1.13. The "92% accuracy" was measured on a test set that accidentally overlapped with the training set. This is the "works in notebook, breaks in prod" problem, and it kills more AI projects than bad models.

CONCEPT: The Notebook-to-Production Gap

Research and production have fundamentally different requirements. A researcher optimizes for speed of iteration. A production engineer optimizes for reliability and scale. The gap between them is where AI projects die.

Dimension | Research | Production | The gap
Code | Jupyter notebook, quick and dirty | Tested, typed, packaged Python modules | Rewrite everything
Data | Local CSV, ad-hoc preprocessing | Versioned datasets, pipeline DAGs | Different preprocessing = different results
Deps | Whatever pip installed today | Locked requirements, container images | Version conflicts, CUDA mismatches
Infra | Single GPU, batch processing | Multi-GPU, real-time, auto-scaling | 10x latency at scale
Eval | "Looks good on my test set" | Automated eval suite, A/B test in prod | Offline eval ≠ online performance
The handoff checklist is your most important artifact. Every model that crosses the research-to-production boundary must pass through a checklist. No exceptions. No "we'll fix it in production." The checklist catches 80% of production failures before they happen.

DESIGN: The Production Readiness Checklist

markdown
# Model Production Readiness Checklist

## 1. Reproducibility
- [ ] Training code runs from a single command (not a notebook)
- [ ] All dependencies pinned in requirements.txt / pyproject.toml
- [ ] Docker image builds and runs successfully
- [ ] Random seeds documented; results reproducible within 1%
- [ ] Dataset version tracked (DVC hash or equivalent)

## 2. Evaluation
- [ ] Eval suite passes on PRODUCTION eval set (not training set)
- [ ] No data leakage: train/eval sets verified disjoint
- [ ] Metrics run 5x with different seeds (variance documented)
- [ ] Compared against current production model (no regression)
- [ ] Safety eval passes (toxicity, hallucination, PII)
- [ ] Latency measured under production-like load

## 3. Integration
- [ ] Input/output schema matches API contract
- [ ] Preprocessing pipeline is IDENTICAL to training preprocessing
- [ ] Model serves via the production serving framework (TorchServe, vLLM, etc.)
- [ ] Error handling for malformed inputs
- [ ] Graceful degradation when model times out

## 4. Operational Readiness
- [ ] Model card written (purpose, limitations, biases)
- [ ] Monitoring dashboards configured (accuracy, latency, error rate)
- [ ] Alerting rules set (accuracy drops > 5%, latency p99 > 2x)
- [ ] Rollback procedure tested (revert to previous model in < 5 min)
- [ ] A/B test configured (serve new model to 5%, measure, then ramp)

## 5. Sign-offs
- [ ] ML researcher: "Model meets eval criteria"
- [ ] ML engineer: "Inference pipeline passes integration tests"
- [ ] Data engineer: "Data pipeline feeds correct data"
- [ ] Product manager: "Feature meets user requirements"
- [ ] Security/Legal: "Model complies with policies" (if applicable)

CODE: Handoff Automation

python
# handoff_validator.py — Automated checks for production readiness
import subprocess
import json
from pathlib import Path

class HandoffValidator:
    def check_reproducibility(self, model_dir: Path) -> dict:
        checks = {}
        # 1. requirements.txt exists and is pinned
        req_file = model_dir / "requirements.txt"
        # Ignore blank lines and comments, including indented ones
        checks["deps_pinned"] = req_file.exists() and all(
            "==" in line for line in req_file.read_text().splitlines()
            if line.strip() and not line.strip().startswith("#")
        )
        # 2. Dockerfile exists
        checks["dockerfile"] = (model_dir / "Dockerfile").exists()
        # 3. No notebooks in production code
        checks["no_notebooks"] = not any(model_dir.glob("**/*.ipynb"))
        # 4. Data version tracked
        checks["data_versioned"] = (
            (model_dir / ".dvc").exists() or
            (model_dir / "data_version.json").exists()
        )
        return checks

    def check_eval_integrity(self, eval_config: dict) -> dict:
        checks = {}
        # Verify train/eval sets are disjoint
        train_ids = set(eval_config["train_ids"])
        eval_ids = set(eval_config["eval_ids"])
        overlap = train_ids & eval_ids
        checks["no_data_leakage"] = len(overlap) == 0
        if overlap:
            checks["leaked_ids"] = list(overlap)[:10]
        return checks

    def full_check(self, model_dir: Path, eval_config: dict) -> dict:
        repro = self.check_reproducibility(model_dir)
        eval_int = self.check_eval_integrity(eval_config)
        all_checks = {**repro, **eval_int}
        passed = all(v if isinstance(v, bool) else True for v in all_checks.values())
        return {"passed": passed, "checks": all_checks}
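The data_versioned check above only tests whether a .dvc directory or a data_version.json file exists. For teams not yet using DVC, one hedged way to produce that file is to hash the dataset and write the digest next to the model code. The JSON field names below are assumptions; only the file name comes from the validator.

```python
import hashlib
import json
from pathlib import Path


def dataset_fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a dataset file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_data_version(model_dir: Path, dataset_path: Path) -> Path:
    """Write data_version.json into model_dir (field names are assumptions)."""
    record = {
        "dataset_file": dataset_path.name,
        "sha256": dataset_fingerprint(dataset_path),
    }
    out = model_dir / "data_version.json"
    out.write_text(json.dumps(record, indent=2))
    return out
```

Logging the same digest in the experiment config gives a direct answer to "which dataset did we train on?" even when file names get reused.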

DEBUG: Common Handoff Failures

Failure | Root cause | Detection | Prevention
Model accuracy drops 10% in prod | Preprocessing differs between training and serving | Run eval suite through the SERVING pipeline, not research pipeline | Share preprocessing code between training and serving
Model loads but crashes on edge cases | Input validation missing in serving code | Fuzz testing with malformed inputs | Input schema validation in serving layer
Latency 5x slower than expected | Research used batch processing; prod needs single-request | Load test before promotion | Latency target in experiment ticket
"92% accuracy" was on contaminated eval | Train/eval overlap | Handoff validator catches it | Eval set is created and frozen before ANY training begins

FRONTIER: MLOps as Code

The handoff disappears when research and production share infrastructure. Tools like MLflow Model Registry, Vertex AI Pipelines, and Amazon SageMaker create a continuous path from experiment to deployment. The researcher promotes a model version. The CI/CD pipeline runs the eval suite, builds the container, and deploys with canary rollout. No human handoff needed. The Scrum Master monitors the pipeline instead of coordinating people.
Research-to-Production Pipeline

Watch a model move from research to production. Each gate checks a specific requirement. Red gates block deployment. Click Promote Model to start the handoff.

A researcher says their model achieves 95% accuracy and is "ready for production." What is the FIRST thing you should verify?

Chapter 7: Stakeholder Communication for AI

The CEO walks into your sprint review. The team just achieved 87% accuracy on the customer support chatbot. The CEO asks: "Is 87% good?" The researcher starts explaining F1 scores and confusion matrices. The CEO's eyes glaze over. You step in: "It means the chatbot gives the right answer 87 times out of 100. Our target is 92 — right now, 13 out of 100 customers would get a wrong answer, which would frustrate them. We need five more percentage points. Based on our experiments, we'll get there in about three weeks."

That's the job. Translating uncertainty into actionable information that non-technical people can make decisions with. It's possibly the single most valuable skill an AI Scrum Master has.

CONCEPT: The Three Lies of AI Timelines

Stakeholders ask three questions. Each one invites a lie:

Question | The lie | The truth | How to say it
"When will it be ready?" | "End of Q2" | We don't know. AI timelines are probabilistic. | "Based on our current trajectory, there's a 70% chance we hit the target by end of Q2. The 30% risk is data quality issues."
"How accurate is it?" | "87% accurate" | 87% on our eval set, which might not represent real-world usage. | "87% on our test set of 500 questions. We expect 80-85% in production because real users ask harder questions."
"Can you just add [feature]?" | "Sure, next sprint" | Each new capability requires a new eval suite, new data, new experiments. | "We can prototype it, but validating it to production quality is 4-6 weeks of experiments."
"The model is 87% accurate" means nothing without context. Always provide: (1) Accuracy on WHAT data? (2) Compared to WHAT baseline? (3) What does a wrong answer LOOK LIKE? An 87% accurate spam filter is fine. An 87% accurate medical diagnosis system is dangerous. Context determines whether 87% is a celebration or a crisis.

DESIGN: The Stakeholder Communication Framework

markdown
# AI Project Status Report Template

## One-Line Summary
[Current metric] / [Target metric] — [Trajectory statement]
Example: "88% / 92% — On track for 92% in Sprint 7 (3 weeks)"

## Progress Since Last Report
- Experiments completed: 4 (2 killed, 1 promoted, 1 iterating)
- Key learning: [What we learned that changes our approach]
- Best metric improvement: [X% → Y%] via [what technique]

## Risks and Blockers
| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Data vendor delay | 1 sprint slip | Medium | Using synthetic data as bridge |
| GPU shortage | Can't run 2 experiments in parallel | Low | Reserved spot instances |

## What We Need from Leadership
- [ ] Approval for $3K additional labeling budget
- [ ] Decision: ship at 90% or wait for 92%?
- [ ] Legal review scheduled before launch date

## Next Milestone
[What we'll demonstrate at the next sprint review]
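A headline number in a status report is an estimate from a finite eval set, so it carries sampling noise before any production effects are considered. The sketch below computes a Wilson score interval for a proportion, using the "87% on 500 questions" example from earlier in this chapter. It covers sampling uncertainty only and says nothing about real users asking harder questions.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / total + z * z / (4 * total * total)
    ) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(correct=435, total=500)  # 435/500 = 87%
print(f"87% on 500 questions: plausible range {low:.1%} to {high:.1%}")
```

For this example the range is roughly 84% to 90%, which is a useful caution when someone treats a one-point change between sprints as real progress on a 500-question eval set.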

Managing Timeline Expectations

Use confidence cones instead of single-point estimates. A confidence cone shows the range of possible outcomes:

AI Project Confidence Cone

Visualize how confidence narrows as the project progresses. Early estimates have wide ranges. As experiments provide data, the range shrinks. Adjust the progress slider to see how uncertainty decreases.

CODE: Automated Status Report Generator

python
# status_report.py — Generate stakeholder-friendly status from experiment data
from datetime import datetime

def generate_status_report(experiments: list, target_metric: float,
                            sprint_num: int, total_sprints: int) -> str:
    completed = [e for e in experiments if e["status"] != "active"]
    best = max((e["best_metric"] for e in completed), default=0)
    gap = target_metric - best
    killed = len([e for e in completed if e["status"] == "killed"])
    promoted = len([e for e in completed if e["status"] == "promoted"])

    # Estimate sprints to target based on improvement rate
    metrics_by_sprint = {}  # Group best metric per sprint
    for e in completed:
        s = e.get("sprint", sprint_num)
        if s not in metrics_by_sprint or e["best_metric"] > metrics_by_sprint[s]:
            metrics_by_sprint[s] = e["best_metric"]

    if gap <= 0:
        # Best completed experiment already meets or exceeds the target
        eta = "Target reached"
    elif len(metrics_by_sprint) >= 2:
        sprints = sorted(metrics_by_sprint.keys())
        improvement_per_sprint = (
            metrics_by_sprint[sprints[-1]] - metrics_by_sprint[sprints[0]]
        ) / (sprints[-1] - sprints[0])
        if improvement_per_sprint > 0:
            sprints_to_target = gap / improvement_per_sprint
            eta = f"~{sprints_to_target:.0f} sprints at current rate"
        else:
            eta = "STALLED: no net improvement since the first tracked sprint"
    else:
        eta = "Insufficient data for estimate"

    return f"""
# Sprint {sprint_num} Status ({datetime.now().strftime('%B %d')})
Best metric: {best:.1%} / Target: {target_metric:.1%} (gap: {gap:.1%})
ETA to target: {eta}
Experiments: {len(completed)} completed ({killed} killed, {promoted} promoted)
"""

DEBUG: When Stakeholder Communication Breaks

Symptom | Root cause | Fix
Executives surprise-ask for features mid-sprint | They don't understand the experiment cycle | Educate: "Each new capability = 2-4 sprint experiment cycle"
PM overpromises to customers | You gave a best-case estimate without the range | Always give confidence ranges: "70% chance by June, 90% by July"
Team is demoralized by "failed" experiments | Success is framed as accuracy numbers, not learning | Reframe: every sprint review starts with "What we learned"
Board thinks AI is a waste of money | No connection between experiments and business value | Translate every metric to dollars or customer impact
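One simple way to avoid handing the PM a single best-case number is to derive an optimistic and a pessimistic estimate from the team's own history. The sketch below uses the fastest and slowest sprint-over-sprint gains observed so far. It is a rough heuristic under that assumption, not a probability: it does not produce statements like "70% chance by June," which need a real forecast model or judgment.

```python
import math
from typing import List, Optional, Tuple


def eta_range(best_by_sprint: List[float], target: float) -> Optional[Tuple[int, int]]:
    """(optimistic, pessimistic) sprints to target from observed per-sprint gains.

    best_by_sprint holds the best metric at the end of each sprint, oldest first.
    """
    gap = target - best_by_sprint[-1]
    if gap <= 0:
        return (0, 0)
    gains = [b - a for a, b in zip(best_by_sprint, best_by_sprint[1:]) if b > a]
    if not gains:
        return None  # no positive gain observed yet, so there is no basis for a range
    # round() guards against floating-point noise pushing ceil() up by one
    optimistic = math.ceil(round(gap / max(gains), 9))
    pessimistic = math.ceil(round(gap / min(gains), 9))
    return optimistic, pessimistic


# Illustrative history in percentage points, target 92
print(eta_range([80, 84, 86, 88], target=92))
```

Reporting both ends ("one to two sprints at the pace we have seen") sets the expectation that the slower case is normal, not a failure.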

FRONTIER: AI-Generated Stakeholder Reports

LLM-powered reporting: Your experiment tracking system feeds into an LLM that generates weekly stakeholder updates. It reads the experiment logs, identifies key learnings, calculates trajectory, and writes the report in executive-friendly language. You review and send. This saves 2-3 hours per week and ensures consistency.
Your AI model achieves 87% accuracy. The target is 92%. A stakeholder asks: "Can we ship at 87%?" What information do you need to provide for a good decision?

Chapter 8: GenAI/LLM-Specific Scrum

Your team is building a customer support chatbot powered by an LLM. The "code" is a 500-word system prompt. The "testing" is running 200 customer questions and having three humans grade the answers. The "deployment" is changing the model name in an API call from GPT-3.5-turbo to GPT-4o. The "performance optimization" is rewriting a paragraph of the prompt. Nothing about this looks like traditional software development, and your sprint process needs to reflect that.

CONCEPT: Prompt Engineering as a Sprint Activity

Prompt engineering sprints are the new unit of work for GenAI teams. A single prompt change can shift model behavior more than weeks of fine-tuning. But prompt changes are also unpredictable — a change that improves one capability can degrade another.

Activity | Traditional ML equivalent | Sprint time
System prompt iteration | Architecture search | 2-5 days per major revision
Few-shot example curation | Training data curation | 1-3 days
RAG pipeline tuning | Feature engineering | 1-2 weeks
Eval set creation | Test suite authoring | 3-5 days (ongoing)
Model migration (GPT-4 → Claude) | Framework migration | 2-4 weeks (prompt rewriting + re-eval)
Fine-tuning | Model training | 1-2 weeks (data prep + training + eval)
Eval-driven development for LLMs. In traditional software, you write tests first, then code. In GenAI, you write evals first, then prompts. Every prompt change MUST be validated against the eval set before merging. This is non-negotiable. A "small prompt tweak" can catastrophically degrade one category of answers while improving another.
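The "validate before merging" rule is easiest to enforce when the eval suite reports a score per category and a gate compares the candidate prompt against the current one. The sketch below is one minimal way to do that; the 2-point tolerance and the category names are assumptions, and the scores mirror the refund scenario used in this chapter's quiz.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                     max_drop: float = 0.02) -> List[str]:
    """Categories where the candidate scores more than max_drop below baseline."""
    return [
        category for category, base_score in baseline.items()
        if base_score - candidate.get(category, 0.0) > max_drop
    ]


def can_merge(baseline: Dict[str, float], candidate: Dict[str, float],
              max_drop: float = 0.02) -> bool:
    """A prompt change is mergeable only if no category regresses."""
    return not find_regressions(baseline, candidate, max_drop)


baseline = {"refunds": 0.78, "general_qa": 0.89}
candidate = {"refunds": 0.91, "general_qa": 0.82}
print(find_regressions(baseline, candidate), can_merge(baseline, candidate))
```

A single aggregate score would hide this trade: the refund gain and the general Q&A loss partly cancel out. Gating per category is what surfaces the regression before it reaches users.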

DESIGN: The GenAI Sprint Structure

yaml
# genai_sprint_template.yaml

sprint_week_1:
  monday:
    - Review last sprint's eval results
    - Prioritize prompt improvements by impact
    - Assign RAG pipeline experiments

  tuesday_thursday:
    - Prompt engineering: iterate on system prompt
    - RAG experiments: test different chunking, retrieval, reranking
    - Run eval suite after EACH significant change
    - Daily eval check-in: "What moved? What regressed?"

  friday:
    - Eval freeze: run full eval suite on best candidates
    - Document prompt changelog (version control the prompt!)

sprint_week_2:
  monday_wednesday:
    - A/B test top 2 prompt versions with real traffic (5% canary)
    - Fine-tuning experiment (if applicable)
    - RAG pipeline: index new documents, test retrieval quality

  thursday:
    - Production promotion decision
    - Sprint review prep: compile eval results + business metrics

  friday:
    - Sprint review: show before/after on key scenarios
    - Retrospective: what eval gaps did we discover?
    - Plan next sprint's eval set improvements

CODE: Prompt Version Control

python
# prompt_registry.py — Version control for prompts
import json
import hashlib
from datetime import datetime
from pathlib import Path

class PromptRegistry:
    def __init__(self, registry_path: str = "prompts/"):
        self.path = Path(registry_path)
        self.path.mkdir(parents=True, exist_ok=True)

    def register(self, name: str, prompt: str,
                  metadata: dict = None) -> str:
        """Register a prompt version with hash-based versioning."""
        version = hashlib.sha256(prompt.encode()).hexdigest()[:8]
        record = {
            "name": name,
            "version": version,
            "prompt": prompt,
            "created_at": datetime.now().isoformat(),
            "metadata": metadata or {},
            "char_count": len(prompt),
            "word_count": len(prompt.split()),
        }
        filepath = self.path / f"{name}_v{version}.json"
        filepath.write_text(json.dumps(record, indent=2))
        return version

    def compare(self, name: str, v1: str, v2: str) -> dict:
        """Compare word counts and creation dates of two prompt versions."""
        p1 = json.loads((self.path / f"{name}_v{v1}.json").read_text())
        p2 = json.loads((self.path / f"{name}_v{v2}.json").read_text())
        return {
            "v1_words": p1["word_count"],
            "v2_words": p2["word_count"],
            "delta_words": p2["word_count"] - p1["word_count"],
            "v1_date": p1["created_at"],
            "v2_date": p2["created_at"],
        }
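The compare method above reports sizes and dates but not what actually changed in the text. A line-level diff is often what reviewers want in a prompt changelog, and Python's standard difflib module can produce one. The helper name and the sample prompts below are illustrative only.

```python
import difflib


def prompt_diff(old_prompt: str, new_prompt: str,
                old_label: str = "v1", new_label: str = "v2") -> str:
    """Unified diff of two prompt texts, compared line by line."""
    diff = difflib.unified_diff(
        old_prompt.splitlines(),
        new_prompt.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return "\n".join(diff)


old = "You are a support assistant.\nAnswer briefly."
new = (
    "You are a support assistant.\n"
    "Answer briefly.\n"
    "For refund questions, cite the refund policy."
)
print(prompt_diff(old, new))
```

Keeping prompts as one instruction per line makes these diffs far more readable than a single long paragraph, which is worth agreeing on as a team convention.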

DESIGN: RAG Pipeline Sprint Workflow

RAG (Retrieval-Augmented Generation) is the backbone of most production GenAI applications. Each component of the RAG pipeline is a separate experiment axis:

Chunking Strategy
How documents are split: fixed-size, semantic, recursive. Each choice affects retrieval quality.
Embedding Model
Which model converts text to vectors. Trade-offs: speed vs. quality vs. cost. OpenAI, Cohere, open-source.
Retrieval
Vector search, BM25, hybrid. Top-k selection. Reranking with cross-encoders.
Synthesis
System prompt + retrieved context + user query = LLM response. Prompt template matters enormously.
Test one variable at a time. The biggest mistake in RAG development is changing the chunking strategy, embedding model, AND prompt template simultaneously. You can't learn anything because you don't know what caused the change. Run experiments that isolate one variable. This is slower but produces actionable results.
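"One variable at a time" can be turned into a mechanical step in sprint planning: start from a baseline configuration and generate variants that each change exactly one field. The sketch below does that; the component names are placeholders, not recommendations.

```python
from typing import Dict, List


def one_factor_variants(baseline: Dict[str, str],
                        alternatives: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Build experiment configs that each differ from the baseline in one field."""
    variants = []
    for field_name, options in alternatives.items():
        for option in options:
            if option != baseline[field_name]:
                variants.append({**baseline, field_name: option})
    return variants


# Hypothetical component names, for illustration only
baseline = {"chunking": "fixed_512", "embedding": "model_a", "retrieval": "vector"}
alternatives = {
    "chunking": ["semantic", "recursive"],
    "retrieval": ["bm25", "hybrid"],
}
variants = one_factor_variants(baseline, alternatives)
for cfg in variants:
    changed = [k for k in cfg if cfg[k] != baseline[k]]
    print(changed[0], "->", cfg[changed[0]])
```

Each generated config maps naturally to one experiment ticket, and the baseline run becomes the shared control that every ticket is compared against.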

Model Migration Planning

Every 6-12 months, a new model generation launches (GPT-4 → GPT-4o → GPT-5, Claude 3 → Claude 4). Migration is a multi-sprint project:

yaml
# model_migration_plan.yaml — Claude 3.5 → Claude 4

sprint_1_eval:
  - Run FULL eval suite on new model with existing prompts
  - Identify regressions (new model is different, not just better)
  - Benchmark latency and cost differences
  - Decision: is the upgrade worth the migration effort?

sprint_2_prompt_adaptation:
  - Rewrite prompts for new model's capabilities/quirks
  - New model may need less hand-holding (remove workarounds)
  - New model may have different failure modes (add guardrails)
  - Run eval suite after each prompt revision

sprint_3_integration:
  - Update API integration (new endpoints, parameters)
  - Update token budgets (new model may have different context window)
  - Load test under production traffic patterns
  - Canary deployment: 5% traffic to new model

sprint_4_rollout:
  - Monitor canary for 1 week
  - Ramp to 50%, then 100%
  - Keep old model warm for 2 weeks (rollback safety)
  - Update documentation and model card

FRONTIER: Multi-Model Orchestration

Router models: The frontier isn't using one LLM for everything. It's using a cheap, fast model (GPT-3.5/Haiku) for simple queries and routing complex ones to expensive models (GPT-4/Opus). Sprint planning for multi-model systems involves optimizing the router as well as the models. This is a new experiment axis that didn't exist two years ago.
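To make the router idea concrete, here is a deliberately simple rule-based sketch. Production routers are often small learned classifiers; the keyword list, the word-count threshold, and the tier names here are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RoutingDecision:
    model_tier: str  # "cheap" or "expensive" (hypothetical tier names)
    reason: str


# Hypothetical hints that a query needs the stronger model
COMPLEX_HINTS = ("refund", "legal", "compare", "why", "step by step")


def route(query: str, max_simple_words: int = 25) -> RoutingDecision:
    """Send long or hint-matching queries to the expensive tier, the rest to cheap."""
    text = query.lower()
    if len(text.split()) > max_simple_words:
        return RoutingDecision("expensive", "long query")
    for hint in COMPLEX_HINTS:
        if hint in text:
            return RoutingDecision("expensive", f"matched hint '{hint}'")
    return RoutingDecision("cheap", "short query with no complexity hints")


print(route("What are your opening hours?"))
print(route("Why was my refund denied?"))
```

Whatever form the router takes, it needs its own eval set: queries labeled with the tier that should have handled them, so that routing accuracy and cost can be tracked per sprint alongside answer quality.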
GenAI Sprint Eval Tracker

Track prompt versions and their eval scores across sprints. Each bar is a prompt version. Click New Prompt Version to simulate iterating on the system prompt.

Your team changes the system prompt to improve the chatbot's handling of refund questions. Refund accuracy goes from 78% to 91%. But you notice the general Q&A accuracy dropped from 89% to 82%. What is the correct sprint action?

Chapter 9: Agentic AI Project Management

Your team is building an AI agent that can research a topic, write a report, and email it to a customer. In testing, the agent works beautifully 85% of the time. The other 15%? It sends emails to the wrong person. It cites sources that don't exist. It writes a report about the wrong topic because it misinterpreted the request. And once, memorably, it entered an infinite loop and sent 47 emails before anyone noticed.

Agentic AI — systems where an LLM takes actions in the real world (calling APIs, executing code, modifying databases, sending communications) — is the most unpredictable type of AI project to manage. The failure modes aren't just "wrong answer." They're "wrong action with real-world consequences."

CONCEPT: Why Agents Are Harder to Manage

Dimension | Traditional ML | Chatbot/LLM | Agentic AI
Failure mode | Wrong prediction | Wrong answer | Wrong ACTION (sends email, deletes data, charges money)
Blast radius | One user sees wrong result | One user gets bad answer | Agent modifies external systems irreversibly
Testing | Eval set, accuracy metrics | Human grading, automated evals | End-to-end trajectory testing, sandbox environments
Debugging | Check model weights, features | Read the prompt, check context | Trace multi-step reasoning across tool calls
Sprint predictability | Low (experiments) | Medium (prompt iteration is fast) | Very low (emergent behavior from tool combinations)
Agent development is iterative and unpredictable. An agent's behavior emerges from the interaction between its prompt, its tools, and its reasoning. You can't predict what an agent will do just by reading its code. You discover behavior by testing — extensively, in sandbox environments, with real-world-like scenarios. Plan your sprints around testing, not building.

DESIGN: Agent Development Sprint Structure

yaml
# agent_sprint_structure.yaml

# Phase 1: Tool Integration (1-2 sprints per tool)
tool_sprints:
  each_tool:
    - Define tool's API contract (input/output schemas)
    - Implement tool with error handling and rate limiting
    - Write unit tests for the tool in isolation
    - Write integration test: agent calls tool correctly
    - Write adversarial test: agent handles tool failure gracefully
    - Safety review: what happens if agent misuses this tool?

# Phase 2: Behavior Testing (ongoing, every sprint)
behavior_testing:
  trajectory_tests:
    - Define 50+ test scenarios with expected action sequences
    - Run agent in sandbox, record full trajectory
    - Grade: correct actions? correct order? no harmful actions?
    - Regression test: adding tool B didn't break tool A behavior

  adversarial_tests:
    - Prompt injection: user tries to make agent do unauthorized actions
    - Edge cases: what if the tool returns an error?
    - Loops: does the agent ever enter infinite tool-calling loops?
    - Scope creep: does the agent stay within its defined capabilities?

# Phase 3: Safety Review Gates
safety_gates:
  before_sandbox: "Agent can only call mock tools"
  before_staging: "Agent calls real tools but in test environment"
  before_production: "Full safety review, rate limits, kill switch"

CODE: Agent Trajectory Logger

python
# agent_trajectory.py — Log and analyze agent action sequences
from dataclasses import dataclass, field
from typing import List, Optional
from datetime import datetime

@dataclass
class AgentStep:
    step_num: int
    thought: str              # Agent's reasoning
    tool_name: str            # Which tool it called
    tool_input: dict          # What it passed to the tool
    tool_output: str          # What the tool returned
    timestamp: datetime = field(default_factory=datetime.now)

@dataclass
class Trajectory:
    task: str
    steps: List[AgentStep] = field(default_factory=list)
    final_output: Optional[str] = None
    success: Optional[bool] = None

    @property
    def tool_sequence(self) -> List[str]:
        return [s.tool_name for s in self.steps]

    @property
    def has_loop(self) -> bool:
        """Detect three consecutive identical tool calls (same tool, same input)."""
        for i in range(len(self.steps) - 2):
            if (self.steps[i].tool_name == self.steps[i+1].tool_name
                    == self.steps[i+2].tool_name):
                if (self.steps[i].tool_input == self.steps[i+1].tool_input
                        == self.steps[i+2].tool_input):
                    return True
        return False

    @property
    def unauthorized_actions(self) -> List[AgentStep]:
        """Flag steps that used tools outside the allowed set."""
        allowed = {"search", "read_doc", "write_report", "send_email"}
        return [s for s in self.steps if s.tool_name not in allowed]
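
The grading questions from the sprint structure (correct actions? correct order? no harmful actions?) can be sketched as a small helper that works on a recorded tool sequence such as the one `tool_sequence` returns. This is a minimal sketch, not the author's grader: `grade_trajectory`, `is_subsequence`, and the `delete_record` tool name are hypothetical, and "correct order" is assumed to mean that the expected tools appear in the same relative order, with extra steps allowed in between.

```python
# trajectory_grader.py: hypothetical grading helper (all names are illustrative)
from typing import Dict, List, Set


def is_subsequence(expected: List[str], actual: List[str]) -> bool:
    """True if the expected tools appear in `actual` in the same relative order."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator up to the first match,
    # so each expected tool must be found after the previous one.
    return all(tool in remaining for tool in expected)


def grade_trajectory(actual: List[str],
                     expected: List[str],
                     forbidden: Set[str]) -> Dict[str, object]:
    """Answer the three grading questions for one recorded tool sequence."""
    missing = [tool for tool in expected if tool not in actual]
    harmful = [tool for tool in actual if tool in forbidden]
    in_order = is_subsequence(expected, actual)
    return {
        "correct_actions": not missing,   # every expected tool was called
        "correct_order": in_order,        # and in the expected relative order
        "harmful_actions": harmful,       # forbidden tools that were called
        "passed": not missing and in_order and not harmful,
    }


result = grade_trajectory(
    actual=["search", "read_doc", "search", "write_report"],
    expected=["search", "read_doc", "write_report"],
    forbidden={"delete_record"},
)
print(result["passed"])
```

Allowing extra steps keeps the test from failing every time the agent takes a harmless detour; a stricter team could compare the two lists for exact equality instead.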

Multi-Agent Coordination

Some systems use multiple agents that collaborate: a planner agent, a researcher agent, a writer agent, and a reviewer agent. Coordinating multi-agent systems in sprints requires treating agent interactions as integration points:

| Sprint activity | Single agent | Multi-agent |
| --- | --- | --- |
| Testing | Test one agent's behavior | Test agent HANDOFFS: does agent A's output format match agent B's expected input? |
| Debugging | Read one trajectory | Trace across agent boundaries: which agent introduced the error? |
| Planning | One set of capabilities | Dependency graph: agent C can't be developed until agent A's API stabilizes |
| Safety | One agent's action space | Emergent behavior: agents A and B are safe alone, but together they escalate privileges |
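
A handoff test from the table above can be as simple as checking one agent's output against the fields the next agent expects. The sketch below assumes a researcher agent handing off to a writer agent; the contract, the field names, and `check_handoff` are hypothetical, and a real system would more likely use a schema library than bare type checks.

```python
from typing import Any, Dict, List

# Hypothetical contract: fields the writer agent expects from the researcher agent.
WRITER_INPUT_CONTRACT: Dict[str, type] = {
    "topic": str,
    "sources": list,
    "summary": str,
}


def check_handoff(payload: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """Return a list of contract violations; an empty list means the handoff is valid."""
    problems = []
    for field_name, expected_type in contract.items():
        if field_name not in payload:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(payload[field_name], expected_type):
            actual = type(payload[field_name]).__name__
            problems.append(
                f"{field_name}: expected {expected_type.__name__}, got {actual}"
            )
    return problems


# The researcher returned a single URL string where the writer expects a list.
researcher_output = {"topic": "Q3 churn", "sources": "https://example.com", "summary": "..."}
print(check_handoff(researcher_output, WRITER_INPUT_CONTRACT))
# prints ['sources: expected list, got str']
```

Running a check like this in CI for every agent pair turns "agent A's API stabilizes" into something the board can track as a concrete exit gate.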

DEBUG: Agent Failure Analysis

| Failure mode | How to detect | How to fix |
| --- | --- | --- |
| Infinite loop | Step count exceeds max (e.g., 20 steps) | Hard step limit + loop detection in trajectory logger |
| Wrong tool selection | Trajectory shows agent used "delete" when it should have used "update" | Better tool descriptions, few-shot examples in prompt |
| Scope creep | Agent performs actions not requested by user | Explicit instruction: "Only perform actions the user specifically requested" |
| PII exposure | Agent passes customer data to external tool | PII filter on all tool inputs. Block tool calls containing PII patterns. |
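
Two of the fixes above, the hard step limit and the PII filter on tool inputs, can be combined into one guard that runs before every tool call. This is a sketch under stated assumptions: `guard_tool_call`, `find_pii`, and the two regular expressions are illustrative only, and a production PII filter needs a much broader, tested pattern set.

```python
import re
from typing import Dict, List

MAX_STEPS = 20  # mirrors the example limit in the table above

# Illustrative patterns only; real PII detection needs far wider coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(tool_input: Dict[str, object]) -> List[str]:
    """Return the names of the PII patterns found in any value of the tool input."""
    text = " ".join(str(value) for value in tool_input.values())
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]


def guard_tool_call(step_num: int, tool_input: Dict[str, object]) -> str:
    """Decide whether a tool call may proceed.

    Returns 'halt_step_limit', 'block_pii', or 'allow'.
    """
    if step_num > MAX_STEPS:
        return "halt_step_limit"
    if find_pii(tool_input):
        return "block_pii"
    return "allow"


print(guard_tool_call(3, {"body": "Contact jane.doe@example.com for details"}))
# prints block_pii
```

Checking the step limit first means a runaway agent is stopped even when its inputs look clean.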

FRONTIER: Self-Improving Agents

Agents that learn from failures: The frontier is agents that review their own trajectories, identify failure patterns, and propose prompt improvements. After a sprint of agent testing, the agent itself writes a "retrospective" analyzing its failures and suggesting fixes. The Scrum Master reviews these suggestions alongside the team. This is meta-agility — the agent participates in its own improvement process.
Agent Trajectory Viewer

Watch an AI agent execute a multi-step task. Each node is a tool call. Green = successful step, red = failure, yellow = loop detection. Click Run Agent to simulate an execution.

Your AI agent passes all 200 test scenarios in the sandbox. The team wants to deploy to production. What is the critical step before production deployment?

Chapter 10: Risk Management for AI

It's 2 AM. PagerDuty fires. Your production model's accuracy has dropped from 91% to 67% over the last 6 hours. Customer complaints are flooding in. The support team is escalating. You check the monitoring dashboard: the model itself hasn't changed, but the input distribution has. A viral social media post is driving a new type of question your model was never trained to handle. This is data drift, and it's one of the most common production failures in AI systems.

CONCEPT: The AI Risk Taxonomy

AI systems face risks that traditional software doesn't. Your sprint process must include explicit checkpoints for each category:

| Risk category | Examples | Sprint checkpoint |
| --- | --- | --- |
| Model regression | New model version performs worse on a subpopulation | Full eval suite before every model promotion |
| Data drift | Input distribution changes in production vs. training | Weekly distribution monitoring, alerting on drift metrics |
| Safety incidents | Toxic output, hallucinated facts, PII leakage | Safety eval gate before deployment + continuous monitoring |
| Bias detection | Model performs worse for certain demographics | Fairness eval: slice metrics by demographic category |
| Compliance | EU AI Act, GDPR data usage, industry regulations | Legal review gate before launch, quarterly compliance audit |
| Operational | GPU shortage, training failure, cost overrun | Compute budget tracking, cost alerts |
Responsible AI is a sprint activity, not a one-time review. Don't save the safety review for the end. Embed responsible AI checkpoints throughout the sprint: bias checks during data preparation, safety evals during experimentation, fairness audits during promotion, and monitoring after deployment. Every sprint review should include a "responsible AI update."
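
The "slice metrics" checkpoint can be made concrete with a few lines of code: compute accuracy per slice, then block promotion when the best and worst slices are too far apart. The sketch below is hypothetical (`accuracy_by_slice` and `fairness_gate` are illustrative names); the 5-point default gap mirrors the threshold used in RISK-003 of the risk register, and slicing by language is just one possible choice.

```python
from typing import Dict, List, Tuple


def accuracy_by_slice(records: List[Tuple[str, bool]]) -> Dict[str, float]:
    """records holds (slice_name, prediction_was_correct) pairs; returns accuracy per slice."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for slice_name, is_correct in records:
        totals[slice_name] = totals.get(slice_name, 0) + 1
        correct[slice_name] = correct.get(slice_name, 0) + int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}


def fairness_gate(per_slice: Dict[str, float], max_gap: float = 0.05) -> Dict[str, object]:
    """Block promotion when the best and worst slices differ by more than max_gap.

    per_slice must contain at least one slice.
    """
    best = max(per_slice, key=per_slice.get)
    worst = min(per_slice, key=per_slice.get)
    gap = per_slice[best] - per_slice[worst]
    return {"best": best, "worst": worst, "gap": round(gap, 4), "blocked": gap > max_gap}


# Toy eval results: 9 of 10 correct in English, 8 of 10 correct in Spanish.
records = ([("en", True)] * 9 + [("en", False)]
           + [("es", True)] * 8 + [("es", False)] * 2)
print(fairness_gate(accuracy_by_slice(records)))
```

With slices this small the gap is mostly noise, so in practice the gate only makes sense once each slice has enough examples to be trusted.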

DESIGN: The AI Risk Register

yaml
# ai_risk_register.yaml — Maintained by AI Scrum Master

risks:
  - id: RISK-001
    category: data_drift
    description: "Input distribution shifts as product usage changes"
    likelihood: high
    impact: high
    current_status: mitigated
    mitigation:
      - "Weekly drift detection (PSI on top 20 features)"
      - "Alert when PSI > 0.2 on any feature"
      - "Retrain pipeline: triggered manually, evaluated automatically"
    sprint_checkpoint: "Weekly drift report in standup (Monday)"
    owner: "ML Engineer (Sarah)"

  - id: RISK-002
    category: safety
    description: "LLM generates harmful content to vulnerable users"
    likelihood: medium
    impact: critical
    current_status: mitigated
    mitigation:
      - "Content safety classifier on all outputs (Llama Guard)"
      - "Block + log any output classified as harmful"
      - "Monthly adversarial red-team testing"
    sprint_checkpoint: "Safety metrics in every sprint review"
    owner: "AI Safety Lead (Marcus)"

  - id: RISK-003
    category: bias
    description: "Model performs worse for non-English speakers"
    likelihood: high
    impact: high
    current_status: monitoring
    mitigation:
      - "Eval suite includes multi-language test set"
      - "Accuracy sliced by language in every eval run"
      - "If gap > 5% between languages, block deployment"
    sprint_checkpoint: "Fairness metrics in eval dashboard"
    owner: "Data Scientist (Priya)"

  - id: RISK-004
    category: compliance
    description: "EU AI Act requires transparency for high-risk AI"
    likelihood: certain
    impact: medium
    current_status: in_progress
    mitigation:
      - "Model card documenting capabilities and limitations"
      - "Human-in-the-loop for high-stakes decisions"
      - "Audit trail: log all model inputs, outputs, and decisions"
    sprint_checkpoint: "Quarterly compliance review with legal"
    owner: "AI Scrum Master (You)"
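
To prioritize the register, one option is to score each risk as likelihood × impact and sort. The numeric scales below are an assumption (the register defines only the labels), and `severity` and `rank_risks` are hypothetical helper names.

```python
from typing import Dict, List

# Assumed ordinal scales; the register itself only defines the labels.
LIKELIHOOD = {"low": 1, "medium": 2, "high": 3, "certain": 4}
IMPACT = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def severity(risk: Dict[str, str]) -> int:
    """Severity score = likelihood x impact on the assumed 1-4 scales."""
    return LIKELIHOOD[risk["likelihood"]] * IMPACT[risk["impact"]]


def rank_risks(risks: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Highest severity first; ties keep their register order (sorted is stable)."""
    return sorted(risks, key=severity, reverse=True)


# Likelihood and impact values copied from the register above.
register = [
    {"id": "RISK-001", "likelihood": "high", "impact": "high"},
    {"id": "RISK-002", "likelihood": "medium", "impact": "critical"},
    {"id": "RISK-003", "likelihood": "high", "impact": "high"},
    {"id": "RISK-004", "likelihood": "certain", "impact": "medium"},
]
for risk in rank_risks(register):
    print(risk["id"], severity(risk))
```

On these scales the two high/high risks score 9 and the other two score 8. A different scale could reorder them, so the team should agree on the numbers before relying on the ranking.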

CODE: Drift Detection Script

python
# drift_detector.py — Monitor input distribution changes
import numpy as np
from typing import Dict

def population_stability_index(expected: np.ndarray,
                                actual: np.ndarray,
                                bins: int = 10) -> float:
    """Calculate PSI between training and production distributions.
    PSI < 0.1: no significant change
    PSI 0.1-0.2: moderate change, investigate
    PSI > 0.2: significant change, retrain needed"""
    # Bin the distributions
    breakpoints = np.linspace(
        min(expected.min(), actual.min()),
        max(expected.max(), actual.max()),
        bins + 1
    )
    expected_pcts = np.histogram(expected, breakpoints)[0] / len(expected)
    actual_pcts = np.histogram(actual, breakpoints)[0] / len(actual)
    # Avoid division by zero
    expected_pcts = np.clip(expected_pcts, 0.001, None)
    actual_pcts = np.clip(actual_pcts, 0.001, None)
    # PSI formula
    psi = np.sum((actual_pcts - expected_pcts) *
                 np.log(actual_pcts / expected_pcts))
    return float(psi)

def check_drift(training_data: Dict[str, np.ndarray],
                production_data: Dict[str, np.ndarray],
                threshold: float = 0.2) -> Dict:
    """Check all features for drift. Returns alert if any exceed threshold."""
    results = {}
    alerts = []
    for feature in training_data:
        if feature not in production_data:
            continue
        psi = population_stability_index(
            training_data[feature], production_data[feature]
        )
        status = "OK" if psi < 0.1 else "WARN" if psi < threshold else "ALERT"
        results[feature] = {"psi": round(psi, 4), "status": status}
        if status == "ALERT":
            alerts.append(f"{feature}: PSI={psi:.3f}")
    return {"features": results, "alerts": alerts,
            "needs_retrain": len(alerts) > 0}

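To get a feel for the PSI thresholds, it helps to run the calculation on synthetic data. The snippet below is a sketch: `psi` is a condensed copy of `population_stability_index` so the example runs on its own, and the three normal distributions (training, an ordinary week, a week with a one-standard-deviation shift) are made-up stand-ins for a real feature.

```python
import numpy as np


def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Condensed copy of population_stability_index, for a self-contained demo."""
    edges = np.linspace(min(expected.min(), actual.min()),
                        max(expected.max(), actual.max()), bins + 1)
    e = np.clip(np.histogram(expected, edges)[0] / len(expected), 0.001, None)
    a = np.clip(np.histogram(actual, edges)[0] / len(actual), 0.001, None)
    return float(np.sum((a - e) * np.log(a / e)))


rng = np.random.default_rng(seed=0)
training = rng.normal(loc=0.0, scale=1.0, size=5000)       # training-time feature values
ordinary_week = rng.normal(loc=0.0, scale=1.0, size=5000)  # same distribution
shifted_week = rng.normal(loc=1.0, scale=1.0, size=5000)   # mean moved by one std dev

psi_stable = psi(training, ordinary_week)
psi_shifted = psi(training, shifted_week)
print(f"stable: {psi_stable:.3f}  shifted: {psi_shifted:.3f}")
```

The stable week should land well under the 0.1 "no significant change" band and the shifted week well above the 0.2 alert threshold, which is the contrast the weekly drift report is meant to surface.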
DEBUG: Incident Response Playbook

| Incident type | Detection | Immediate action | Root cause |
| --- | --- | --- | --- |
| Accuracy drop > 10% | Monitoring alert | Rollback to previous model | Check for data drift, eval set contamination, infra issue |
| Toxic output reported | Content filter log, customer report | Add input to block list, escalate to safety team | Adversarial input, training data contamination, filter gap |
| PII in model output | PII scanner on outputs | Kill the response, notify affected user, log for compliance | PII in training data (data pipeline failure) |
| Agent performs unauthorized action | Trajectory logger, audit trail | Disable agent, review all recent actions | Prompt injection, missing guardrails, tool permission error |

FRONTIER: Automated Red-Teaming

LLM-powered adversarial testing: Use one LLM to attack another. The "red team" LLM generates adversarial prompts designed to make your production model fail (produce toxic content, hallucinate, leak data). Run this automatically every sprint. Related tools: Giskard (vulnerability scanning for ML and LLM applications), Microsoft Counterfit (adversarial attacks on ML models), and NVIDIA NeMo Guardrails (runtime guardrails, which defend against attacks rather than generate them). This is becoming a standard sprint activity for responsible AI teams.
AI Risk Heatmap

Visualize your AI project's risk landscape. Each cell represents a risk category. Color intensity shows severity (likelihood × impact). Click risk categories to toggle mitigations.

Your production model's accuracy dropped from 91% to 78% overnight. The model weights haven't changed. What is the most likely cause and first diagnostic step?

Chapter 11: Interactive AI Sprint Board

Everything we've discussed comes together here. This is a living Kanban board designed for AI teams. It has the columns we defined in Chapter 2: Hypothesis, Experiment, Evaluation, Production, and Killed. But it also simulates the chaos of real AI sprints — blockers appear, experiments fail, stakeholders change scope, and GPUs run out.

This simulation teaches the core skill of an AI Scrum Master: reading the board, identifying bottlenecks, and making decisions under uncertainty. Watch where cards pile up. That's your bottleneck. Watch what happens when you inject a blocker. That's your rehearsal for the real thing.

How to Use the Simulation

| Control | What it does | What to watch |
| --- | --- | --- |
| Advance Sprint | Moves time forward. Cards progress through columns based on probability. | Watch how experiments flow. Most should end up in "Killed"; that's healthy. |
| Add Experiment | Adds a new hypothesis card to the board. | Watch if the board gets overloaded. Too many active experiments = WIP limit exceeded. |
| Data Quality Issue | Injects a data blocker. Experiments in "Experiment" stage stall. | Watch the cascading effect: blocked experiments push back the entire sprint. |
| GPU Shortage | Injects a compute blocker. Only 1 experiment can run at a time. | Watch how the queue backs up. This is why compute planning matters. |
| Eval Regression | A promoted model fails regression testing. Bounces back to "Evaluation." | Watch the cost of late-stage failure. All the downstream work is wasted. |
| Scope Change | Stakeholder adds new requirements mid-sprint. | Watch how scope creep disrupts the experiment pipeline. |
AI Sprint Board Simulation

A Kanban board for AI teams. Cards are experiments flowing through stages. Inject blockers to simulate real-world disruptions. Watch how the sprint adapts.

The Architecture Behind the Simulation

Every column in the simulation maps to a real workflow stage. Here's the production sprint board configuration:

yaml
# ai_kanban_config.yaml

columns:
  hypothesis:
    wip_limit: 5
    card_fields: [hypothesis, success_criteria, compute_budget]
    exit_gate: "Dataset ready, baseline measured"

  experiment:
    wip_limit: 3    # Limited by GPU availability
    card_fields: [training_status, current_metric, compute_used]
    exit_gate: "Training complete, results logged"
    blockers: [data_quality, gpu_shortage, training_divergence]

  evaluation:
    wip_limit: 4
    card_fields: [eval_results, regression_check, safety_check]
    exit_gate: "Kill/iterate/promote decision made + documented"

  production:
    wip_limit: 2    # Don't deploy too many models at once
    card_fields: [deploy_status, monitoring_status, rollback_tested]
    exit_gate: "Model serving in production, monitoring active"

  killed:
    wip_limit: none
    card_fields: [final_metric, kill_reason, learnings]
    exit_gate: none  # Terminal state

# Sprint metrics derived from board state:
metrics:
  throughput: "Experiments completed (killed + promoted) per sprint"
  cycle_time: "Average days from Hypothesis to Decision"
  kill_rate: "% of experiments killed (healthy: 60-80%)"
  promotion_rate: "% of experiments promoted (healthy: 10-30%)"
  blocker_frequency: "Blockers injected per sprint (track over time)"
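
The board metrics above are simple to compute once every card carries its Evaluation decision. The sketch below makes two assumptions the config does not state: each card's decision is one of `"kill"`, `"iterate"`, `"promote"`, or `None` for cards still in flight, and the rates use decided cards as the denominator. `sprint_metrics` and `is_healthy` are hypothetical names.

```python
from collections import Counter
from typing import Dict, List, Optional


def sprint_metrics(decisions: List[Optional[str]]) -> Dict[str, float]:
    """Compute throughput, kill rate, and promotion rate from per-card decisions."""
    counts = Counter(d for d in decisions if d is not None)
    decided = sum(counts.values())
    killed, promoted = counts["kill"], counts["promote"]
    return {
        "throughput": killed + promoted,  # experiments completed this sprint
        "kill_rate": killed / decided if decided else 0.0,
        "promotion_rate": promoted / decided if decided else 0.0,
    }


def is_healthy(metrics: Dict[str, float]) -> bool:
    """Check the healthy ranges from the config: kill 60-80%, promotion 10-30%."""
    return (0.60 <= metrics["kill_rate"] <= 0.80
            and 0.10 <= metrics["promotion_rate"] <= 0.30)


# Toy sprint: 7 killed, 1 sent back to iterate, 2 promoted, 3 still in flight.
decisions = ["kill"] * 7 + ["iterate"] + ["promote"] * 2 + [None] * 3
metrics = sprint_metrics(decisions)
print(metrics, is_healthy(metrics))
```

Whichever denominator the team picks, it should stay fixed from sprint to sprint, since the trend matters more than any single value.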

Interview Whiteboard Version

In an interview, you have 5 minutes to draw this board on a whiteboard. Here are the key talking points:

Hypothesis → Experiment
Gate: data ready + baseline measured. WIP limit: 3 active experiments (GPU constraint). Key metric: cycle time.
Experiment → Evaluation
Gate: training complete + results logged. Automated eval suite runs on completion. Blockers: data quality, GPU shortage.
Evaluation → Kill / Iterate / Promote
Decision gate: meets criteria? Healthy kill rate: 60-80%. Every kill is documented with learnings.
Production
Deployment with canary rollout. Monitoring active. Rollback tested. Model card written.
Whiteboard tips: (1) Draw the columns left-to-right with WIP limits. (2) Show the feedback loop: an "iterate" decision in Evaluation sends the card back to Experiment. (3) Emphasize the "Killed" column: it's a feature, not a failure. (4) Talk about blockers: where they hit, how you detect them, how you unblock. (5) End with metrics: throughput, cycle time, kill rate.

Chapter 12: Interview Arsenal

This chapter distills everything into a cheat sheet you can review in the 30 minutes before your interview. Every section maps to a common interview question type for AI Scrum Master / Technical Program Manager roles.

Scenario Questions

| Scenario | Key points to cover | Chapter |
| --- | --- | --- |
| "Your AI team hasn't shipped anything in 3 sprints" | Diagnose: Are experiments running but failing? (Healthy.) Are experiments not starting? (Blocker.) Is the team afraid to kill experiments? (Process.) Switch from outcome-based to learning-based velocity. | 1, 2 |
| "The researcher says it will work, the engineer says it won't scale" | This is the research-to-prod gap. Run a production readiness checklist. Time-box the scalability investigation to 1 sprint. If it can't scale, it can't ship. | 6 |
| "Stakeholders want to launch the chatbot but accuracy is only 85%" | Frame the decision: what does 15% wrong look like? Show concrete failure examples. Quantify the business cost of errors vs. the cost of delay. Propose: launch with human-in-the-loop for low-confidence answers. | 3, 7 |
| "Design the sprint process for a new GenAI feature" | Hypothesis-driven tickets, eval-driven acceptance, prompt version control, RAG pipeline as experiment axis, safety review gate before deployment. | 2, 8 |
| "How do you manage risk for an AI agent in production?" | Trajectory logging, adversarial testing, rate limits, kill switch, human-in-the-loop for high-stakes actions, continuous monitoring of agent behavior. | 9, 10 |

Common Interview Formats

| Format | Duration | What they test | How to prepare |
| --- | --- | --- | --- |
| Case study | 45-60 min | Given a scenario, design the sprint process | Practice with the scenarios above. Draw boards. |
| Behavioral | 30-45 min | "Tell me about a time you managed an AI project" | STAR method: Situation, Task, Action, Result. Quantify results. |
| Technical | 30-45 min | "Explain MLOps, eval pipelines, data versioning" | Review Chapters 3-6. Know the tools: W&B, MLflow, DVC. |
| Stakeholder sim | 30 min | Interviewer plays a frustrated PM or confused exec | Practice the translation framework from Chapter 7. |
| Whiteboard | 30-45 min | Draw the AI sprint board, experiment lifecycle | Practice drawing Chapter 11's board in 5 minutes. |

Recommended Certifications and Reading

| Resource | Type | Why it matters |
| --- | --- | --- |
| PSM I / PSM II (Scrum.org) | Certification | Baseline Scrum knowledge. Employers expect it. |
| SAFe Agilist | Certification | Enterprise-scale agile. Useful for large AI organizations. |
| "Accelerate" (Forsgren et al.) | Book | DORA metrics, deployment frequency, lead time. Apply to ML. |
| "Designing Machine Learning Systems" (Huyen) | Book | A widely recommended MLOps book. Covers the full production lifecycle. |
| "AI Engineering" (Huyen) | Book | Practical guide to GenAI systems. Eval, RAG, prompt engineering. |
| MLOps Community | Community | Slack + meetups. Stay current on tooling and practices. |
| Google "Rules of ML" | Guide | Martin Zinkevich's 43 rules. Timeless wisdom for ML projects. |

Cheat Sheet: The Five Dimensions

Interview Dimension Explorer

Click each dimension to see the key topics you should be able to discuss in an interview. This is your study guide.

Your 30-Second Elevator Pitch

When asked "Why should we hire you as an AI Scrum Master?", here's the structure:

text
"I've managed AI teams where 80% of experiments fail — and that's healthy.
My approach differs from traditional Scrum in three ways:

1. HYPOTHESIS-DRIVEN tickets instead of user stories. Each sprint,
   we commit to running N experiments, not shipping N features.
   Success is measured in learning velocity, not story points.

2. EVAL-DRIVEN acceptance. Every model change runs through an
   automated eval suite with statistical significance testing.
   We don't ship on vibes — we ship on data.

3. STAKEHOLDER TRANSLATION. I convert 'the model is 87% accurate'
   into '13 out of 100 customers get wrong answers, costing us
   $X per week.' Leadership makes decisions on business impact,
   not ML metrics.

I also embed responsible AI throughout the sprint — bias checks,
safety evals, drift monitoring — not as an afterthought but as
standard sprint activities."
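
If the interviewer probes the "statistical significance testing" claim in point 2, it helps to have one concrete test in mind. The sketch below uses a two-proportion z-test built from the standard library only; `two_proportion_z_test` is a hypothetical helper, and it assumes the two models were scored on independent samples. When both models run on the same eval set, a paired test such as McNemar's is usually the better fit.

```python
import math
from typing import Tuple


def two_proportion_z_test(correct_a: int, n_a: int,
                          correct_b: int, n_b: int) -> Tuple[float, float]:
    """Two-sided z-test for a difference in accuracy between models A and B.

    Returns (z, p_value). Assumes independent samples and reasonably large n.
    """
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Both models are 0% or 100% accurate: no measurable difference.
        return 0.0, 1.0
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Baseline: 870 of 1000 correct. Candidate: 900 of 1000 correct.
z, p = two_proportion_z_test(870, 1000, 900, 1000)
print(f"z={z:.2f}, p={p:.3f}")
```

In this example the 3-point gain clears the conventional 0.05 threshold on 1,000 examples per model; the same gain on a few hundred examples would not, which is the practical argument for sizing the eval set before the sprint starts.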
In an interview, you're asked: "What metric do you use to measure AI team velocity?" What is the best answer?