Interview prep for leading AI teams: experiment-driven sprints, eval pipelines, stakeholder communication, and shipping models to production.
It is Monday morning. Your ML team's two-week sprint just ended. The results: one experiment hit target accuracy but needs 4x the GPU budget to serve in production. Another experiment failed completely — the model overfit to training data and generalizes poorly. A third experiment is "promising" but the researcher wants two more weeks to try a different architecture. The product manager is asking why the chatbot feature isn't ready for the demo on Friday. The VP wants a "timeline to ship."
You are the person who makes sense of this chaos. Not by writing code. Not by training models. But by creating the environment where a team of brilliant, sometimes chaotic researchers and engineers can do their best work — and ship it.
This is not traditional Scrum. Traditional Scrum assumes you can break work into user stories, estimate them in story points, and deliver a predictable increment every two weeks. AI projects violate every one of those assumptions. Experiments fail 80% of the time. "Done" is a spectrum of accuracy, not a checkbox. Training runs take days, not hours. And the most important work often looks like someone staring at a loss curve for three hours.
The AI Scrum Master is a new kind of role that sits at the intersection of technical program management, research facilitation, and organizational translation. You don't need to be able to write a transformer from scratch. But you need to understand enough to know when an experiment is stuck, when a researcher is chasing a dead end, and when a "90% accurate" model is actually terrible for the use case.
| Responsibility | What you own | Daily intersection |
|---|---|---|
| Sprint Facilitation | Experiment-based planning, hypothesis-driven tickets, adaptive velocity | Every standup, every planning session, every retro |
| Experiment Governance | Kill/continue decisions, resource allocation, experiment tracking visibility | Researchers need GPUs, stakeholders need progress reports |
| Data Pipeline Coordination | Data readiness, annotation quality, labeling vendor management | Data is the #1 blocker — you unblock it |
| Research-to-Prod Bridge | Handoff checklists, technical debt tracking, ML engineering coordination | The notebook-to-production gap kills more AI projects than bad models |
| Stakeholder Translation | Uncertainty communication, timeline management, expectation setting | Executives don't understand why "the model isn't ready yet" |
Here's what a typical Wednesday looks like for an AI Scrum Master at a GenAI startup:
| Time | Activity | Skills used |
|---|---|---|
| 9:00 AM | Check overnight training runs: one converged, one diverged (NaN loss at epoch 47) | Experiment tracking, debugging |
| 9:30 AM | Standup: researcher reports eval regression after data refresh, ML engineer blocked on GPU quota | Sprint facilitation, resource management |
| 10:00 AM | Triage the eval regression: new labeling batch has quality issues, escalate to vendor | Data pipeline management |
| 11:00 AM | Sprint planning prep: write hypothesis-driven tickets for next sprint's RAG experiments | Experiment-based planning |
| 12:00 PM | Stakeholder sync: explain why the chatbot needs another sprint — it hallucinates on edge cases | Stakeholder communication |
| 2:00 PM | Review handoff checklist: model is ready for production but needs latency optimization | Research-to-prod bridge |
| 3:00 PM | Safety review: new agent can now execute code — needs guardrails before deployment | Risk management, responsible AI |
| 4:00 PM | Retro: discuss why the last sprint's velocity was half the forecast (GPU shortage + scope creep) | Process improvement |
The diagram below shows the flow of work in an AI team. Unlike a traditional software pipeline where features flow left-to-right with predictable completion, experiments flow through a funnel where most are killed. This is healthy.
Staff-level interviews test you across five dimensions. Each chapter in this lesson maps to one or more:
| Dimension | What they ask | Chapters |
|---|---|---|
| CONCEPT | "Explain why story points don't work for ML research" | All |
| DESIGN | "Design a sprint process for a team building an LLM-powered product" | 0, 1, 2, 8, 11 |
| CODE | "Show me the Jira board / experiment tracker / eval dashboard you'd set up" | 2, 4, 5, 6, 8 |
| DEBUG | "Your team hasn't shipped anything in 3 sprints. Diagnose the problem." | 3, 7, 10 |
| FRONTIER | "How will AI-assisted project management change this role?" | All |
A product manager shows you a roadmap: "Q1: build the model. Q2: integrate it. Q3: launch." You know immediately this will fail. AI projects don't work in sequential phases. The lifecycle is a loop, not a line — and the most important skill is knowing when to exit the loop.
The AI project lifecycle looks like this: research, prototype, evaluate, iterate, and — only when evaluation meets production criteria — ship. But here's what makes it different from traditional software: you might loop through research-prototype-evaluate fifteen times before anything is production-ready. And three of those loops might end in "this approach fundamentally doesn't work, start over."
Every sprint, you face a decision borrowed from reinforcement learning: explore (try new approaches, architectures, datasets) or exploit (optimize what's already working). Early in a project, exploration should dominate, at roughly 80-90% of the team's capacity. As the deadline approaches, the ratio flips to roughly 10-20% explore, with the rest spent on exploiting.
The Scrum Master's job is to manage this ratio explicitly. If the team is exploring too late, they'll never ship. If they're exploiting too early, they'll ship a mediocre model because they never found the right approach.
| Phase | Explore:Exploit | Sprint focus | Your role |
|---|---|---|---|
| Discovery (0-4 weeks) | 90:10 | Wide search — try 5 approaches, kill 4 | Protect research time, resist pressure to "pick one and go" |
| Convergence (4-8 weeks) | 50:50 | 2-3 approaches competing, deeper experiments | Set decision gates, track eval metrics, prepare kill criteria |
| Optimization (8-12 weeks) | 10:90 | One approach, hyperparameter tuning, edge cases | Track diminishing returns, push for production readiness |
| Productionization (12-16 weeks) | 5:95 | Latency, reliability, monitoring, deployment | Coordinate ML eng + infra, handoff checklists, launch planning |
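The phase table above can be read as a lookup from project week to a recommended split. The sketch below is one way to encode it, assuming a 16-week plan and treating each phase's upper bound as exclusive (the table's ranges overlap at weeks 4, 8, and 12). The names `PHASES`, `explore_exploit_split`, and `explore_hours` are illustrative, not part of any tool.

```python
from typing import Tuple

# (phase name, first week inclusive, last week exclusive, explore %, exploit %)
# Percentages come from the phase table; the boundary handling is an assumption.
PHASES = [
    ("Discovery", 0, 4, 90, 10),
    ("Convergence", 4, 8, 50, 50),
    ("Optimization", 8, 12, 10, 90),
    ("Productionization", 12, 16, 5, 95),
]


def explore_exploit_split(week: int) -> Tuple[str, int, int]:
    """Return (phase, explore %, exploit %) for a 0-indexed project week."""
    if week < 0:
        raise ValueError("week must be non-negative")
    for name, start, end, explore, exploit in PHASES:
        if start <= week < end:
            return name, explore, exploit
    # Past the planned 16 weeks: keep the final phase's split.
    name, _, _, explore, exploit = PHASES[-1]
    return name, explore, exploit


def explore_hours(week: int, team_hours: float) -> float:
    """Hours of sprint capacity to reserve for exploration in a given week."""
    _, explore, _ = explore_exploit_split(week)
    return team_hours * explore / 100
```

For a team with 200 hours of capacity in week 9, `explore_hours(9, 200)` reserves 20 hours for exploration. If the team is logging far more than that on new architectures, that is a signal to raise in planning.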
Traditional waterfall assumes you can specify requirements upfront, build to those requirements, and test against them. AI violates all three assumptions:
```yaml
# Traditional software: requirements are deterministic
requirement: "When user clicks 'Submit', save form data to database"
test: "Assert DB contains the submitted data after click"
estimate: "3 story points (1-2 days)"
---
# AI project: requirements are probabilistic
requirement: "Chatbot answers customer questions accurately"
test: "What does 'accurately' mean? 90%? 95%? On which questions? Measured how? By whom? What's the baseline?"
estimate: "Unknown. Could be 2 weeks or 6 months depending on data quality, model choice, and what 'accurate' means."
```
Before any sprint planning, create an AI Project Canvas — a one-page document that aligns the team on what success looks like:
```markdown
# AI Project Canvas: Customer Support Chatbot

## Business Objective
Reduce ticket volume by 30% by deflecting common questions to AI.

## Success Metrics (ordered by priority)
1. Deflection rate: % of conversations resolved without human handoff
2. Customer satisfaction: CSAT score >= 4.2/5 on AI-handled conversations
3. Accuracy: Correct answer rate >= 92% on eval set (500 questions)
4. Latency: p95 response time < 2 seconds

## Data Availability
- 50K historical support tickets (labeled by category, resolution)
- 2K manually curated Q&A pairs for eval
- Real-time access to product docs (RAG source)
- GAP: No labeled "bad answer" examples for safety eval

## Technical Constraints
- Budget: $5K/month inference cost (rules out GPT-4 at scale)
- Latency: Must respond in < 2s (rules out chain-of-thought with 5 LLM calls)
- Privacy: No customer PII sent to external APIs (rules out OpenAI for EU customers)
- Infra: Kubernetes cluster with 4x A100 GPUs available

## Known Risks
- Hallucination on product-specific questions (mitigation: RAG + citations)
- Data drift as product changes (mitigation: weekly eval re-runs)
- Regulatory review needed before launch (blocker: 3-week legal review)

## Time Box
- Discovery: 3 sprints (6 weeks)
- Ship MVP: Sprint 5 (week 10)
- Kill criteria: If accuracy < 80% after sprint 3, pivot approach
```
```python
# ai_lifecycle_tracker.py: track project phase and experiment kill rules
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional, Tuple


class Phase(Enum):
    DISCOVERY = "discovery"
    CONVERGENCE = "convergence"
    OPTIMIZATION = "optimization"
    PRODUCTION = "production"


@dataclass
class Experiment:
    hypothesis: str
    status: str = "active"  # active | killed | promoted
    start_date: Optional[datetime] = None  # None until the run starts
    compute_budget_hrs: float = 0
    compute_used_hrs: float = 0
    best_metric: float = 0.0
    metric_history: List[float] = field(default_factory=list)

    def should_kill(self, target_metric: float) -> Tuple[bool, str]:
        """Check the kill rules in order and return (kill?, reason)."""
        # Rule 1: Over budget
        if self.compute_used_hrs > 2 * self.compute_budget_hrs:
            return True, "2x compute budget exceeded"
        # Rule 2: Stuck below target, with the last 3 runs within 1% of each other
        # (a plateau at or above target is a promotion candidate, not a kill)
        if len(self.metric_history) >= 3 and self.best_metric < target_metric:
            recent = self.metric_history[-3:]
            if max(recent) - min(recent) < 0.01:
                return True, "Metric plateau (3 runs within 1%)"
        # Rule 3: Gap too large
        gap = target_metric - self.best_metric
        if self.best_metric > 0 and gap > 0.15:
            return True, f"Gap to target ({gap:.1%}) too large"
        return False, "Continue"
```
Signs your project is stuck in the wrong phase:
| Symptom | Likely cause | Fix |
|---|---|---|
| Still exploring after 8 weeks | No kill criteria defined — team can't commit | Set a hard decision gate: "By sprint 5, we pick one approach" |
| Optimizing a 75% model to 77% | Wrong phase — should still be exploring | Step back. Is 77% good enough? If not, a different architecture might get 90% |
| Never shipping "because we can improve it" | Perfectionism / fear of production failures | Ship with guardrails. A 90% model with good fallbacks beats a 95% model that never ships |
| Researcher working alone for 3 weeks | No visibility into experiment progress | Daily experiment stand-ups. "What did you try? What did you learn? What's next?" |
The frontier is using AI to manage AI projects. Tools emerging now:
The team lead says: "This story is 5 points." You ask what the story is. "Fine-tune the model on the new dataset." You ask: how long will training take? "Depends on the data." What accuracy do you expect? "We'll know when we try." Will it work? "Maybe." This is not estimable in story points. And pretending it is creates false confidence that wrecks your sprint velocity and demoralizes the team.
Story points don't work for research. Story points assume you can estimate relative complexity of tasks that you've done before. But in AI, most experiments are novel. You've never fine-tuned this model on this data with these hyperparameters before. The uncertainty is fundamental, not an estimation failure.
Replace user stories with experiment tickets. Each ticket is a testable hypothesis with a clear accept/reject criterion:
```markdown
# Traditional User Story (DOESN'T WORK for AI)

## Story: Improve chatbot accuracy
As a customer, I want the chatbot to answer my questions correctly
so that I don't have to wait for a human agent.

Acceptance: Chatbot answers questions correctly.
Points: 8

# Hypothesis-Driven Experiment Ticket (USE THIS)

## EXP-042: Fine-tune Llama-3-8B on support corpus

**Hypothesis:** Fine-tuning Llama-3-8B on our 50K support ticket corpus
will improve answer accuracy from 72% (baseline: zero-shot) to >= 85%
on the 500-question eval set.

**Method:**
- Dataset: 50K tickets, 80/10/10 train/val/test split
- Model: Llama-3-8B, QLoRA (rank 16, alpha 32)
- Training: 3 epochs, lr=2e-4, batch_size=4, gradient_accumulation=8
- Eval: accuracy, F1, latency on test set

**Compute budget:** 24 GPU-hours (1x A100, ~1 day)
**Time box:** 1 sprint (2 weeks including eval and documentation)

**Success criteria:**
- >= 85% accuracy on eval set (primary)
- Latency < 500ms per response (secondary)
- No regression on safety eval (blocking)

**Kill criteria:**
- Accuracy < 78% after full training = KILL
- Training loss doesn't decrease after epoch 1 = KILL (data issue)

**Outcome:** [TO BE FILLED AFTER EXPERIMENT]
```
A traditional sprint board has: To Do, In Progress, Done. For AI teams, you need more columns that reflect the experiment lifecycle:
| Column | What lives here | Exit criteria |
|---|---|---|
| Backlog | Hypotheses not yet prioritized | Team agrees to run it this sprint |
| Hypothesis | Experiment designed, compute budget approved | Dataset ready, baseline measured |
| Experiment | Training/running in progress | Run complete, results logged |
| Evaluation | Results being analyzed against success criteria | Kill/iterate/promote decision made |
| Iteration | Promising experiment being refined | Meets success criteria or killed |
| Production | Model promoted to production pipeline | Deployed, monitored, signed off |
| Killed | Experiments that didn't meet threshold | Documented with learnings |
```yaml
# jira_ai_sprint_config.yaml
# Custom issue types for AI teams
issue_types:
  - name: Experiment
    icon: flask
    fields:
      - hypothesis: text        # What we're testing
      - method: text            # How we'll test it
      - baseline_metric: number # Current best
      - target_metric: number   # What we need
      - compute_budget: text    # "24 GPU-hours"
      - time_box: text          # "1 sprint"
      - kill_criteria: text     # When to stop
      - outcome: select         # killed | promoted | iterating
      - final_metric: number    # What we achieved
      - learnings: text         # What we learned (REQUIRED on close)
      - wandb_link: url         # Link to experiment tracking

  - name: Data Task
    icon: database
    fields:
      - data_source: text
      - volume: text                # "10K examples"
      - quality_gate: text          # "Inter-annotator agreement > 0.8"
      - blocking_experiments: link  # Which experiments need this data

  - name: ML Engineering
    icon: gear
    fields:
      - type: select          # infra | optimization | deployment
      - model_artifact: text  # Which model version
      - latency_target: text
      - cost_target: text

workflow:
  Experiment:
    - Backlog -> Hypothesis     # Prioritized in sprint planning
    - Hypothesis -> Experiment  # Dataset ready, baseline set
    - Experiment -> Evaluation  # Training complete
    - Evaluation -> Killed      # Didn't meet criteria
    - Evaluation -> Iteration   # Promising, needs refinement
    - Evaluation -> Production  # Meets all criteria
    - Iteration -> Evaluation   # Re-evaluate after refinement

# Sprint velocity is measured in EXPERIMENTS COMPLETED, not story points.
# A "completed" experiment is one with a kill/promote decision + documented learnings.
```
| Ceremony | Traditional | AI Adaptation |
|---|---|---|
| Sprint Planning | Estimate stories, commit to scope | Prioritize experiments by expected information gain. Commit to running N experiments, not to outcomes. |
| Daily Standup | "What did you do? What will you do? Blockers?" | "What did you learn? What's your next experiment? Are you stuck on data/compute/clarity?" |
| Sprint Review | Demo features to stakeholders | Share experiment results. Show eval dashboards. Explain what we learned, not just what we built. |
| Retrospective | Process improvement | Process improvement + EXPERIMENT REVIEW: Which kills were good calls? Which should we have killed earlier? Are our hypotheses getting better? |
| Symptom | Root cause | Fix |
|---|---|---|
| Velocity is wildly inconsistent | Measuring in story points on uncertain work | Switch to experiment throughput: completed experiments per sprint |
| Team always "almost done" | No time boxes on experiments | Every experiment gets a hard time box. At the deadline: kill, iterate, or promote. |
| Planning takes 4 hours | Trying to fully specify experiments upfront | Specify hypothesis + success criteria only. Method details emerge during execution. |
| Researchers skip planning | They see it as bureaucracy that slows research | Make planning about THEIR priorities. "What do YOU want to try? What do you need from the team?" |
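Experiment throughput, the replacement for story-point velocity suggested above, is simple to compute once "completed" is defined precisely. The sketch below uses the definition from the Jira config (a kill/promote decision plus documented learnings). The `ExperimentTicket` class and its field names are hypothetical stand-ins for whatever your tracker exports.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExperimentTicket:
    ticket_id: str
    decision: Optional[str] = None  # "killed" | "promoted" | None while undecided
    learnings: str = ""             # must be non-empty before the ticket counts


def is_completed(ticket: ExperimentTicket) -> bool:
    """Completed = a kill/promote decision was made AND learnings were written down."""
    return ticket.decision in {"killed", "promoted"} and bool(ticket.learnings.strip())


def experiment_throughput(tickets: List[ExperimentTicket]) -> int:
    """Number of completed experiments in the sprint."""
    return sum(is_completed(t) for t in tickets)


sprint = [
    ExperimentTicket("EXP-040", "killed", "Loss flat after epoch 1: label noise"),
    ExperimentTicket("EXP-041", "promoted", "QLoRA rank 16 is sufficient"),
    ExperimentTicket("EXP-042", "killed", ""),  # decided, but nothing documented
    ExperimentTicket("EXP-043"),                # still running
]
print(experiment_throughput(sprint))
```

Note that a killed experiment counts toward throughput exactly like a promoted one, while a decision without written learnings does not. That is the incentive this metric is meant to create.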
Your QA engineer runs the chatbot test suite. Monday: 91% pass rate. Tuesday, same code, same model, same test suite: 88% pass rate. Wednesday: 93%. Nothing changed. The model is non-deterministic — given the same input, it can produce different outputs depending on sampling temperature, random seeds, and floating-point arithmetic. Welcome to the world where "it works on my machine" is literally true only that one time.
Non-determinism is the defining challenge that separates AI project management from traditional software. In traditional software, a test either passes or fails. In AI, a test passes 91% of the time, and your job is to decide if that's good enough.
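The Monday/Tuesday/Wednesday pass rates above can be treated as three samples of the same model rather than three different results. The sketch below summarizes repeated runs and separates two questions: does the average clear the bar, and is the run-to-run spread small enough to trust? The function name and the reading of "spread" as max minus min are assumptions for illustration; a team might prefer standard deviation.

```python
from statistics import mean
from typing import Dict, List


def summarize_runs(scores: List[float], threshold: float,
                   max_spread: float = 0.03) -> Dict[str, object]:
    """Summarize repeated eval runs of the same model.

    scores:     pass rates from repeated runs (e.g. different seeds)
    threshold:  minimum acceptable mean pass rate
    max_spread: largest tolerated gap between best and worst run (assumed 3 points)
    """
    avg = mean(scores)
    spread = max(scores) - min(scores)
    return {
        "mean": avg,
        "spread": spread,
        "meets_threshold": avg >= threshold,
        "is_consistent": spread < max_spread,
    }


# The three daily pass rates from the example: 91%, 88%, 93%
result = summarize_runs([0.91, 0.88, 0.93], threshold=0.90)
```

With these numbers the mean clears a 90% bar, but the five-point spread fails a three-point consistency limit, so a single lucky run should not be what gets reported to stakeholders.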
In traditional Scrum, "done" is binary: the feature works or it doesn't. In AI, "done" is a set of thresholds across multiple dimensions:
```yaml
# Definition of Done for AI features
model_quality:
  accuracy: ">= 90% on eval set (500 examples)"
  precision: ">= 88% (false positives are costly)"
  recall: ">= 85% (false negatives are acceptable)"
  latency_p95: "< 500ms"
  consistency: "Variance < 3% across 5 eval runs with different seeds"

safety:
  toxicity: "< 0.1% toxic outputs on safety eval (1000 adversarial prompts)"
  hallucination: "< 5% hallucinated facts on factual eval"
  pii_leakage: "0% PII in outputs"

regression:
  no_regression: "All metrics within 2% of previous production model"
  backward_compat: "Existing integrations produce equivalent outputs"

operational:
  monitoring: "Eval metrics dashboarded and alerting configured"
  rollback: "Can revert to previous model version in < 5 minutes"
  documentation: "Model card completed with known limitations"
```
Replace the traditional "acceptance criteria" (manual QA checkboxes) with eval-driven acceptance:
```python
# eval_pipeline.py: automated eval with statistical testing
from dataclasses import dataclass
from typing import Dict, List

import numpy as np
from scipy import stats


@dataclass
class EvalResult:
    """One metric measured over several runs. Assumes higher is better."""
    metric_name: str
    scores: List[float]           # Multiple runs with different seeds
    threshold: float
    baseline_scores: List[float]  # Same metric for the current production model

    @property
    def mean(self) -> float:
        return float(np.mean(self.scores))

    @property
    def std(self) -> float:
        return float(np.std(self.scores))

    @property
    def passes_threshold(self) -> bool:
        return self.mean >= self.threshold

    def _welch(self):
        """Welch's t-test (unequal variances) of new scores vs. baseline."""
        return stats.ttest_ind(self.scores, self.baseline_scores, equal_var=False)

    @property
    def is_significant_improvement(self) -> bool:
        # Is the new model significantly better?
        t_stat, p_val = self._welch()
        return bool(p_val < 0.05 and t_stat > 0)

    @property
    def has_regression(self) -> bool:
        # Is the new model significantly WORSE?
        t_stat, p_val = self._welch()
        return bool(p_val < 0.05 and t_stat < 0)


def run_eval_gate(results: List[EvalResult]) -> Dict:
    """Returns go/no-go decision for production promotion."""
    report = {"pass": True, "details": []}
    for r in results:
        detail = {
            "metric": r.metric_name,
            "mean": f"{r.mean:.3f} +/- {r.std:.3f}",
            "threshold": r.threshold,
            "passes": r.passes_threshold,
            "significant": r.is_significant_improvement,
            "regression": r.has_regression,
        }
        # Any metric below threshold or significantly worse blocks promotion
        if not r.passes_threshold or r.has_regression:
            report["pass"] = False
        report["details"].append(detail)
    return report
```
The scariest moment in AI development: the new model is better on your target metric but worse on something you didn't check. This is silent regression.
| Regression type | How to detect | Prevention |
|---|---|---|
| Metric regression | Eval suite catches it | Run FULL eval suite, not just target metric |
| Distribution shift | Model degrades on a subpopulation | Slice eval by category (e.g., test per language, per topic) |
| Latency regression | Bigger model = slower inference | Include latency in eval gate criteria |
| Safety regression | New model generates toxic content | Safety eval is a BLOCKING gate, not optional |
| Behavioral regression | Model answers differently but "correctly" — breaks downstream | Golden test set: 50 hand-picked examples that must match exactly |
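The "slice eval by category" row deserves a concrete illustration, because an overall improvement can hide a drop on one subpopulation. The sketch below compares per-slice accuracy between a baseline and a candidate and reports slices that fell by more than a tolerance. The 2-point tolerance mirrors the "within 2% of previous production model" rule from the Definition of Done; the record format and function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, bool]  # (slice name, was the prediction correct?)


def accuracy_by_slice(records: List[Record]) -> Dict[str, float]:
    """Accuracy per slice (e.g. per language or per topic)."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for slice_name, ok in records:
        totals[slice_name] += 1
        correct[slice_name] += int(ok)
    return {s: correct[s] / totals[s] for s in totals}


def regressed_slices(baseline: List[Record], candidate: List[Record],
                     tolerance: float = 0.02) -> Dict[str, float]:
    """Slices where the candidate is worse than baseline by more than tolerance."""
    base = accuracy_by_slice(baseline)
    cand = accuracy_by_slice(candidate)
    return {
        s: base[s] - cand[s]
        for s in base
        if s in cand and base[s] - cand[s] > tolerance
    }


# Baseline: 80% on English, 90% on German (85% overall)
baseline = [("en", i < 8) for i in range(10)] + [("de", i < 9) for i in range(10)]
# Candidate: 100% on English, 80% on German (90% overall, so it looks like a win)
candidate = [("en", True) for _ in range(10)] + [("de", i < 8) for i in range(10)]

drops = regressed_slices(baseline, candidate)
```

Here the candidate improves overall accuracy from 85% to 90%, yet `drops` flags the German slice, which is the kind of silent regression a single headline metric would miss.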
It is sprint planning. The ML engineer says: "I can start the experiment as soon as the labeled data is ready." The data lead says: "We sent 5,000 examples to the labeling vendor last week. They said 7-10 business days." The sprint is 10 business days. The experiment needs labeled data by day 3 to have time for training and evaluation. The math doesn't work. And nobody realized it until just now.
Data is the #1 blocker for AI teams. Not compute, not model architecture, not engineering talent. Data. Specifically: getting enough high-quality, correctly labeled data to the right team at the right time. Your job as AI Scrum Master is to make data readiness visible and plan around it.
Think of data like a supply chain in manufacturing. You need raw materials (unlabeled data), processing (annotation), quality control (validation), and delivery (versioned datasets). Any disruption at any stage blocks everything downstream.
| Stage | Lead time | Common blockers | Your mitigation |
|---|---|---|---|
| Collection | 1-4 weeks | Legal approval for scraping, API rate limits, privacy review | Start collection 2 sprints before experiments need it |
| Annotation | 1-3 weeks | Vendor capacity, unclear guidelines, low agreement | Write annotation guides BEFORE sending to vendors |
| Validation | 2-5 days | Quality issues requiring re-labeling | Spot-check first 100 labels before approving full batch |
| Versioning | 1 day | No versioning = "which dataset did you train on?" | DVC or similar tool, version every dataset change |
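The sprint-planning scenario at the start of this chapter is a lead-time calculation that can be done before the meeting. The sketch below counts in business days relative to sprint start (day 0) and plans against the vendor's worst case. It assumes "last week" means the batch was sent 5 business days before the sprint; the function names are illustrative.

```python
from typing import Tuple


def arrival_window(sent_day: int, min_lead: int, max_lead: int) -> Tuple[int, int]:
    """Earliest and latest business day the data can arrive (sprint start = day 0)."""
    return sent_day + min_lead, sent_day + max_lead


def is_on_time(sent_day: int, max_lead: int, needed_by_day: int) -> bool:
    """Plan against the worst case: the upper bound of the vendor's estimate."""
    return sent_day + max_lead <= needed_by_day


def latest_safe_send_day(needed_by_day: int, max_lead: int) -> int:
    """Last day the batch can be sent and still arrive in time in the worst case."""
    return needed_by_day - max_lead


# Vendor quote: 7-10 business days. Data needed by day 3. Sent on day -5 (assumed).
window = arrival_window(sent_day=-5, min_lead=7, max_lead=10)
on_time = is_on_time(sent_day=-5, max_lead=10, needed_by_day=3)
send_by = latest_safe_send_day(needed_by_day=3, max_lead=10)
```

Under these assumptions the data arrives somewhere between day 2 and day 5, so the experiment is at risk, and the batch should have gone out 7 business days before the sprint began. That is the case for starting data work ahead of the experiments that depend on it.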
Add a parallel track to your sprint board specifically for data:
```yaml
# data_readiness_board.yaml
columns:
  - name: "Data Needed"
    description: "Experiment requires data that doesn't exist yet"
    cards_include: experiment_id, data_type, volume, deadline
  - name: "Collection"
    description: "Raw data being gathered"
    cards_include: source, method, legal_approval, eta
  - name: "Annotation"
    description: "Data sent to labeling vendor/team"
    cards_include: vendor, volume, guidelines_link, eta, cost
  - name: "QA"
    description: "Labeled data being validated"
    cards_include: sample_checked, agreement_score, issues_found
  - name: "Ready"
    description: "Versioned, validated, available in data store"
    cards_include: version, location, row_count, quality_score

# Key metric: Data Readiness Rate
# = (experiments with data ready on time) / (total experiments planned)
# Target: >= 80%. Below 60% = systemic planning failure.
```
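The Data Readiness Rate defined in the board config is a one-line ratio, but writing it down makes the thresholds unambiguous at retro time. A minimal sketch follows; the status labels other than the two thresholds from the config (80% target, 60% floor) are assumptions.

```python
def data_readiness_rate(ready_on_time: int, planned: int) -> float:
    """(experiments with data ready on time) / (total experiments planned)."""
    if planned <= 0:
        raise ValueError("planned must be positive")
    if not 0 <= ready_on_time <= planned:
        raise ValueError("ready_on_time must be between 0 and planned")
    return ready_on_time / planned


def readiness_status(rate: float) -> str:
    """Map the rate onto the thresholds from the board config."""
    if rate >= 0.80:
        return "on target"
    if rate < 0.60:
        return "systemic planning failure"
    return "needs attention"  # label assumed for the 60-80% band


rate = data_readiness_rate(ready_on_time=3, planned=5)
status = readiness_status(rate)
```

Three of five experiments with data on time gives a rate of 0.6, which sits just inside the middle band: not yet a systemic failure, but well short of the target.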
```python
# data_quality_monitor.py: track annotation quality in real time
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AnnotationBatch:
    batch_id: str
    vendor: str
    total_examples: int
    labels: List[Dict]         # [{"id": "ex_001", "label": "positive", "annotator": "a1"}, ...]
    double_labels: List[Dict]  # Examples labeled twice: {"label_a": ..., "label_b": ...}

    @property
    def inter_annotator_agreement(self) -> float:
        """Cohen's kappa between two annotators on double-labeled examples."""
        if not self.double_labels:
            return 0.0
        total = len(self.double_labels)
        agreements = sum(
            1 for d in self.double_labels if d["label_a"] == d["label_b"]
        )
        p_observed = agreements / total
        # Chance agreement: for each label, P(A picks it) * P(B picks it)
        counts_a = Counter(d["label_a"] for d in self.double_labels)
        counts_b = Counter(d["label_b"] for d in self.double_labels)
        p_chance = sum(
            (counts_a[k] / total) * (counts_b[k] / total) for k in counts_a
        )
        if p_chance == 1.0:
            return 1.0  # Both annotators used a single identical label
        return (p_observed - p_chance) / (1 - p_chance)

    @property
    def label_distribution(self) -> Dict[str, float]:
        """Share of each label, used to check for label imbalance."""
        counts = Counter(item["label"] for item in self.labels)
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}

    def quality_report(self) -> Dict:
        iaa = self.inter_annotator_agreement
        dist = self.label_distribution
        issues = []
        # 0.6 is the hard floor for failing a batch; 0.8 is the target
        if iaa < 0.6:
            issues.append(
                f"LOW AGREEMENT: kappa={iaa:.2f} (below 0.6 floor; target >= 0.8)"
            )
        for label, pct in dist.items():
            if pct > 0.9 or pct < 0.05:
                issues.append(f"IMBALANCE: '{label}' is {pct:.0%} of labels")
        return {
            "batch_id": self.batch_id,
            "agreement": iaa,
            "distribution": dist,
            "issues": issues,
            "status": "PASS" if not issues else "FAIL",
        }
```
| Problem | Diagnosis | Emergency fix | Systemic fix |
|---|---|---|---|
| Vendor missed delivery date | Scope was unclear, vendor underestimated | Use partial batch, reduce experiment scope | Send 10% pilot batch first, validate quality and timeline |
| Labels are low quality (kappa < 0.6) | Annotation guidelines are ambiguous | Re-label a subset with your own team | Create detailed guidelines with examples, run calibration sessions |
| Data has PII that blocks legal review | Nobody checked before sending to vendor | Apply PII scrubbing pipeline, re-annotate | PII check is a gate BEFORE annotation starts |
| "Which dataset did we train on?" | No data versioning | Hash the dataset file, log it in experiment tracker | DVC + data registry + version in every experiment config |
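The emergency fix in the last row, hashing the dataset file and logging the hash, takes only the standard library. The sketch below computes a SHA-256 fingerprint in chunks so large files do not need to fit in memory. The function name `dataset_fingerprint` is illustrative, and this is a stopgap rather than a substitute for a versioning tool such as DVC.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a dataset file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Logging the returned string alongside each experiment's config (for example as a `dataset_sha256` field, a name assumed here) lets anyone check later whether two runs trained on byte-identical data.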
The VP asks: "How close are we to shipping the model?" You pull up the researcher's Jupyter notebook. There are 47 cells with names like "test_v3_final_FINAL_2." The loss curve is in a matplotlib plot embedded in cell 23. The best hyperparameters are in a comment on cell 31. The eval results are in a Slack message from last Thursday. This is not experiment tracking. This is chaos.
Experiment tracking is the practice of systematically recording every experiment's configuration, results, and artifacts so that anyone on the team (including future-you) can reproduce, compare, and build on past work. For the Scrum Master, it's also the source of truth for sprint progress. You don't ask "how's the experiment going?" — you look at the dashboard.
Every experiment produces three types of artifacts that must be tracked:
| Artifact type | Examples | Why it matters |
|---|---|---|
| Configuration | Model, hyperparameters, dataset version, code commit | Reproducibility: can you rerun this exact experiment? |
| Metrics | Loss curves, accuracy, F1, latency, cost | Comparison: is this better than the last experiment? |
| Artifacts | Model weights, eval predictions, error analysis | Promotion: which model file goes to production? |
```yaml
# experiment_dashboards.yaml: three views, one data source

# 1. RESEARCHER VIEW (W&B / MLflow)
researcher_dashboard:
  charts:
    - loss_curves: "Training + validation loss over epochs, per experiment"
    - hyperparameter_sweep: "Parallel coordinates plot of lr, batch_size, etc."
    - confusion_matrix: "Per-class performance on eval set"
    - error_examples: "Top 20 hardest examples the model gets wrong"
  filters: [model_type, dataset_version, date_range]
  refresh: real-time

# 2. SCRUM MASTER VIEW (synthesized from W&B data)
scrum_dashboard:
  cards:
    - experiments_this_sprint: {total: 5, completed: 3, killed: 1, active: 1}
    - best_accuracy: {current: "88.3%", target: "90%", gap: "1.7%"}
    - compute_budget: {used: "72 GPU-hrs", total: "120 GPU-hrs", pct: "60%"}
    - data_readiness: {ready: 3, in_progress: 1, blocked: 1}
  table:
    columns: [experiment_id, hypothesis, status, best_metric, decision]
  refresh: hourly

# 3. STAKEHOLDER VIEW (executive summary)
stakeholder_dashboard:
  cards:
    - project_phase: "Convergence (Week 6 of 16)"
    - headline_metric: "Best model: 88.3% accuracy (target: 90%)"
    - confidence: "High: on track to hit 90% by week 10"
    - next_milestone: "Sprint 4 Review, May 30"
    - risks: "Data vendor delay on safety eval set"
  chart:
    - accuracy_over_time: "Weekly best accuracy with trend line"
  refresh: weekly
```
```python
# experiment_logger.py: wraps W&B for sprint tracking
from datetime import datetime

import wandb

DECISIONS = ("killed", "promoted", "iterating")


class SprintExperimentLogger:
    def __init__(self, project: str, sprint_id: str):
        self.sprint_id = sprint_id
        self.project = project

    def start_experiment(self, config: dict) -> str:
        """Start a tracked experiment with sprint metadata."""
        run = wandb.init(
            project=self.project,
            config={
                **config,
                "sprint_id": self.sprint_id,
                "started_at": datetime.now().isoformat(),
                "hypothesis": config.get("hypothesis", "Not specified"),
                "compute_budget_hrs": config.get("compute_budget_hrs", 0),
                "kill_criteria": config.get("kill_criteria", "Not specified"),
            },
            tags=[f"sprint-{self.sprint_id}", config.get("model_type", "unknown")],
        )
        return run.id

    def log_decision(self, experiment_id: str, decision: str,
                     final_metric: float, learnings: str) -> None:
        """Log the kill/promote/iterate decision on the currently active run."""
        if decision not in DECISIONS:
            raise ValueError(f"decision must be one of {DECISIONS}")
        # Write to the run summary so sprint_summary() can read the values back
        wandb.run.summary["decision"] = decision
        wandb.run.summary["final_metric"] = final_metric
        wandb.run.summary["learnings"] = learnings
        wandb.run.summary["decided_at"] = datetime.now().isoformat()
        # Also update the Jira ticket via API
        self._update_jira_ticket(experiment_id, decision, final_metric)

    def _update_jira_ticket(self, experiment_id: str, decision: str,
                            final_metric: float) -> None:
        """Placeholder: the Jira integration is not shown in this lesson."""

    def sprint_summary(self) -> dict:
        """Generate sprint review summary from experiment data."""
        api = wandb.Api()
        runs = api.runs(self.project, filters={"config.sprint_id": self.sprint_id})
        summary = {
            "total": 0, "killed": 0, "promoted": 0, "iterating": 0,
            "active": 0, "best_metric": 0, "learnings": [],
        }
        for run in runs:
            summary["total"] += 1
            decision = run.summary.get("decision", "active")
            if decision not in DECISIONS:
                decision = "active"  # No (or unrecognized) decision logged yet
            summary[decision] += 1
            metric = run.summary.get("final_metric", 0)
            if metric > summary["best_metric"]:
                summary["best_metric"] = metric
            if run.summary.get("learnings"):
                summary["learnings"].append(run.summary.get("learnings"))
        return summary
```
Stakeholders don't care about accuracy. They care about outcomes. Your job is to translate:
| ML metric | Business translation | How to present |
|---|---|---|
| Accuracy: 72% → 88% | "We went from deflecting 72% of support tickets to 88%. At roughly 10,000 tickets per week, that's 1,600 fewer human-handled tickets per week" | Show the dollar savings: 1,600 tickets × $5 avg cost = $8K/week saved |
| Latency: 2s → 400ms | "Customer wait time dropped from 2 seconds to under half a second" | Show before/after UX recording |
| Hallucination rate: 8% → 2% | "The chatbot now gives wrong answers 1 in 50 times instead of 1 in 12" | Show specific examples of prevented hallucinations |
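The first row's arithmetic is worth making explicit, since stakeholders will ask where the dollar figure came from. The sketch below reproduces it under two assumptions that the row implies but does not state: a weekly volume of 10,000 tickets and a 1:1 mapping between the metric gain and deflected tickets. The function name is illustrative.

```python
from typing import Tuple


def weekly_savings(old_rate: float, new_rate: float,
                   weekly_tickets: int, cost_per_ticket: float) -> Tuple[int, float]:
    """Return (extra tickets deflected per week, dollars saved per week)."""
    # round() absorbs floating-point noise in the rate difference
    extra_deflected = round((new_rate - old_rate) * weekly_tickets)
    return extra_deflected, extra_deflected * cost_per_ticket


# 72% -> 88% at an assumed 10,000 tickets/week and $5 average handling cost
tickets, dollars = weekly_savings(0.72, 0.88, weekly_tickets=10_000, cost_per_ticket=5)
```

Keeping the inputs as named parameters makes it easy to rerun the estimate when finance supplies the real ticket volume or handling cost.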
The researcher posts in Slack: "Model is ready! 92% accuracy! Here's the notebook." The ML engineer opens the notebook. It imports from a local path that doesn't exist on the production server. The data preprocessing uses a different tokenizer than the inference pipeline. The model was trained on Python 3.11 with PyTorch 2.1, but production runs Python 3.9 with PyTorch 1.13. The "92% accuracy" was measured on a test set that accidentally overlapped with the training set. This is the "works in notebook, breaks in prod" problem, and it kills more AI projects than bad models.
Research and production have fundamentally different requirements. A researcher optimizes for speed of iteration. A production engineer optimizes for reliability and scale. The gap between them is where AI projects die.
| Dimension | Research | Production | The gap |
|---|---|---|---|
| Code | Jupyter notebook, quick and dirty | Tested, typed, packaged Python modules | Rewrite everything |
| Data | Local CSV, ad-hoc preprocessing | Versioned datasets, pipeline DAGs | Different preprocessing = different results |
| Deps | Whatever pip installed today | Locked requirements, container images | Version conflicts, CUDA mismatches |
| Infra | Single GPU, batch processing | Multi-GPU, real-time, auto-scaling | 10x latency at scale |
| Eval | "Looks good on my test set" | Automated eval suite, A/B test in prod | Offline eval ≠ online performance |
```markdown
# Model Production Readiness Checklist

## 1. Reproducibility
- [ ] Training code runs from a single command (not a notebook)
- [ ] All dependencies pinned in requirements.txt / pyproject.toml
- [ ] Docker image builds and runs successfully
- [ ] Random seeds documented; results reproducible within 1%
- [ ] Dataset version tracked (DVC hash or equivalent)

## 2. Evaluation
- [ ] Eval suite passes on PRODUCTION eval set (not training set)
- [ ] No data leakage: train/eval sets verified disjoint
- [ ] Metrics run 5x with different seeds (variance documented)
- [ ] Compared against current production model (no regression)
- [ ] Safety eval passes (toxicity, hallucination, PII)
- [ ] Latency measured under production-like load

## 3. Integration
- [ ] Input/output schema matches API contract
- [ ] Preprocessing pipeline is IDENTICAL to training preprocessing
- [ ] Model serves via the production serving framework (TorchServe, vLLM, etc.)
- [ ] Error handling for malformed inputs
- [ ] Graceful degradation when model times out

## 4. Operational Readiness
- [ ] Model card written (purpose, limitations, biases)
- [ ] Monitoring dashboards configured (accuracy, latency, error rate)
- [ ] Alerting rules set (accuracy drops > 5%, latency p99 > 2x)
- [ ] Rollback procedure tested (revert to previous model in < 5 min)
- [ ] A/B test configured (serve new model to 5%, measure, then ramp)

## 5. Sign-offs
- [ ] ML researcher: "Model meets eval criteria"
- [ ] ML engineer: "Inference pipeline passes integration tests"
- [ ] Data engineer: "Data pipeline feeds correct data"
- [ ] Product manager: "Feature meets user requirements"
- [ ] Security/Legal: "Model complies with policies" (if applicable)
```
```python
# handoff_validator.py: automated checks for production readiness
from pathlib import Path


class HandoffValidator:
    def check_reproducibility(self, model_dir: Path) -> dict:
        checks = {}

        # 1. requirements.txt exists and every requirement is pinned with ==
        req_file = model_dir / "requirements.txt"
        if req_file.exists():
            lines = [line.strip() for line in req_file.read_text().splitlines()]
            requirements = [l for l in lines if l and not l.startswith("#")]
            checks["deps_pinned"] = all("==" in l for l in requirements)
        else:
            checks["deps_pinned"] = False

        # 2. Dockerfile exists
        checks["dockerfile"] = (model_dir / "Dockerfile").exists()

        # 3. No notebooks in production code
        checks["no_notebooks"] = not any(model_dir.glob("**/*.ipynb"))

        # 4. Data version tracked
        checks["data_versioned"] = (
            (model_dir / ".dvc").exists()
            or (model_dir / "data_version.json").exists()
        )
        return checks

    def check_eval_integrity(self, eval_config: dict) -> dict:
        checks = {}
        # Verify train/eval sets are disjoint
        train_ids = set(eval_config["train_ids"])
        eval_ids = set(eval_config["eval_ids"])
        overlap = train_ids & eval_ids
        checks["no_data_leakage"] = len(overlap) == 0
        if overlap:
            # Report a sample of leaked IDs for debugging
            checks["leaked_ids"] = list(overlap)[:10]
        return checks

    def full_check(self, model_dir: Path, eval_config: dict) -> dict:
        repro = self.check_reproducibility(model_dir)
        eval_int = self.check_eval_integrity(eval_config)
        all_checks = {**repro, **eval_int}
        # Only boolean entries are pass/fail; others (e.g. leaked_ids) are details
        passed = all(v for v in all_checks.values() if isinstance(v, bool))
        return {"passed": passed, "checks": all_checks}
```
| Failure | Root cause | Detection | Prevention |
|---|---|---|---|
| Model accuracy drops 10% in prod | Preprocessing differs between training and serving | Run eval suite through the SERVING pipeline, not research pipeline | Share preprocessing code between training and serving |
| Model loads but crashes on edge cases | Input validation missing in serving code | Fuzz testing with malformed inputs | Input schema validation in serving layer |
| Latency 5x slower than expected | Research used batch processing; prod needs single-request | Load test before promotion | Latency target in experiment ticket |
| "92% accuracy" was on contaminated eval | Train/eval overlap | Handoff validator catches it | Eval set is created and frozen before ANY training begins |
Watch a model move from research to production. Each gate checks a specific requirement. Red gates block deployment. Click Promote Model to start the handoff.
The CEO walks into your sprint review. The team just achieved 87% accuracy on the customer support chatbot. The CEO asks: "Is 87% good?" The researcher starts explaining F1 scores and confusion matrices. The CEO's eyes glaze over. You step in: "It means the chatbot gives the right answer 87 times out of 100. Our target is 92 — right now, 13 out of 100 customers would get a wrong answer, which would frustrate them. We need five more percentage points. Based on our experiments, we'll get there in about three weeks."
That's the job. Translating uncertainty into actionable information that non-technical people can make decisions with. It's possibly the single most valuable skill an AI Scrum Master has.
Stakeholders ask three questions. Each one invites a lie:
| Question | The lie | The truth | How to say it |
|---|---|---|---|
| "When will it be ready?" | "End of Q2" | We don't know. AI timelines are probabilistic. | "Based on our current trajectory, there's a 70% chance we hit the target by end of Q2. The 30% risk is data quality issues." |
| "How accurate is it?" | "87% accurate" | 87% on our eval set — which might not represent real-world usage. | "87% on our test set of 500 questions. We expect 80-85% in production because real users ask harder questions." |
| "Can you just add [feature]?" | "Sure, next sprint" | Each new capability requires a new eval suite, new data, new experiments. | "We can prototype it, but validating it to production quality is 4-6 weeks of experiments." |
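To back a statement like "87% on our test set of 500 questions" with a range, one option is a Wilson score interval on the eval accuracy. The sketch below assumes 435 correct answers out of 500 (that is, 87%); the function name is illustrative. It captures only sampling noise on the test set, not the gap between test questions and real user questions.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumed: 435 of 500 eval questions answered correctly (87%)
low, high = wilson_interval(435, 500)
print(f"{low:.1%} to {high:.1%}")
```

With 500 questions the interval spans roughly 84% to 90%, which is a useful reminder that a one-point change between sprints may be noise.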
```markdown
# AI Project Status Report Template

## One-Line Summary
[Current metric] / [Target metric] - [Trajectory statement]
Example: "88% / 92% - On track for 92% in Sprint 7 (3 weeks)"

## Progress Since Last Report
- Experiments completed: 4 (2 killed, 1 promoted, 1 iterating)
- Key learning: [What we learned that changes our approach]
- Best metric improvement: [X% → Y%] via [what technique]

## Risks and Blockers
| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Data vendor delay | 1 sprint slip | Medium | Using synthetic data as bridge |
| GPU shortage | Can't run 2 experiments in parallel | Low | Reserved spot instances |

## What We Need from Leadership
- [ ] Approval for $3K additional labeling budget
- [ ] Decision: ship at 90% or wait for 92%?
- [ ] Legal review scheduled before launch date

## Next Milestone
[What we'll demonstrate at the next sprint review]
```
Use confidence cones instead of single-point estimates. A confidence cone shows the range of possible outcomes:
Visualize how confidence narrows as the project progresses. Early estimates have wide ranges. As experiments provide data, the range shrinks. Adjust the progress slider to see how uncertainty decreases.
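A crude way to produce a range instead of a single date is to divide the remaining gap by the fastest and the slowest per-sprint gain seen so far. The sketch below is an illustration under that assumption, not a statistical confidence interval; the function name and the example history are hypothetical.

```python
import math


def sprints_to_target_range(best_by_sprint: list, target: float) -> dict:
    """Optimistic/pessimistic sprints remaining, from observed per-sprint gains."""
    gains = [b - a for a, b in zip(best_by_sprint, best_by_sprint[1:])]
    positive = [g for g in gains if g > 0]
    gap = target - best_by_sprint[-1]
    if gap <= 0:
        return {"optimistic": 0, "pessimistic": 0}
    if not positive:
        # No sprint has improved the metric yet, so no estimate is possible
        return {"optimistic": None, "pessimistic": None}
    return {
        "optimistic": math.ceil(gap / max(positive)),   # fastest observed pace
        "pessimistic": math.ceil(gap / min(positive)),  # slowest observed pace
    }


# Hypothetical history: best accuracy at the end of each of four sprints
estimate = sprints_to_target_range([0.78, 0.82, 0.85, 0.87], target=0.92)
print(estimate)
```

Reporting both ends ("two to three more sprints") keeps the conversation about risk instead of a single date that will be treated as a commitment.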
```python
# status_report.py: generate a stakeholder-friendly status from experiment data
from datetime import datetime


def generate_status_report(experiments: list, target_metric: float,
                           sprint_num: int, total_sprints: int) -> str:
    completed = [e for e in experiments if e["status"] != "active"]
    best = max((e["best_metric"] for e in completed), default=0)
    gap = target_metric - best
    killed = len([e for e in completed if e["status"] == "killed"])
    promoted = len([e for e in completed if e["status"] == "promoted"])

    # Best metric per sprint, used to estimate the improvement rate
    metrics_by_sprint = {}
    for e in completed:
        s = e.get("sprint", sprint_num)
        if s not in metrics_by_sprint or e["best_metric"] > metrics_by_sprint[s]:
            metrics_by_sprint[s] = e["best_metric"]

    if gap <= 0:
        eta = "Target already met"
    elif len(metrics_by_sprint) >= 2:
        sprints = sorted(metrics_by_sprint)
        # Average gain per sprint between the first and latest sprint with data
        improvement_per_sprint = (
            metrics_by_sprint[sprints[-1]] - metrics_by_sprint[sprints[0]]
        ) / (sprints[-1] - sprints[0])
        if improvement_per_sprint > 0:
            sprints_to_target = gap / improvement_per_sprint
            eta = f"~{sprints_to_target:.0f} sprints at current rate"
        else:
            eta = "STALLED: no improvement across sprints"
    else:
        eta = "Insufficient data for estimate"

    return "\n".join([
        f"# Sprint {sprint_num} of {total_sprints} Status "
        f"({datetime.now().strftime('%B %d')})",
        f"Best metric: {best:.1%} / Target: {target_metric:.1%} (gap: {gap:.1%})",
        f"ETA to target: {eta}",
        f"Experiments: {len(completed)} completed "
        f"({killed} killed, {promoted} promoted)",
    ])
```
| Symptom | Root cause | Fix |
|---|---|---|
| Executives surprise-ask for features mid-sprint | They don't understand the experiment cycle | Educate: "Each new capability = 2-4 sprint experiment cycle" |
| PM overpromises to customers | You gave a best-case estimate without the range | Always give confidence ranges: "70% chance by June, 90% by July" |
| Team is demoralized by "failed" experiments | Success is framed as accuracy numbers, not learning | Reframe: every sprint review starts with "What we learned" |
| Board thinks AI is a waste of money | No connection between experiments and business value | Translate every metric to dollars or customer impact |
Your team is building a customer support chatbot powered by an LLM. The "code" is a 500-word system prompt. The "testing" is running 200 customer questions and having three humans grade the answers. The "deployment" is changing the model name in an API call from GPT-3.5-turbo to GPT-4o. The "performance optimization" is rewriting a paragraph of the prompt. Nothing about this looks like traditional software development, and your sprint process needs to reflect that.
Prompt engineering sprints are the new unit of work for GenAI teams. A single prompt change can shift model behavior more than weeks of fine-tuning. But prompt changes are also unpredictable — a change that improves one capability can degrade another.
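Because one prompt change can help one capability and hurt another, it helps to compare eval scores per category rather than as a single average. The sketch below assumes your eval suite already produces a score per category; the category names, scores, and the 2-point tolerance are made up for illustration.

```python
def find_regressions(old_scores: dict, new_scores: dict,
                     tolerance: float = 0.02) -> dict:
    """Bucket each eval category as improved, regressed, or unchanged."""
    report = {"improved": [], "regressed": [], "unchanged": []}
    for category in sorted(old_scores):
        if category not in new_scores:
            continue  # category was not evaluated for the new prompt
        delta = new_scores[category] - old_scores[category]
        if delta > tolerance:
            report["improved"].append(category)
        elif delta < -tolerance:
            report["regressed"].append(category)
        else:
            report["unchanged"].append(category)
    return report


# Hypothetical per-category scores for two prompt versions
old = {"refunds": 0.80, "billing": 0.90, "shipping": 0.85}
new = {"refunds": 0.91, "billing": 0.84, "shipping": 0.85}
report = find_regressions(old, new)
print(report)
```

In this example the average goes up, yet `billing` regresses, which is exactly the case a single headline number hides.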
| Activity | Traditional ML equivalent | Sprint time |
|---|---|---|
| System prompt iteration | Architecture search | 2-5 days per major revision |
| Few-shot example curation | Training data curation | 1-3 days |
| RAG pipeline tuning | Feature engineering | 1-2 weeks |
| Eval set creation | Test suite authoring | 3-5 days (ongoing) |
| Model migration (GPT-4 → Claude) | Framework migration | 2-4 weeks (prompt rewriting + re-eval) |
| Fine-tuning | Model training | 1-2 weeks (data prep + training + eval) |
```yaml
# genai_sprint_template.yaml
sprint_week_1:
  monday:
    - Review last sprint's eval results
    - Prioritize prompt improvements by impact
    - Assign RAG pipeline experiments
  tuesday_thursday:
    - "Prompt engineering: iterate on system prompt"
    - "RAG experiments: test different chunking, retrieval, reranking"
    - Run eval suite after EACH significant change
    - 'Daily eval check-in: "What moved? What regressed?"'
  friday:
    - "Eval freeze: run full eval suite on best candidates"
    - Document prompt changelog (version control the prompt!)

sprint_week_2:
  monday_wednesday:
    - A/B test top 2 prompt versions with real traffic (5% canary)
    - Fine-tuning experiment (if applicable)
    - "RAG pipeline: index new documents, test retrieval quality"
  thursday:
    - Production promotion decision
    - "Sprint review prep: compile eval results + business metrics"
  friday:
    - "Sprint review: show before/after on key scenarios"
    - "Retrospective: what eval gaps did we discover?"
    - Plan next sprint's eval set improvements
```
```python
# prompt_registry.py: version control for prompts
import hashlib
import json
from datetime import datetime
from pathlib import Path
from typing import Optional


class PromptRegistry:
    def __init__(self, registry_path: str = "prompts/"):
        self.path = Path(registry_path)
        # parents=True so nested registry paths also work
        self.path.mkdir(parents=True, exist_ok=True)

    def register(self, name: str, prompt: str,
                 metadata: Optional[dict] = None) -> str:
        """Register a prompt version with hash-based versioning."""
        # Identical prompt text always maps to the same 8-character version
        version = hashlib.sha256(prompt.encode()).hexdigest()[:8]
        record = {
            "name": name,
            "version": version,
            "prompt": prompt,
            "created_at": datetime.now().isoformat(),
            "metadata": metadata or {},
            "char_count": len(prompt),
            "word_count": len(prompt.split()),
        }
        filepath = self.path / f"{name}_v{version}.json"
        filepath.write_text(json.dumps(record, indent=2))
        return version

    def compare(self, name: str, v1: str, v2: str) -> dict:
        """Compare the size and creation date of two prompt versions."""
        p1 = json.loads((self.path / f"{name}_v{v1}.json").read_text())
        p2 = json.loads((self.path / f"{name}_v{v2}.json").read_text())
        return {
            "v1_words": p1["word_count"],
            "v2_words": p2["word_count"],
            "delta_words": p2["word_count"] - p1["word_count"],
            "v1_date": p1["created_at"],
            "v2_date": p2["created_at"],
        }
```
RAG (Retrieval-Augmented Generation) is the backbone of most production GenAI applications. Each component of the RAG pipeline is a separate experiment axis:
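One way to plan these experiments is to list the axes explicitly and change one at a time against a baseline, so each result can be attributed to a single change. The sketch below uses chunking, retrieval, and reranking as axes; the specific values (`256`, `"bm25"`, `"cross_encoder"`, and so on) are placeholders, not recommendations.

```python
from itertools import product

# Hypothetical experiment axes and candidate values
axes = {
    "chunk_size": [256, 512],
    "retriever": ["bm25", "dense"],
    "reranker": [None, "cross_encoder"],
}
baseline = {"chunk_size": 256, "retriever": "bm25", "reranker": None}


def one_axis_at_a_time(baseline: dict, axes: dict) -> list:
    """Vary a single axis per experiment so each result is attributable."""
    experiments = []
    for axis, values in axes.items():
        for value in values:
            if value != baseline[axis]:
                experiments.append({**baseline, axis: value})
    return experiments


experiments = one_axis_at_a_time(baseline, axes)
full_grid = list(product(*axes.values()))  # every combination, for comparison
print(len(experiments), "single-axis experiments vs", len(full_grid), "in the full grid")
```

The single-axis plan fits a sprint's compute budget more easily; the full grid grows multiplicatively with every axis you add.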
Every 6-12 months, a new model generation launches (GPT-4 → GPT-4o → GPT-5, Claude 3 → Claude 4). Migration is a multi-sprint project:
```yaml
# model_migration_plan.yaml: Claude 3.5 → Claude 4
sprint_1_eval:
  - Run FULL eval suite on new model with existing prompts
  - Identify regressions (new model is different, not just better)
  - Benchmark latency and cost differences
  - "Decision: is the upgrade worth the migration effort?"

sprint_2_prompt_adaptation:
  - Rewrite prompts for new model's capabilities/quirks
  - New model may need less hand-holding (remove workarounds)
  - New model may have different failure modes (add guardrails)
  - Run eval suite after each prompt revision

sprint_3_integration:
  - Update API integration (new endpoints, parameters)
  - Update token budgets (new model may have different context window)
  - Load test under production traffic patterns
  - "Canary deployment: 5% traffic to new model"

sprint_4_rollout:
  - Monitor canary for 1 week
  - Ramp to 50%, then 100%
  - Keep old model warm for 2 weeks (rollback safety)
  - Update documentation and model card
```
Track prompt versions and their eval scores across sprints. Each bar is a prompt version. Click New Prompt Version to simulate iterating on the system prompt.
Your team is building an AI agent that can research a topic, write a report, and email it to a customer. In testing, the agent works beautifully 85% of the time. The other 15%? It sends emails to the wrong person. It cites sources that don't exist. It writes a report about the wrong topic because it misinterpreted the request. And once, memorably, it entered an infinite loop and sent 47 emails before anyone noticed.
Agentic AI — systems where an LLM takes actions in the real world (calling APIs, executing code, modifying databases, sending communications) — is the most unpredictable type of AI project to manage. The failure modes aren't just "wrong answer." They're "wrong action with real-world consequences."
| Dimension | Traditional ML | Chatbot/LLM | Agentic AI |
|---|---|---|---|
| Failure mode | Wrong prediction | Wrong answer | Wrong ACTION (sends email, deletes data, charges money) |
| Blast radius | One user sees wrong result | One user gets bad answer | Agent modifies external systems irreversibly |
| Testing | Eval set, accuracy metrics | Human grading, automated evals | End-to-end trajectory testing, sandbox environments |
| Debugging | Check model weights, features | Read the prompt, check context | Trace multi-step reasoning across tool calls |
| Sprint predictability | Low (experiments) | Medium (prompt iteration is fast) | Very low (emergent behavior from tool combinations) |
```yaml
# agent_sprint_structure.yaml

# Phase 1: Tool Integration (1-2 sprints per tool)
tool_sprints:
  each_tool:
    - Define tool's API contract (input/output schemas)
    - Implement tool with error handling and rate limiting
    - Write unit tests for the tool in isolation
    - "Write integration test: agent calls tool correctly"
    - "Write adversarial test: agent handles tool failure gracefully"
    - "Safety review: what happens if agent misuses this tool?"

# Phase 2: Behavior Testing (ongoing, every sprint)
behavior_testing:
  trajectory_tests:
    - Define 50+ test scenarios with expected action sequences
    - Run agent in sandbox, record full trajectory
    - "Grade: correct actions? correct order? no harmful actions?"
    - "Regression test: adding tool B didn't break tool A behavior"
  adversarial_tests:
    - "Prompt injection: user tries to make agent do unauthorized actions"
    - "Edge cases: what if the tool returns an error?"
    - "Loops: does the agent ever enter infinite tool-calling loops?"
    - "Scope creep: does the agent stay within its defined capabilities?"

# Phase 3: Safety Review Gates
safety_gates:
  before_sandbox: "Agent can only call mock tools"
  before_staging: "Agent calls real tools but in test environment"
  before_production: "Full safety review, rate limits, kill switch"
```
```python
# agent_trajectory.py: log and analyze agent action sequences
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class AgentStep:
    step_num: int
    thought: str      # Agent's reasoning
    tool_name: str    # Which tool it called
    tool_input: dict  # What it passed to the tool
    tool_output: str  # What the tool returned
    timestamp: datetime = field(default_factory=datetime.now)


@dataclass
class Trajectory:
    task: str
    steps: List[AgentStep] = field(default_factory=list)
    final_output: Optional[str] = None
    success: Optional[bool] = None

    @property
    def tool_sequence(self) -> List[str]:
        return [s.tool_name for s in self.steps]

    @property
    def has_loop(self) -> bool:
        """Detect if the agent made the same tool call 3+ times in a row."""
        for i in range(len(self.steps) - 2):
            a, b, c = self.steps[i], self.steps[i + 1], self.steps[i + 2]
            same_tool = a.tool_name == b.tool_name == c.tool_name
            same_input = a.tool_input == b.tool_input == c.tool_input
            if same_tool and same_input:
                return True
        return False

    @property
    def unauthorized_actions(self) -> List[AgentStep]:
        """Flag steps that used tools outside the allowed set."""
        allowed = {"search", "read_doc", "write_report", "send_email"}
        return [s for s in self.steps if s.tool_name not in allowed]
```
Some systems use multiple agents that collaborate: a planner agent, a researcher agent, a writer agent, and a reviewer agent. Coordinating multi-agent systems in sprints requires treating agent interactions as integration points:
| Sprint activity | Single agent | Multi-agent |
|---|---|---|
| Testing | Test one agent's behavior | Test agent HANDOFFS: does agent A's output format match agent B's expected input? |
| Debugging | Read one trajectory | Trace across agent boundaries: which agent introduced the error? |
| Planning | One set of capabilities | Dependency graph: agent C can't be developed until agent A's API stabilizes |
| Safety | One agent's action space | Emergent behavior: agents A and B are safe alone, but together they escalate privileges |
| Failure mode | How to detect | How to fix |
|---|---|---|
| Infinite loop | Step count exceeds max (e.g., 20 steps) | Hard step limit + loop detection in trajectory logger |
| Wrong tool selection | Trajectory shows agent used "delete" when it should have used "update" | Better tool descriptions, few-shot examples in prompt |
| Scope creep | Agent performs actions not requested by user | Explicit instruction: "Only perform actions the user specifically requested" |
| PII exposure | Agent passes customer data to external tool | PII filter on all tool inputs. Block tool calls containing PII patterns. |
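Two of the fixes in this table, the hard step limit and the PII filter on tool inputs, can be enforced in one gate that runs before every tool call. This is a minimal sketch: `check_tool_call` is a hypothetical helper, the 20-step limit comes from the example above, and the two regular expressions (US SSN-like and 16-digit card-like numbers) are illustrative rather than a complete PII detector.

```python
import re

MAX_STEPS = 20
# Illustrative patterns only; real PII detection needs far broader coverage
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                        # SSN-like
    re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),      # card-like
]


def check_tool_call(step_num: int, tool_name: str, tool_input: dict) -> tuple:
    """Return (allowed, reason) for a proposed tool call."""
    if step_num > MAX_STEPS:
        return False, "step limit exceeded"
    # Flatten all input values into one string and scan it
    text = " ".join(str(value) for value in tool_input.values())
    if any(pattern.search(text) for pattern in PII_PATTERNS):
        return False, f"PII pattern detected in input to {tool_name}"
    return True, "ok"


print(check_tool_call(3, "search", {"query": "refund policy"}))
print(check_tool_call(3, "search", {"query": "customer SSN 123-45-6789"}))
print(check_tool_call(21, "search", {"query": "refund policy"}))
```

Putting the gate outside the model means it still holds when the prompt-level instructions are ignored or overridden by an injected prompt.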
Watch an AI agent execute a multi-step task. Each node is a tool call. Green = successful step, red = failure, yellow = loop detection. Click Run Agent to simulate an execution.
It's 2 AM. PagerDuty fires. Your production model's accuracy has dropped from 91% to 67% over the last 6 hours. Customer complaints are flooding in. The support team is escalating. You check the monitoring dashboard: the model itself hasn't changed, but the input distribution has. A viral social media post is driving a new type of question your model was never trained to handle. This is data drift, and it's one of the most common production failures in AI systems.
AI systems face risks that traditional software doesn't. Your sprint process must include explicit checkpoints for each category:
| Risk category | Examples | Sprint checkpoint |
|---|---|---|
| Model regression | New model version performs worse on a subpopulation | Full eval suite before every model promotion |
| Data drift | Input distribution changes in production vs. training | Weekly distribution monitoring, alerting on drift metrics |
| Safety incidents | Toxic output, hallucinated facts, PII leakage | Safety eval gate before deployment + continuous monitoring |
| Bias detection | Model performs worse for certain demographics | Fairness eval: slice metrics by demographic category |
| Compliance | EU AI Act, GDPR data usage, industry regulations | Legal review gate before launch, quarterly compliance audit |
| Operational | GPU shortage, training failure, cost overrun | Compute budget tracking, cost alerts |
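The fairness checkpoint ("slice metrics by demographic category") comes down to computing the same metric per group and looking at the gap. The sketch below assumes each eval record carries a group label and a correctness flag; the field names, the sample data, and the 5-point blocking threshold are assumptions for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(records: list, slice_key: str) -> dict:
    """Accuracy per group, where each record has slice_key and 'correct'."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[slice_key]
        totals[group] += 1
        correct[group] += int(record["correct"])
    return {group: correct[group] / totals[group] for group in totals}


def max_gap(accuracies: dict) -> float:
    """Largest accuracy difference between any two groups."""
    return max(accuracies.values()) - min(accuracies.values())


# Hypothetical eval results: 9/10 correct in English, 8/10 in Spanish
records = (
    [{"language": "en", "correct": i < 9} for i in range(10)]
    + [{"language": "es", "correct": i < 8} for i in range(10)]
)
accuracies = accuracy_by_slice(records, "language")
block_deployment = max_gap(accuracies) > 0.05  # assumed threshold
print(accuracies, block_deployment)
```

Small slices produce noisy accuracies, so it is worth reporting the sample count per group next to each number.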
```yaml
# ai_risk_register.yaml: maintained by the AI Scrum Master
risks:
  - id: RISK-001
    category: data_drift
    description: "Input distribution shifts as product usage changes"
    likelihood: high
    impact: high
    current_status: mitigated
    mitigation:
      - "Weekly drift detection (PSI on top 20 features)"
      - "Alert when PSI > 0.2 on any feature"
      - "Retrain pipeline: triggered manually, evaluated automatically"
    sprint_checkpoint: "Weekly drift report in standup (Monday)"
    owner: "ML Engineer (Sarah)"

  - id: RISK-002
    category: safety
    description: "LLM generates harmful content to vulnerable users"
    likelihood: medium
    impact: critical
    current_status: mitigated
    mitigation:
      - "Content safety classifier on all outputs (Llama Guard)"
      - "Block + log any output classified as harmful"
      - "Monthly adversarial red-team testing"
    sprint_checkpoint: "Safety metrics in every sprint review"
    owner: "AI Safety Lead (Marcus)"

  - id: RISK-003
    category: bias
    description: "Model performs worse for non-English speakers"
    likelihood: high
    impact: high
    current_status: monitoring
    mitigation:
      - "Eval suite includes multi-language test set"
      - "Accuracy sliced by language in every eval run"
      - "If gap > 5% between languages, block deployment"
    sprint_checkpoint: "Fairness metrics in eval dashboard"
    owner: "Data Scientist (Priya)"

  - id: RISK-004
    category: compliance
    description: "EU AI Act requires transparency for high-risk AI"
    likelihood: certain
    impact: medium
    current_status: in_progress
    mitigation:
      - "Model card documenting capabilities and limitations"
      - "Human-in-the-loop for high-stakes decisions"
      - "Audit trail: log all model inputs, outputs, and decisions"
    sprint_checkpoint: "Quarterly compliance review with legal"
    owner: "AI Scrum Master (You)"
```
```python
# drift_detector.py: monitor input distribution changes
from typing import Dict

import numpy as np


def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10) -> float:
    """Calculate PSI between training and production distributions.

    PSI < 0.1: no significant change
    PSI 0.1-0.2: moderate change, investigate
    PSI > 0.2: significant change, retrain needed
    """
    # Bin both distributions on a shared set of breakpoints
    breakpoints = np.linspace(
        min(expected.min(), actual.min()),
        max(expected.max(), actual.max()),
        bins + 1,
    )
    expected_pcts = np.histogram(expected, breakpoints)[0] / len(expected)
    actual_pcts = np.histogram(actual, breakpoints)[0] / len(actual)

    # Avoid division by zero and log(0) for empty bins
    expected_pcts = np.clip(expected_pcts, 0.001, None)
    actual_pcts = np.clip(actual_pcts, 0.001, None)

    # PSI formula
    psi = np.sum((actual_pcts - expected_pcts) * np.log(actual_pcts / expected_pcts))
    return float(psi)


def check_drift(training_data: Dict[str, np.ndarray],
                production_data: Dict[str, np.ndarray],
                threshold: float = 0.2) -> Dict:
    """Check all features for drift. Returns alerts if any exceed threshold."""
    results = {}
    alerts = []
    for feature in training_data:
        if feature not in production_data:
            continue
        psi = population_stability_index(
            training_data[feature], production_data[feature]
        )
        if psi < 0.1:
            status = "OK"
        elif psi < threshold:
            status = "WARN"
        else:
            status = "ALERT"
        results[feature] = {"psi": round(psi, 4), "status": status}
        if status == "ALERT":
            alerts.append(f"{feature}: PSI={psi:.3f}")
    return {"features": results, "alerts": alerts, "needs_retrain": len(alerts) > 0}
```
| Incident type | Detection | Immediate action | Root cause |
|---|---|---|---|
| Accuracy drop > 10% | Monitoring alert | Rollback to previous model | Check for data drift, eval set contamination, infra issue |
| Toxic output reported | Content filter log, customer report | Add input to block list, escalate to safety team | Adversarial input, training data contamination, filter gap |
| PII in model output | PII scanner on outputs | Kill the response, notify affected user, log for compliance | PII in training data (data pipeline failure) |
| Agent performs unauthorized action | Trajectory logger, audit trail | Disable agent, review all recent actions | Prompt injection, missing guardrails, tool permission error |
Visualize your AI project's risk landscape. Each cell represents a risk category. Color intensity shows severity (likelihood × impact). Click risk categories to toggle mitigations.
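Severity as likelihood × impact is easy to compute once the labels are mapped to numbers. In the sketch below the 1-4 scales are an assumption (any monotonic scale works), and the four entries mirror the likelihood and impact labels in the risk register above.

```python
# Assumed ordinal scales; adjust to your organization's risk matrix
LIKELIHOOD = {"low": 1, "medium": 2, "high": 3, "certain": 4}
IMPACT = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def severity(risk: dict) -> int:
    """Severity score = likelihood x impact on the assumed scales."""
    return LIKELIHOOD[risk["likelihood"]] * IMPACT[risk["impact"]]


risks = [
    {"id": "RISK-001", "likelihood": "high", "impact": "high"},
    {"id": "RISK-002", "likelihood": "medium", "impact": "critical"},
    {"id": "RISK-003", "likelihood": "high", "impact": "high"},
    {"id": "RISK-004", "likelihood": "certain", "impact": "medium"},
]

# Highest severity first; ties keep their register order (sorted is stable)
ranked = sorted(risks, key=severity, reverse=True)
for risk in ranked:
    print(risk["id"], severity(risk))
```

A multiplicative score is a triage aid, not a verdict: a medium-likelihood, critical-impact safety risk may still deserve attention before a higher-scoring operational one.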
Everything we've discussed comes together here. This is a living Kanban board designed for AI teams. It has the columns we defined in Chapter 2: Hypothesis, Experiment, Evaluation, Production, and Killed. But it also simulates the chaos of real AI sprints — blockers appear, experiments fail, stakeholders change scope, and GPUs run out.
| Control | What it does | What to watch |
|---|---|---|
| Advance Sprint | Moves time forward. Cards progress through columns based on probability. | Watch how experiments flow. Most should end up in "Killed" — that's healthy. |
| Add Experiment | Adds a new hypothesis card to the board. | Watch if the board gets overloaded. Too many active experiments = WIP limit exceeded. |
| Data Quality Issue | Injects a data blocker. Experiments in "Experiment" stage stall. | Watch the cascading effect: blocked experiments push back the entire sprint. |
| GPU Shortage | Injects a compute blocker. Only 1 experiment can run at a time. | Watch how the queue backs up. This is why compute planning matters. |
| Eval Regression | A promoted model fails regression testing. Bounces back to "Evaluation." | Watch the cost of late-stage failure. All the downstream work is wasted. |
| Scope Change | Stakeholder adds new requirements mid-sprint. | Watch how scope creep disrupts the experiment pipeline. |
A Kanban board for AI teams. Cards are experiments flowing through stages. Inject blockers to simulate real-world disruptions. Watch how the sprint adapts.
Every column in the simulation maps to a real workflow stage. Here's the production sprint board configuration:
```yaml
# ai_kanban_config.yaml
columns:
  hypothesis:
    wip_limit: 5
    card_fields: [hypothesis, success_criteria, compute_budget]
    exit_gate: "Dataset ready, baseline measured"
  experiment:
    wip_limit: 3  # Limited by GPU availability
    card_fields: [training_status, current_metric, compute_used]
    exit_gate: "Training complete, results logged"
    blockers: [data_quality, gpu_shortage, training_divergence]
  evaluation:
    wip_limit: 4
    card_fields: [eval_results, regression_check, safety_check]
    exit_gate: "Kill/iterate/promote decision made + documented"
  production:
    wip_limit: 2  # Don't deploy too many models at once
    card_fields: [deploy_status, monitoring_status, rollback_tested]
    exit_gate: "Model serving in production, monitoring active"
  killed:
    wip_limit: none
    card_fields: [final_metric, kill_reason, learnings]
    exit_gate: none  # Terminal state

# Sprint metrics derived from board state:
metrics:
  throughput: "Experiments completed (killed + promoted) per sprint"
  cycle_time: "Average days from Hypothesis to Decision"
  kill_rate: "% of experiments killed (healthy: 60-80%)"
  promotion_rate: "% of experiments promoted (healthy: 10-30%)"
  blocker_frequency: "Blockers injected per sprint (track over time)"
```
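The sprint metrics can be derived directly from the cards on the board. The sketch below assumes each card is a dict with a `column` and, once decided, a `days_to_decision` field; both field names are hypothetical, and the rates are computed over all cards in the sprint, which is one reasonable reading of the definitions.

```python
def board_metrics(cards: list) -> dict:
    """Compute throughput, kill/promotion rates, and cycle time from cards."""
    killed = [c for c in cards if c["column"] == "killed"]
    promoted = [c for c in cards if c["column"] == "production"]
    decided = killed + promoted
    total = len(cards)
    return {
        "throughput": len(decided),
        "kill_rate": len(killed) / total if total else 0.0,
        "promotion_rate": len(promoted) / total if total else 0.0,
        # Average days from Hypothesis to a kill/promote decision
        "cycle_time": (
            sum(c["days_to_decision"] for c in decided) / len(decided)
            if decided else None
        ),
    }


# Hypothetical sprint: 7 killed, 2 promoted, 1 still in evaluation
cards = (
    [{"column": "killed", "days_to_decision": 4} for _ in range(7)]
    + [{"column": "production", "days_to_decision": 10} for _ in range(2)]
    + [{"column": "evaluation"}]
)
metrics = board_metrics(cards)
print(metrics)
```

Here the kill rate (70%) and promotion rate (20%) both land inside the healthy ranges listed in the config.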
In an interview, you have 5 minutes to draw this board on a whiteboard. Here are the key talking points:
This chapter distills everything into a cheat sheet you can review in the 30 minutes before your interview. Every section maps to a common interview question type for AI Scrum Master / Technical Program Manager roles.
| Scenario | Key points to cover | Chapter |
|---|---|---|
| "Your AI team hasn't shipped anything in 3 sprints" | Diagnose: Are experiments running but failing? (Healthy.) Are experiments not starting? (Blocker.) Is the team afraid to kill experiments? (Process.) Switch from outcome-based to learning-based velocity. | 1, 2 |
| "The researcher says it will work, the engineer says it won't scale" | This is the research-to-prod gap. Run a production readiness checklist. Time-box the scalability investigation to 1 sprint. If it can't scale, it can't ship. | 6 |
| "Stakeholders want to launch the chatbot but accuracy is only 85%" | Frame the decision: what does 15% wrong look like? Show concrete failure examples. Quantify the business cost of errors vs. the cost of delay. Propose: launch with human-in-the-loop for low-confidence answers. | 3, 7 |
| "Design the sprint process for a new GenAI feature" | Hypothesis-driven tickets, eval-driven acceptance, prompt version control, RAG pipeline as experiment axis, safety review gate before deployment. | 2, 8 |
| "How do you manage risk for an AI agent in production?" | Trajectory logging, adversarial testing, rate limits, kill switch, human-in-the-loop for high-stakes actions, continuous monitoring of agent behavior. | 9, 10 |
| Format | Duration | What they test | How to prepare |
|---|---|---|---|
| Case study | 45-60 min | Given a scenario, design the sprint process | Practice with the scenarios above. Draw boards. |
| Behavioral | 30-45 min | "Tell me about a time you managed an AI project" | STAR method: Situation, Task, Action, Result. Quantify results. |
| Technical | 30-45 min | "Explain MLOps, eval pipelines, data versioning" | Review Chapters 3-6. Know the tools: W&B, MLflow, DVC. |
| Stakeholder sim | 30 min | Interviewer plays a frustrated PM or confused exec | Practice the translation framework from Chapter 7. |
| Whiteboard | 30-45 min | Draw the AI sprint board, experiment lifecycle | Practice drawing Chapter 11's board in 5 minutes. |
| Resource | Type | Why it matters |
|---|---|---|
| PSM I / PSM II (Scrum.org) | Certification | Baseline Scrum knowledge. Employers expect it. |
| SAFe Agilist | Certification | Enterprise-scale agile. Useful for large AI organizations. |
| "Accelerate" (Forsgren et al.) | Book | DORA metrics, deployment frequency, lead time. Apply to ML. |
| "Designing Machine Learning Systems" (Huyen) | Book | The best MLOps book. Covers the full production lifecycle. |
| "AI Engineering: Building Applications with Foundation Models" (Huyen) | Book | Practical guide to GenAI systems. Eval, RAG, prompt engineering. |
| MLOps Community | Community | Slack + meetups. Stay current on tooling and practices. |
| Google "Rules of ML" | Guide | Martin Zinkevich's 43 rules. Timeless wisdom for ML projects. |
Click each dimension to see the key topics you should be able to discuss in an interview. This is your study guide.
When asked "Why should we hire you as an AI Scrum Master?", here's the structure:
```text
"I've managed AI teams where 80% of experiments fail, and that's healthy.
My approach differs from traditional Scrum in three ways:

1. HYPOTHESIS-DRIVEN tickets instead of user stories. Each sprint,
   we commit to running N experiments, not shipping N features.
   Success is measured in learning velocity, not story points.

2. EVAL-DRIVEN acceptance. Every model change runs through an
   automated eval suite with statistical significance testing.
   We don't ship on vibes; we ship on data.

3. STAKEHOLDER TRANSLATION. I convert 'the model is 87% accurate'
   into '13 out of 100 customers get wrong answers, costing us
   $X per week.' Leadership makes decisions on business impact,
   not ML metrics.

I also embed responsible AI throughout the sprint (bias checks,
safety evals, drift monitoring), not as an afterthought but as
standard sprint activities."
```