AI Engineering

AI Evaluation & Evals

"It seems to work" is not a metric. Evals are the test suite for AI: they are how you know your system is actually good, catch regressions before users do, and make data-driven decisions about what to ship.

Prerequisites: Basic Python + curiosity about AI quality. No ML theory needed.
10 chapters · 10+ simulations · 0 assumed ML

Chapter 0: Why "It Seems to Work" Isn't a Metric

You've built an AI system. A question-answering chatbot, a summarizer, a code assistant. You prompt it with a few examples and it gives solid answers. Your manager is impressed. You ship it.

Three weeks later, a customer reports it confidently gave them wrong legal information. Another says it summarizes documents by dropping the most important numbers. A third says every response reads like a robot wrote it. The responses "seemed fine" in your five test examples — but you were testing the happy path, not the edge cases.

The core problem: LLMs are non-deterministic, broad-coverage systems. They behave differently on inputs you haven't tested. You cannot manually inspect enough outputs to know if your system is actually good. You need systematic, repeatable measurement.

This is where evals come in. An eval is a test suite for AI — a structured set of inputs, grading criteria, and aggregated scores that tells you exactly how well your system performs on a defined set of behaviors. Evals let you catch regressions automatically, compare versions objectively, and make deployment decisions with data rather than intuition.

The Vibe Check vs. Eval Spectrum

Most teams start with vibe checks: run a few examples, it looks good, ship it. Real eval practices look like this:

| Practice | Vibe Check | Structured Eval |
| --- | --- | --- |
| Test cases | 5–10, cherry-picked | 50–500+, representative sample |
| Scoring | "Looks good to me" | Graders with rubrics, aggregated scores |
| Coverage | Happy path only | Edge cases, adversarial, distribution shift |
| Regression detection | None | CI blocks on score drop |
| Reproducibility | None | Versioned datasets, deterministic scoring |
| Decision basis | Gut feeling | Statistical confidence intervals |

The Vibe Check Trap

A model scores well on cherry-picked examples but fails on the real distribution. Drag the slider to see how test coverage affects your confidence in the system score.


The simulation above shows why five examples fool you: you happen to pick the easy ones. As coverage grows, you discover the harder distribution — and the real score drops. Evals give you that honest picture before users do.
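
To put a number on how little five passing examples tell you, here is a minimal sketch that computes a Wilson score interval for a pass rate. The function name and the example counts are illustrative assumptions, not part of any particular eval framework.

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("need at least one test case")
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Five cherry-picked cases that all pass, versus 500 sampled cases at 78%.
small = wilson_interval(5, 5)
large = wilson_interval(390, 500)
print(f"5/5 passed:     {small[0]:.2f} to {small[1]:.2f}")
print(f"390/500 passed: {large[0]:.2f} to {large[1]:.2f}")
```

The five-case interval is very wide even though every case passed, while the 500-case interval is narrow enough to act on. The interval also assumes the cases were sampled representatively, which cherry-picked examples are not.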

What's the core reason vibe checks fail for AI systems?

Chapter 1: What to Measure

Before you build any eval, you need to answer: what does "good" mean for your system? This sounds obvious but it's easy to measure the wrong things. A chatbot with 90% grammatical correctness might still be useless if it answers the wrong question. A summarizer with low latency is worthless if it drops key facts.

There are seven core dimensions every AI system can be measured on. You won't need all seven — choose the ones that match your failure modes.

The Seven Dimensions

Correctness
Is the output factually accurate? Does it answer what was asked? This is the hardest to define and measure.
Faithfulness
For RAG/summarization: does the output stay grounded in the source? Does it hallucinate facts not in the context?
Relevance
Does the response address the user's actual intent? A correct answer to the wrong question scores zero here.
Helpfulness
Would a real user find this useful? Combines correctness + relevance + format. Often measured by preference ratings.
Safety
Does it refuse harmful requests? Does it avoid generating dangerous content, PII, or policy violations?
Latency
How long until the first token? How long until full response? p50, p95, p99 latency — not just average.
Cost
(Input tokens × input price) + (output tokens × output price). Track total per query, total per day, and cost per successful task completion.

The tradeoffs are real: Bigger models tend to be more correct but slower and more expensive. More cautious safety filtering reduces harmful outputs but increases false refusals. You can rarely optimize all seven dimensions at once, so pick the ones that hurt most if they fail.
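
The latency and cost dimensions are the easiest to compute directly. The sketch below shows one way to do it with the standard library; the function names, sample latencies, and per-million-token prices are hypothetical placeholders, so substitute your provider's actual rates.

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Return p50, p95, and p99 from raw per-request latencies."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def query_cost_usd(tokens_in: int, tokens_out: int,
                   usd_per_1m_in: float, usd_per_1m_out: float) -> float:
    """Input and output tokens are usually priced separately."""
    return (tokens_in * usd_per_1m_in + tokens_out * usd_per_1m_out) / 1_000_000

def cost_per_success(total_cost_usd: float, successes: int) -> float:
    """Cost per successfully completed task; infinite if nothing succeeded."""
    return total_cost_usd / successes if successes else float("inf")

# Hypothetical numbers, for illustration only
latencies = [420, 450, 480, 510, 530, 560, 600, 640, 900, 2400]
print(latency_percentiles(latencies))
print(query_cost_usd(1200, 300, usd_per_1m_in=3.0, usd_per_1m_out=15.0))
```

Note how a single slow request dominates the tail percentiles while barely moving the median, which is why the average alone hides the experience of your slowest users.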

Picking Your Metrics

For a customer service bot: correctness + faithfulness + safety. For a code assistant: correctness (does it run? does it pass tests?) + latency. For a RAG pipeline: faithfulness + relevance + latency. For a creative writing assistant: helpfulness + safety. The metric choice follows from the use case — always ask: "what would make this system useless or harmful?"

python
# Defining your eval dimensions upfront
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvalCase:
    case_id: str                    # stable ID so results can be joined back to cases
    input: str                      # the prompt/question
    expected: Optional[str] = None  # reference answer (optional)
    context: Optional[str] = None   # source document (for RAG)
    metadata: dict = field(default_factory=dict)  # tags: difficulty, category, etc.

@dataclass
class EvalResult:
    case_id: str
    output: str
    correctness: float   # 0.0–1.0
    faithfulness: float  # 0.0–1.0
    latency_ms: int
    cost_usd: float
    grader_notes: str = ""

A RAG-based legal assistant answers questions by reading uploaded contracts. Which two dimensions should be weighted MOST heavily in its eval?

Chapter 2: Human Eval

The gold standard for AI evaluation is a human reading the output and deciding if it's good. Machines can approximate this, but for many dimensions — helpfulness, tone, genuine usefulness — humans are still the ground truth. The question is how to collect human judgments in a way that's reliable and scalable.

Rating Scales

The simplest approach: show a rater the input + output, ask them to score it 1–5. Problems: "3" means different things to different raters. One rater's 4 is another's 2. This inter-annotator disagreement is the core challenge of human eval.

Solutions: use anchored scales ("1 = completely wrong, 3 = partially correct with errors, 5 = fully correct and helpful"), train raters with calibration examples, use an even-numbered scale when you want to force a non-neutral choice (an odd-numbered scale leaves a neutral midpoint), and measure Cohen's Kappa, the agreement statistic for two raters that corrects for chance agreement.

κ = (Po − Pe) / (1 − Pe)

Where Po is the observed agreement fraction and Pe is the agreement expected by chance, computed from each rater's label frequencies. By the commonly used Landis and Koch bands, κ between 0.61 and 0.80 is substantial agreement and κ above 0.80 is near-perfect. If κ is below 0.4, your rubric is probably ambiguous and needs revision.
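
As a worked instance of the formula above, here is a minimal pure-Python sketch of Cohen's Kappa for two raters. The function name and the example labels are made up for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's Kappa for two raters labeling the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Po: fraction of items where the two raters gave the same label
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Pe: chance agreement implied by each rater's label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for everything
    return (p_o - p_e) / (1 - p_e)

a = ["pass"] * 6 + ["fail"] * 4
b = ["pass"] * 4 + ["fail"] * 5 + ["pass"]
print(f"kappa = {cohens_kappa(a, b):.2f}")
```

In this example the raters agree on 7 of 10 items (Po = 0.7), but their label frequencies alone would produce 50% agreement by chance (Pe = 0.5), so κ = 0.4. Raw agreement overstates how well the rubric is working.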

Pairwise Comparison

Instead of absolute ratings, show raters two outputs side-by-side and ask "which is better?" Pairwise comparisons are typically more consistent than absolute ratings because it's easier to compare than to measure. This is how RLHF preference data is commonly collected and how Chatbot Arena-style leaderboards work.

Think of it this way: Asking "rate this movie 1–10" is hard. Asking "did you prefer movie A or movie B?" is easy. The same logic applies to LLM output quality — pairwise comparison has lower cognitive load and higher consistency.

Elo Ratings from Pairwise Comparisons

If you collect many pairwise comparisons across multiple model versions, you can compute Elo scores, the same rating system used in chess (it is named after Arpad Elo, so it is not an acronym). Each comparison updates both models' scores: the winner's score increases and the loser's decreases by the same amount, and that amount grows with how unexpected the result was.

python
def update_elo(winner_elo: float, loser_elo: float, k: float = 32) -> tuple[float, float]:
    """Update Elo scores after one pairwise comparison."""
    # Probability that the eventual winner was expected to win, given the rating gap
    expected_win = 1 / (1 + 10 ** ((loser_elo - winner_elo) / 400))
    delta = k * (1 - expected_win)  # small for expected wins, large for upsets
    return winner_elo + delta, loser_elo - delta

# Example: model_a=1200, model_b=1000. model_a wins (expected)
a, b = update_elo(1200, 1000)  # a gains ~7.7pts, b loses ~7.7pts

# model_b upsets model_a
b, a = update_elo(1000, 1200)  # b gains ~24.3pts (big upset), a loses ~24.3pts

Inter-Annotator Agreement Simulator

Adjust rater agreement to see how Cohen's Kappa changes. Below 0.4 is "fair" — your rubric needs work.

Why do AI researchers prefer pairwise comparisons over absolute rating scales for human eval?

Chapter 3: Automated Metrics

Human eval is expensive and slow. For CI/CD pipelines, you need something that runs in seconds. Automated metrics compare system outputs to reference outputs using algorithms — no human in the loop. The catch: they're approximations, and some are very bad approximations.

BLEU — Bilingual Evaluation Understudy

BLEU counts n-gram overlaps between generated text and reference text. If the reference says "the cat sat on the mat" and the output says "a cat sat on a rug," BLEU counts matching 1-grams (cat, sat, on), matching 2-grams (cat sat, sat on), and so on up to 4-grams by default, then combines the per-order precisions with a geometric mean and a brevity penalty.

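BLEU = BP · exp(∑_{n=1}^{N} w_n log p_n)

Where p_n is the clipped (modified) precision of n-grams of order n, w_n is the weight for each order (typically 1/N with N = 4), and BP penalizes outputs shorter than the reference. BLEU ranges 0–1 (some tools, such as sacrebleu, report it on a 0–100 scale); above 0.3 is generally "decent" for translation tasks.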
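
To make the p_n term concrete, the sketch below computes clipped n-gram precision for the cat example using whitespace tokenization. It is a simplified illustration of one ingredient of BLEU, not a full implementation (no brevity penalty, smoothing, or multi-reference handling), and the function names are my own.

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hypothesis: str, reference: str, n: int) -> float:
    """Fraction of hypothesis n-grams that also occur in the reference.

    Each n-gram's count is clipped to its count in the reference, so
    repeating a matching word cannot inflate the score.
    """
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

ref = "the cat sat on the mat"
hyp = "a cat sat on a rug"
print(modified_precision(hyp, ref, 1))  # 3 of 6 unigrams match
print(modified_precision(hyp, ref, 2))  # 2 of 5 bigrams match
```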

ROUGE — Recall-Oriented Understudy for Gisting Evaluation

ROUGE is similar to BLEU but recall-oriented — it measures what fraction of the reference n-grams appear in the output. ROUGE-1 uses unigrams, ROUGE-2 uses bigrams, ROUGE-L uses the longest common subsequence. ROUGE is the standard for summarization evaluation.
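
ROUGE-L is straightforward to compute by hand, which helps when interpreting scores. The following is a minimal sketch using whitespace tokens and no stemming; the `rouge_score` package applies its own tokenization, so its numbers can differ slightly from this simplified version.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence (dynamic programming, one row at a time)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference: str, hypothesis: str) -> dict[str, float]:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    lcs = lcs_length(ref, hyp)
    recall = lcs / len(ref) if ref else 0.0      # share of the reference that was recovered
    precision = lcs / len(hyp) if hyp else 0.0   # share of the output that is on-reference
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_l("the cat sat on the mat", "a cat sat on a rug"))
```

Here the longest common subsequence is "cat sat on" (3 tokens), so precision, recall, and F1 are all 0.5.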

BERTScore

BERTScore replaces exact string matching with semantic similarity. It embeds both the reference and the output using a BERT model, then computes precision, recall, and F1 over token embeddings. "The cat sat" and "A feline rested" score high in BERTScore but low in BLEU — semantic equivalence without word overlap.

python
from bert_score import score as bert_score
from rouge_score import rouge_scorer
import sacrebleu

refs  = ["The patient reported no adverse side effects."]
hyps  = ["The patient noted no negative side effects."]

# BLEU (sacrebleu reports on a 0–100 scale; the second argument is a list of reference streams)
bleu = sacrebleu.corpus_bleu(hyps, [refs]).score
print(f"BLEU: {bleu:.1f}")  # low: exact n-gram matching misses "reported"→"noted"

# ROUGE-L
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
r = scorer.score(refs[0], hyps[0])  # argument order is (target, prediction)
print(f"ROUGE-L F1: {r['rougeL'].fmeasure:.2f}")  # ~0.71 (5 of 7 tokens in the longest common subsequence)

# BERTScore: semantic match (downloads a pretrained model on first use)
P, R, F = bert_score(hyps, refs, lang="en")
print(f"BERTScore F1: {F.mean():.3f}")  # expected to sit much closer to 1.0: synonyms embed similarly

When automated metrics fail: BLEU/ROUGE reward lexical similarity, so they penalize valid paraphrases and reward hallucinations that use the right words. A response that copies the reference word-for-word but adds a critical hallucination in the middle can still score well. Always combine with human eval for high-stakes applications.
Metric Comparison

Five hypothetical outputs for the same reference. See how BLEU, ROUGE, and BERTScore disagree — especially on paraphrases and hallucinations.

You're evaluating a medical summarizer. The output says "aspirin reduces cardiovascular risk" when the source says "aspirin may reduce cardiovascular risk in certain populations." BLEU gives this a high score. Why is BLEU misleading here?

Chapter 4: LLM-as-Judge

Most automated metrics measure lexical overlap, not meaning. Human eval measures meaning, but it's slow. The idea behind LLM-as-judge is to use a strong LLM (GPT-4, Claude 3 Opus) to evaluate the outputs of a weaker or different model. You get an approximation of semantic judgment at machine speed.

The judge LLM receives: the original input, (optionally) a reference answer, and the model's output. It then scores the output according to a rubric you define, and optionally provides a chain-of-thought explanation of its reasoning.

Rubric Design

The rubric is everything. A vague rubric ("rate this response 1–5") produces noisy, unreliable scores. A precise rubric with anchored examples produces scores that track human judgment well. Here's the pattern:

python
import json
import openai

JUDGE_PROMPT = """You are evaluating a customer support chatbot response.

Input: {input}
Response: {response}

Rate the response on FAITHFULNESS (1–3):
1 = Contains claims not supported by the input. Example: "Your order ships in 2 days" when no shipping info was provided.
2 = Mostly grounded but hedges or adds minor unsupported detail.
3 = All claims directly supported by or derivable from the input.

Respond with JSON:
{{"score": <1|2|3>, "reason": "<one sentence>"}}"""

def judge_faithfulness(input_text: str, response: str) -> dict:
    prompt = JUDGE_PROMPT.format(input=input_text, response=response)
    out = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,      # reduces run-to-run variance; not a guarantee of determinism
        response_format={"type": "json_object"},  # ask for a parseable JSON object
    )
    result = json.loads(out.choices[0].message.content)
    if result.get("score") not in (1, 2, 3):
        raise ValueError(f"judge returned an out-of-rubric score: {result!r}")
    return result

Calibration and Bias

LLM judges have known biases you must compensate for. Position bias: in a pairwise comparison, the judge tends to favor one position, often the first response shown. Fix: always run comparisons in both orders and average. Length bias: judges tend to prefer longer responses, even when they're padded. Fix: explicitly instruct "do not consider length when scoring," and check whether scores still correlate with length. Self-preference bias: a judge tends to prefer outputs from its own model family (GPT-4 favoring GPT-4 outputs, for example). Fix: use a judge from a different family than the models being compared (Claude to judge GPT outputs, or vice versa).
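
The both-orders fix for position bias can be sketched as a thin wrapper around whatever judge you use. The `judge(prompt, first, second)` signature below is a hypothetical interface, and the two stand-in judges exist only to demonstrate the behavior. For a binary verdict, "averaging" the two orders amounts to calling a tie when they disagree.

```python
from typing import Callable

# Hypothetical judge interface: returns "first" or "second" for the preferred response.
Judge = Callable[[str, str, str], str]

def debiased_preference(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Ask in both orders; return 'a', 'b', or 'tie' when the two orders disagree."""
    forward = judge(prompt, a, b)    # a is shown first
    backward = judge(prompt, b, a)   # b is shown first
    a_wins = (forward == "first") + (backward == "second")
    if a_wins == 2:
        return "a"
    if a_wins == 0:
        return "b"
    return "tie"

# Stand-in judges for demonstration only (a real judge would call an LLM).
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"  # pure position bias

def prefers_longer(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"

print(debiased_preference(always_first, "q", "answer one", "answer two"))
print(debiased_preference(prefers_longer, "q", "a longer answer", "short"))
```

A judge driven purely by position yields a tie, which is the correct outcome: it carried no information about quality. Note that the wrapper does nothing about length bias, which the second stand-in judge still exhibits.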

Calibration process: Before trusting your judge at scale, calibrate it against human ratings on 50–100 examples. Plot judge scores vs human scores and compute a rank correlation. As a rule of thumb, a correlation above 0.7 suggests the judge tracks human judgment well enough to use; if it is lower, revise the rubric or switch models.
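
One way to compute the judge-versus-human correlation is Spearman's rank correlation, which suits ordinal rubric scores. Below is a dependency-free sketch; the helper names and the sample scores are invented for illustration, and `scipy.stats.spearmanr` is a reasonable alternative if SciPy is already in your stack.

```python
import math

def pearson(x: list[float], y: list[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("constant scores have no defined correlation")
    return cov / (sx * sy)

def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical calibration sample: human rubric scores vs judge scores on the same cases
human = [3, 1, 2, 3, 1, 2, 3, 2]
judge = [3, 1, 2, 2, 1, 3, 3, 2]
print(f"Spearman correlation: {spearman(human, judge):.2f}")
```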
Judge Bias Visualizer

Simulate how position bias and length bias affect judge scores. Run both orders to see the discrepancy.

You're using Claude to judge GPT-4 outputs vs Claude outputs in a pairwise comparison. What bias should you most worry about and how do you fix it?

Chapter 5: Eval Datasets

Your eval is only as good as your dataset. A hundred test cases that all come from the product demo script will not catch real failures. Building an eval dataset is an engineering discipline with its own quality bar.

What Makes a Good Eval Dataset

A good eval set is representative: it covers the real distribution of user inputs, not just the easy or common ones. It includes edge cases — inputs that probe specific failure modes. It includes adversarial examples — inputs crafted to expose weaknesses. And it's versioned so you can track score changes over time.

Sampling Strategy

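The worst way to build an eval set: have engineers write examples from memory. The best way: mine from real user logs (with privacy safeguards). If you don't have user logs yet, draft candidate inputs that cover the categories you expect. Either way, use stratified sampling: define categories (easy/medium/hard, topic A/B/C, formats), then sample from each stratum. Proportional allocation mirrors real traffic, while a fixed number per stratum guarantees that rare categories are represented. Reserve 20% as a held-out test set for regression testing, and never use those cases to make prompt engineering decisions.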

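python
import random
from collections import defaultdict
from typing import Callable, Hashable, Optional

def stratified_sample(items: list, strata_key: Callable[[dict], Hashable],
                      n_per_stratum: int, seed: Optional[int] = None) -> list:
    """Sample up to n_per_stratum items from each stratum defined by strata_key(item)."""
    rng = random.Random(seed)  # seed it so the same dataset version can be rebuilt
    buckets = defaultdict(list)
    for item in items:
        buckets[strata_key(item)].append(item)

    result = []
    for bucket in buckets.values():
        # Small strata contribute everything they have
        result.extend(rng.sample(bucket, min(n_per_stratum, len(bucket))))
    return result

# Mine logs, stratify by topic + difficulty.
# user_logs is assumed to be a list of dicts with "topic" and "difficulty" keys.
sampled = stratified_sample(
    user_logs,
    strata_key=lambda x: (x["topic"], x["difficulty"]),
    n_per_stratum=15,
    seed=42,
)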

The Contamination Problem

Contamination occurs when your eval data leaks into your training or prompt engineering process. If you iterate on prompts by testing on your eval set, those test cases are no longer independent — the prompt is now overfit to them. The fix: maintain a strict separation between your "development set" (used for iteration) and your "test set" (used only for final, infrequent reporting).
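
One simple way to enforce the dev/test separation is to assign each case to a split by hashing a stable case ID, so the assignment never changes as the dataset grows. This is a sketch under the assumption that every case has such an ID; the function name and the 20% default are illustrative.

```python
import hashlib

def assign_split(case_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a case ID to 'dev' or 'test'.

    Hashing the ID (rather than shuffling) means a case keeps its split
    forever, even when new cases are added later.
    """
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "dev"

case_ids = [f"case-{i}" for i in range(1000)]
test_ids = [c for c in case_ids if assign_split(c) == "test"]
print(f"{len(test_ids)} of {len(case_ids)} cases held out")
```

The design choice here is stability over exactness: the held-out share will be close to 20% rather than exactly 20%, but no case ever migrates from the test set into the development set.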

Versioning is not optional: Every eval dataset should be stored in git with a semantic version number. When scores change, you need to know whether it was because the model changed or the dataset changed. Mixing both is how teams lose months to confusion.
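
Alongside the version number, it can help to record a content fingerprint of the dataset with every eval run, so a score can always be traced to the exact cases that produced it. The sketch below assumes each case is a JSON-serializable dict with a unique "id" field; both the field name and the function name are hypothetical.

```python
import hashlib
import json

def dataset_fingerprint(cases: list[dict]) -> str:
    """Short content hash that ignores case order but changes on any edit."""
    canonical = json.dumps(
        sorted(cases, key=lambda c: c["id"]),  # order-independent
        sort_keys=True,                         # key-order-independent
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

cases = [
    {"id": "billing-001", "input": "Why was I charged twice?"},
    {"id": "returns-001", "input": "Can I return an opened item?"},
]
print(dataset_fingerprint(cases))
```

Storing this value next to each score makes it immediately visible when two runs that look comparable were actually scored on different data.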
Dataset Coverage Visualizer

A good eval dataset covers all quadrants of difficulty × topic. Sparse coverage means blind spots. Click to add cases.

You've been iterating on your system prompt for two weeks using a 100-case eval set and your score went from 0.6 to 0.85. What problem should you be most worried about?

Chapter 6: Regression Detection

You update your system prompt. You upgrade your LLM provider. You add a new retrieval step to your RAG pipeline. Any of these changes could improve performance on the cases you tested but silently degrade performance on others. Regression testing runs your full eval suite automatically on every change and blocks deployment if scores drop.

Think of it as CI/CD for AI. Software engineers have had unit tests running in CI for decades. AI engineers need the same discipline — but instead of pass/fail tests, we have score thresholds.

Statistical Significance

This is where most teams make a critical mistake. If your baseline score is 0.82 and the new version scores 0.79, did you regress? Not necessarily. LLM responses have variance — run the same eval twice and you'll get slightly different scores. You need to know if the delta is larger than the noise.

The standard approach: use a paired t-test or bootstrap confidence interval on the per-case score differences. If the 95% CI of the mean difference excludes zero, the change is statistically significant.

python
from scipy import stats
import numpy as np

def regression_test(baseline_scores: list, new_scores: list, alpha: float = 0.05) -> dict:
    """Paired t-test for regression detection.

    Both lists must hold per-case scores for the same cases in the same order.
    """
    if len(baseline_scores) != len(new_scores):
        raise ValueError("scores must be paired case-by-case")
    diffs = np.array(new_scores) - np.array(baseline_scores)
    # A one-sample t-test on the differences is equivalent to a paired t-test
    t_stat, p_value = stats.ttest_1samp(diffs, popmean=0)
    ci = stats.t.interval(
        1 - alpha,
        df=len(diffs) - 1,
        loc=np.mean(diffs),
        scale=stats.sem(diffs)
    )
    return {
        "mean_delta": float(np.mean(diffs)),
        "p_value": float(p_value),
        "ci_95": ci,  # a 95% interval at the default alpha=0.05
        "significant": bool(p_value < alpha),
        "direction": "improvement" if np.mean(diffs) > 0 else "regression"
    }

# If result["significant"] and result["direction"] == "regression": block the PR
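
The bootstrap alternative mentioned above avoids the t-test's normality assumption, which matters when per-case scores are binary or heavily skewed. Here is a minimal percentile-bootstrap sketch using only the standard library; the function name, resample count, and example differences are illustrative choices.

```python
import random

def bootstrap_ci(diffs: list[float], n_resamples: int = 10_000,
                 confidence: float = 0.95, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of paired per-case score differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample cases with replacement and record the mean difference each time
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((1 - confidence) / 2 * n_resamples)
    hi_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical per-case differences (new score minus baseline score)
diffs = [-0.10, 0.00, -0.05, 0.05, -0.10, -0.05, 0.00, -0.15, 0.05, -0.05]
low, high = bootstrap_ci(diffs)
verdict = "significant" if (low > 0 or high < 0) else "within noise"
print(f"95% CI for mean delta: [{low:.3f}, {high:.3f}] -> {verdict}")
```

The decision rule is the same as for the t-test: if the interval excludes zero, treat the change as real.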

CI Pipeline Setup

In practice: store your eval dataset in a repository. On every PR, run the eval suite against the new code/prompt. Compare scores to the previous commit's scores using the statistical test above. If there's a statistically significant regression on any tracked metric, the CI check fails and the PR cannot merge.
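
A CI check ultimately comes down to a script that returns a non-zero exit code when a gate fails. The sketch below shows the simplest form, absolute per-metric floors; the threshold values and function names are hypothetical, and in a real pipeline you would combine this with the significance test against the previous commit.

```python
# Hypothetical per-metric floors; tune these to your own baseline.
THRESHOLDS = {"correctness": 0.80, "faithfulness": 0.75}

def failing_metrics(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Metrics that are below their floor, or missing from the run entirely."""
    return [
        metric for metric, floor in thresholds.items()
        if metric not in scores or scores[metric] < floor
    ]

def gate(scores: dict[str, float], thresholds: dict[str, float]) -> int:
    """Return a process exit code: 0 to allow the merge, 1 to block it."""
    failures = failing_metrics(scores, thresholds)
    for metric in failures:
        got = scores.get(metric, "missing")
        print(f"FAIL {metric}: {got} (floor {thresholds[metric]})")
    return 1 if failures else 0

exit_code = gate({"correctness": 0.91, "faithfulness": 0.73}, THRESHOLDS)
print(f"exit code: {exit_code}")
# In a CI job, finish with: sys.exit(exit_code)
```

Treating a missing metric as a failure is deliberate: a grader that silently stops reporting should block the merge rather than pass it.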

The operational discipline: Running evals once at launch is not enough. Every prompt change, model version bump, or retrieval config update needs an eval run. The 30–60 minutes it takes to run evals before each deployment will save you from hours of incident response.
Regression Detection Simulator

Simulate a series of model versions. Some have real regressions; some just look like regressions due to variance. See when the test detects the real change.

Your eval score drops from 0.84 to 0.81 after a prompt change. Without a significance test, you conclude it's a regression and revert. What's wrong with this decision?

Chapter 7: A/B Testing

Offline evals tell you how your system performs on a fixed dataset. But the real question is: how does it perform on actual users? Users are the ultimate ground truth, and their behavior tells you things no eval dataset captures. A/B testing (online eval) routes live traffic to two versions of your system and measures the difference in real outcomes.

Interleaving vs. Holdout Split

The naive approach is a holdout split: 50% of users see version A, 50% see version B. The problem: users differ. Random assignment keeps the comparison unbiased on average, but in any single experiment version B may happen to get more experienced users, and that between-user variance means you need more traffic to see a real effect. Interleaving is better for search/retrieval: show results from both versions in a single response (blended), and track which results users click. Because both versions compete for the same user's attention, user-level variability cancels out.

Think of it this way: A/B holdout is like testing two menus at two restaurants with different customers. Interleaving is like putting both menus on the same table — now the same customer chooses, which is a much fairer test.
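
For a concrete picture of interleaving, here is a simplified sketch in the style of team-draft interleaving: in each round a coin flip decides which version picks first, each version contributes its best result not yet shown, and clicks are credited to the version that contributed the result. The function names and document IDs are illustrative, and production implementations handle more edge cases.

```python
import random
from typing import Optional

def team_draft_interleave(ranking_a: list[str], ranking_b: list[str], k: int,
                          seed: Optional[int] = None) -> tuple[list[str], dict[str, str]]:
    """Blend two rankings into one list of up to k results, tracking who contributed each."""
    rng = random.Random(seed)
    interleaved: list[str] = []
    team_of: dict[str, str] = {}

    def next_unseen(ranking: list[str]) -> Optional[str]:
        return next((doc for doc in ranking if doc not in team_of), None)

    while len(interleaved) < k:
        order = [("A", ranking_a), ("B", ranking_b)]
        rng.shuffle(order)  # coin flip: who picks first this round
        progressed = False
        for team, ranking in order:
            doc = next_unseen(ranking)
            if doc is not None and len(interleaved) < k:
                interleaved.append(doc)
                team_of[doc] = team
                progressed = True
        if not progressed:
            break  # both rankings are exhausted
    return interleaved, team_of

def credit_clicks(clicked: list[str], team_of: dict[str, str]) -> dict[str, int]:
    """Count clicks per contributing version."""
    wins = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in team_of:
            wins[team_of[doc]] += 1
    return wins

shown, team_of = team_draft_interleave(["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d6"], k=4, seed=0)
print(shown, credit_clicks(shown[:1], team_of))
```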

Statistical Power

The core question: how many users do you need to run an A/B test? This depends on three things: the baseline conversion rate, the minimum effect size you want to detect (your minimum detectable effect, or MDE), and the desired statistical power (typically 80–90%). Use a power calculator before running your test, not after.

python
import numpy as np
from statsmodels.stats.power import NormalIndPower

def required_sample_size(
    effect_size: float = 0.05,  # MDE: 5% relative improvement
    baseline: float = 0.30,    # baseline success rate
    power: float = 0.80,       # 80% power
    alpha: float = 0.05        # 5% significance level
) -> int:
    # Cohen's h for proportions
    p2 = baseline * (1 + effect_size)
    h = 2 * (np.arcsin(np.sqrt(p2)) - np.arcsin(np.sqrt(baseline)))
    # Cohen's h pairs with a two-sample z-test, so use the normal-based power solver
    analysis = NormalIndPower()
    n = analysis.solve_power(effect_size=h, power=power, alpha=alpha)
    return int(np.ceil(n))

# Detect a 5% relative improvement in resolution rate (30% → 31.5%) at 80% power
n = required_sample_size()
print(f"Need {n} users per variant = {2*n} total")  # roughly 15,000 per variant

Guardrail Metrics

When running A/B tests, always monitor guardrail metrics alongside your primary metric. Guardrail metrics are ones you can't allow to degrade: safety violation rate, refusal rate on valid requests, latency p95. An experiment that improves helpfulness by 8% but doubles the safety violation rate must be stopped, even if the primary metric looks great.
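
A guardrail check can be expressed as a small function that runs before the primary metric is even considered. The sketch below is one possible shape; the metric names, tolerances, and decision strings are assumptions for illustration, and a production version would also test whether the guardrail change is statistically significant.

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    name: str
    max_relative_increase: float  # e.g. 0.10 tolerates a 10% relative increase

def guardrail_violations(control: dict[str, float], treatment: dict[str, float],
                         guardrails: list[Guardrail]) -> list[str]:
    """Names of guardrail metrics that got worse by more than their tolerance."""
    violations = []
    for g in guardrails:
        limit = control[g.name] * (1 + g.max_relative_increase)
        if treatment[g.name] > limit:
            violations.append(g.name)
    return violations

def ship_decision(primary_lift: float, primary_significant: bool,
                  violations: list[str]) -> str:
    # Guardrails are checked first: a primary-metric win never overrides them.
    if violations:
        return "stop: guardrail breached (" + ", ".join(violations) + ")"
    if primary_significant and primary_lift > 0:
        return "ship"
    return "keep running or revert"

# Hypothetical experiment readout
guardrails = [Guardrail("safety_violation_rate", 0.10), Guardrail("latency_p95_ms", 0.10)]
control = {"safety_violation_rate": 0.001, "latency_p95_ms": 1800}
treatment = {"safety_violation_rate": 0.004, "latency_p95_ms": 1850}
violations = guardrail_violations(control, treatment, guardrails)
print(ship_decision(primary_lift=0.06, primary_significant=True, violations=violations))
```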

A/B Test Power Calculator

Drag sliders to see how effect size and baseline rate affect required sample size. Small improvements require many more users.

Your A/B test shows version B improves task completion rate by 6% (p=0.02). But guardrail monitoring shows safety violations increased from 0.1% to 0.4%. What should you do?

Chapter 8: Eval Dashboard — Full System Simulation

Everything comes together here. You have a simulated AI system that answers customer service questions. You have 50 eval cases. You have two model versions (v1 and v2). Run the eval, see the scores across all dimensions, detect the regression, and compare versions — all in one dashboard.

Press Run Eval v1 to evaluate the first model. Then Run Eval v2 to see a version with an injected regression. The dashboard shows you exactly which cases degraded and which metrics were affected.

Eval Dashboard — 50 Cases, 4 Metrics, 2 Versions

Run evals on both versions to see aggregate scores, per-case results, and regression detection. Hover over bars to see which category degraded.

What You're Seeing

The dashboard shows four metrics (correctness, faithfulness, relevance, safety) across 50 simulated cases grouped into five categories (billing, returns, shipping, policy, technical). Version 2 has a regression: its faithfulness score drops on billing and policy categories because a new prompt change made it overconfident when no source document supports its claims.
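
The per-category breakdown behind a dashboard like this is a small aggregation. The sketch below shows one way to find which categories regressed between two versions; the row format, the 0.05 minimum drop, and the toy scores are assumptions for illustration and are not the dashboard's actual data.

```python
from collections import defaultdict

def mean_by_category(results: list[dict], metric: str) -> dict[str, float]:
    """Average a metric within each category."""
    grouped = defaultdict(list)
    for row in results:
        grouped[row["category"]].append(row[metric])
    return {category: sum(vals) / len(vals) for category, vals in grouped.items()}

def regressed_categories(v1: list[dict], v2: list[dict], metric: str,
                         min_drop: float = 0.05) -> list[str]:
    """Categories whose mean score fell by at least min_drop from v1 to v2."""
    before, after = mean_by_category(v1, metric), mean_by_category(v2, metric)
    return sorted(c for c in before if c in after and before[c] - after[c] >= min_drop)

# Toy per-case results for two versions
v1 = [
    {"category": "billing", "faithfulness": 0.90},
    {"category": "billing", "faithfulness": 0.80},
    {"category": "shipping", "faithfulness": 0.90},
    {"category": "shipping", "faithfulness": 0.90},
]
v2 = [
    {"category": "billing", "faithfulness": 0.60},
    {"category": "billing", "faithfulness": 0.70},
    {"category": "shipping", "faithfulness": 0.90},
    {"category": "shipping", "faithfulness": 0.88},
]
print(regressed_categories(v1, v2, "faithfulness"))
```

Slicing by category matters because an aggregate score can stay above its threshold while one category collapses.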

The regression is caught automatically because the CI score threshold (0.75 faithfulness) is violated. Without this dashboard, the change would have shipped and caused real hallucinations in the product.

The core loop: define metrics → build dataset → run evals → set thresholds → automate in CI → investigate regressions → fix and re-run. This loop is the engineering discipline that separates "it seems to work" from "we know it works."
Version 2 scores 0.73 on faithfulness (below the 0.75 threshold) but 0.91 on correctness (up from 0.88). Should v2 be deployed?

Chapter 9: Connections

Evals don't exist in isolation. They're the diagnostic layer that connects every other AI engineering discipline. The quality of your eval system determines the quality of every decision you make about your AI product.

Evals → RAG

When your RAG pipeline has a faithfulness regression, evals tell you exactly which cases failed. You then debug the retrieval step (wrong chunks retrieved), the context assembly (important context truncated), or the generation step (model ignores retrieved context). Without evals, you'd be debugging blind. See Multimodal RAG for the retrieval mechanics.

Evals → Agents

Agent evals are harder — the output isn't just text, it's a sequence of actions with state changes. You need to eval task completion rate, tool call correctness, and step efficiency simultaneously. See Agent Evaluation for the specialized methods (pass@k, tau-bench, Swiss cheese model) that handle agentic complexity.

Evals → Fine-Tuning Decisions

Fine-tuning is expensive and irreversible in the short term. Before fine-tuning, run a targeted eval on the behavior you want to improve. If the eval score is already above 0.85, fine-tuning may not be worth the cost. If it's below 0.6, identify whether the problem is data (the model hasn't seen this pattern) or capability (the model can't do this even with examples) — each requires a different fix.

Evals → Safety

Safety is just another eval dimension, but it gets special treatment: it's a guardrail metric, meaning any regression blocks deployment. Safety evals should include adversarial examples, jailbreak attempts, and edge cases that probe policy boundaries. See AI Safety & Guardrails for the layered defense system.

| Eval connects to... | What it measures | Key signal |
| --- | --- | --- |
| RAG pipelines | Faithfulness, context utilization | Hallucination rate per category |
| Agent systems | Task completion, tool use accuracy | pass@k and pass^k |
| Fine-tuning | Pre/post score on target behavior | Effect size of fine-tuning |
| Safety guardrails | Violation rate, refusal accuracy | False positive + false negative rates |
| Model selection | Cost-performance tradeoff | Score per dollar, score per second |
| Prompt engineering | Iteration quality signal | Dev set score trajectory |

"What I cannot measure, I cannot improve." — Kelvin's principle, adapted for AI. Evals are not bureaucratic overhead. They're the scientific method applied to AI systems: hypothesis (this change will improve quality), measurement (run the eval), conclusion (ship or revert). Without evals, you're not engineering — you're guessing.

What to Build Next

You now have a complete mental model of AI evaluation. The practical next step: pick one system you're working on, define two metrics, build a 30-case eval set, and run it. Don't try to build a perfect system — start with something imperfect that runs in CI. Every eval cycle will reveal what to measure next.

Step 1: Define 2 metrics that matter most for your system's failure modes
Step 2: Build 30 cases: 20 typical, 5 edge cases, 5 adversarial
Step 3: Choose a grader: exact match for factual, LLM-as-judge for semantic
Step 4: Set thresholds. Run in CI. Block on regression.
Step 5: Grow the dataset as you discover new failure modes from production, then iterate
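
For Step 3, an exact-match grader usually needs light normalization so that trivial formatting differences do not count as failures. Here is a minimal sketch; the normalization rules are an assumption and should be adapted to your answers. Stripping punctuation, for example, is a poor fit for numeric answers, because "3.5" and "35" would normalize to the same string.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(output: str, expected: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return 1.0 if normalize(output) == normalize(expected) else 0.0

def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Mean exact-match score over (output, expected) pairs."""
    return sum(exact_match(o, e) for o, e in pairs) / len(pairs)

pairs = [("  Paris. ", "paris"), ("Lyon", "Paris"), ("New   York", "new york")]
print(f"exact-match accuracy: {accuracy(pairs):.2f}")
```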