This note covers the evaluation discipline for AI systems: how to measure whether an AI's outputs are good, keep them from silently getting worse over time, and build a trustworthy answer key to measure against. It spans expert labeling, golden datasets, regression testing, and the supporting practices around them — automated judging, multi-level evals, the offline/online split, and handling randomness. It deliberately excludes how the model is built or improved; the focus is strictly on how it is judged.
Unlike traditional software, AI systems are probabilistic: they produce variable outputs across runs, so they cannot be tested with exact assertions.[1] "Correct" is often fuzzy, and behavior drifts over time even when nothing in your own code changes, because an upstream provider can update the underlying model with no change on your side (model drift).[2] The job, therefore, is to define "good," measure it repeatably, and detect degradation before users do.
A domain expert reviews a response and produces a judgment: preferably the ideal output together with a rationale for why the response is right or wrong, not merely a thumbs up or down. Because expert judgments are noisy and two experts often disagree, mature pipelines use multiple labelers, an explicit rubric, and a measure of agreement between them. A label is only as trustworthy as the consistency behind it.
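One common way to put a number on agreement between two labelers is Cohen's kappa, which discounts the agreement expected by chance. The note does not name a specific statistic, so treat this as a minimal sketch of one reasonable choice; the function name and the pass/fail labels are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers who rated the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items where the two labelers gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each labeler's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels from two hypothetical experts on the same eight responses.
expert_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
expert_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail"]
kappa = cohens_kappa(expert_1, expert_2)
print(kappa)  # 0.5: raw agreement is 75%, but half of that is expected by chance
```

A low kappa is usually a signal to tighten the rubric before adding more labels.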
The trusted, curated subset of those labels becomes the golden set, the answer key: trusted inputs paired with ideal outputs, hand-labeled by people with domain expertise, serving as the benchmark for output quality.[3] It is versioned and deliberately stocked with hard, grey-area, and borderline cases.[4] A consequence to expect: because the set is weighted toward difficulty, scores on it are biased downward, so a 75% golden-set score can coexist with 92% real-world accuracy by design.[4]
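The arithmetic behind that gap can be sketched as a weighted average. The per-stratum accuracies and the two mixes below are invented purely to reproduce the 75% and 92% figures; they are not measurements from any real system.

```python
def blended_accuracy(accuracy_by_stratum, mix):
    """Overall accuracy for a dataset whose strata appear in proportions `mix`."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix proportions must sum to 1")
    return sum(accuracy_by_stratum[name] * share for name, share in mix.items())


# Illustrative, made-up per-stratum accuracies for one system.
accuracy_by_stratum = {"routine": 0.95, "hard": 0.70}

golden_mix = {"routine": 0.20, "hard": 0.80}      # golden set over-samples hard cases
production_mix = {"routine": 0.88, "hard": 0.12}  # live traffic is mostly routine

golden_score = blended_accuracy(accuracy_by_stratum, golden_mix)
production_score = blended_accuracy(accuracy_by_stratum, production_mix)
print(f"golden set: {golden_score:.2f}, production: {production_score:.2f}")
# prints "golden set: 0.75, production: 0.92"
```

The system is the same in both rows; only the mix of cases differs, which is why the golden-set score should be read against its own baseline rather than against production accuracy.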
You run the system against the golden set whenever something changes (a prompt edit, a retrieval-pipeline tweak, or a new model version from your provider) and compare scores to the historical baseline. This regression testing relies on algorithmic scoring rather than human intuition to measure accuracy, relevance, and safety across iterative versions.[5] It is what catches the silent few-percent drop that no single test would reveal.
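A baseline comparison of this kind might look like the sketch below. The metric names follow the paragraph above; the function name, the scores, and the 0.02 tolerance are hypothetical choices, not values from the source.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return metrics whose score fell more than `tolerance` below the baseline."""
    regressions = {}
    for metric, base_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None:
            regressions[metric] = (base_score, None)  # metric missing from this run
        elif base_score - new_score > tolerance:
            regressions[metric] = (base_score, new_score)
    return regressions


# Made-up golden-set scores before and after a prompt edit.
baseline = {"accuracy": 0.81, "relevance": 0.88, "safety": 0.99}
current = {"accuracy": 0.77, "relevance": 0.875, "safety": 0.99}

regressions = find_regressions(baseline, current)
for metric, (before, after) in regressions.items():
    print(f"{metric}: {before} -> {after}")  # only accuracy is flagged here
```

In a CI setting, a non-empty result would typically block the change until someone has looked at the failing cases.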
Human labeling does not scale, so a strong model is used to grade outputs against a rubric. The catch most teams miss: the judge itself must first be validated against the human-labeled golden set, or you are simply trusting one unverified model to grade another.[3]
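Validating the judge can be as simple as comparing its verdicts with the human labels on the same golden-set items. This is a sketch under assumptions: the pass/fail label scheme, the `min_agreement` threshold of 0.9, and the function name are all hypothetical.

```python
def validate_judge(human_labels, judge_labels, min_agreement=0.9):
    """Compare an automated judge's verdicts with human labels on the same items."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("need two equal-length, non-empty label lists")
    pairs = list(zip(human_labels, judge_labels))
    agreement = sum(h == j for h, j in pairs) / len(pairs)
    # The costly error: the judge passes an output that the humans failed.
    human_fails = sum(h == "fail" for h, _ in pairs)
    false_passes = sum(h == "fail" and j == "pass" for h, j in pairs)
    false_pass_rate = false_passes / human_fails if human_fails else 0.0
    return {
        "agreement": agreement,
        "false_pass_rate": false_pass_rate,
        "trusted": agreement >= min_agreement,
    }


# Made-up verdicts on ten golden-set items.
human = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "pass", "pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]

report = validate_judge(human, judge)
print(report)  # agreement 0.8, false_pass_rate 0.25, trusted False
```

Tracking the false-pass rate separately matters because a judge that is lenient on bad outputs hides exactly the regressions the golden set exists to catch.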
You do not only score the final answer. For agents especially, you evaluate the intermediate steps too — session-level agent evals, trace-level retrieval evals, and span-level tool outcomes[6] — because an agent can reach a correct answer through broken reasoning or a bad tool call.
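The three levels can be sketched with a small nested record: a session holds traces, and a trace holds tool-call spans. The class names, fields, and the choice of retrieval recall as the trace-level metric are assumptions made for illustration, not a schema from the source.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    tool: str
    succeeded: bool  # span level: did this tool call produce a usable result?


@dataclass
class Trace:
    retrieved_ids: list[str]
    relevant_ids: list[str]  # from the golden set: documents that should be retrieved
    spans: list[Span] = field(default_factory=list)


@dataclass
class Session:
    final_answer_correct: bool  # session level: was the end result right?
    traces: list[Trace] = field(default_factory=list)


def evaluate_session(session):
    """Score one agent session at the session, trace, and span levels."""
    span_results = [s.succeeded for t in session.traces for s in t.spans]
    recalls = []
    for trace in session.traces:
        relevant = set(trace.relevant_ids)
        if relevant:
            recalls.append(len(relevant & set(trace.retrieved_ids)) / len(relevant))
    return {
        "session_correct": session.final_answer_correct,
        "retrieval_recall": sum(recalls) / len(recalls) if recalls else None,
        "tool_success_rate": sum(span_results) / len(span_results) if span_results else None,
    }


# A made-up session: the final answer is right, but the steps behind it are shaky.
session = Session(
    final_answer_correct=True,
    traces=[Trace(["d1", "d7"], ["d1", "d2"], [Span("search", True), Span("calculator", False)])],
)
scores = evaluate_session(session)
print(scores)  # session_correct True, retrieval_recall 0.5, tool_success_rate 0.5
```

A final-answer-only eval would mark this session as a clean pass; the lower-level scores show it got there with half the relevant documents and a failed tool call.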
Offline evaluation runs during development on curated datasets before deployment; online evaluation monitors live traffic in production.[2] The golden set is your offline gate; production monitoring is the online watch. The two feed each other.
Identical inputs produce varying outputs, so evals become "flaky." You manage this by lowering temperature for reproducibility where possible, and otherwise aggregating scores across multiple runs to separate signal from noise.[2] Treat a score drop as a statistical question, not a single pass/fail.
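Aggregating across runs can be sketched as a mean with a standard error, so that a score is reported with its noise attached. The run scores below are made up, and the function name is hypothetical.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, standard error) for scores from repeated runs of the same eval."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate run-to-run noise")
    mean = statistics.fmean(scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error of the mean.
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


# Made-up golden-set scores from five runs of an unchanged system.
run_scores = [0.78, 0.81, 0.80, 0.79, 0.82]
mean, stderr = summarize_runs(run_scores)
print(f"{mean:.3f} +/- {stderr:.3f}")  # prints "0.800 +/- 0.007"
```

A later score that sits within a couple of standard errors of this mean is hard to distinguish from noise on its own.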
These pieces form a self-reinforcing loop. Production reveals failures; triage decides which ones permanently raise the bar; the hardened golden set then gates every future change before it ships.
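The triage step of that loop can be sketched as a filter that promotes only expert-confirmed failures into a versioned, append-mostly golden set. Everything here is a hypothetical shape: the class names, the failure-record keys, and the rule that each addition bumps the version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    input_text: str
    ideal_output: str
    added_in_version: int


class GoldenSet:
    """Append-mostly answer key; every addition bumps the version."""

    def __init__(self):
        self.version = 0
        self.cases = {}

    def promote(self, case_id, input_text, ideal_output):
        if case_id in self.cases:
            return False  # already part of the answer key
        self.version += 1
        self.cases[case_id] = GoldenCase(case_id, input_text, ideal_output, self.version)
        return True


def triage(failures, golden_set):
    """Promote only failures an expert confirmed and labeled with an ideal output."""
    promoted = []
    for failure in failures:
        if failure.get("expert_confirmed") and failure.get("ideal_output"):
            if golden_set.promote(failure["id"], failure["input"], failure["ideal_output"]):
                promoted.append(failure["id"])
    return promoted


# Made-up production failures; only the first is confirmed and labeled.
production_failures = [
    {"id": "f-101", "input": "example question A", "ideal_output": "expert answer A", "expert_confirmed": True},
    {"id": "f-102", "input": "example question B", "ideal_output": None, "expert_confirmed": False},
    {"id": "f-101", "input": "example question A", "ideal_output": "expert answer A", "expert_confirmed": True},
]
golden = GoldenSet()
promoted = triage(production_failures, golden)
print(promoted, golden.version)  # ['f-101'] 1
```

Recording the version in which each case was added makes it possible to compare a new score against a baseline computed on the same set of cases.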
The throughline: labeling defines what "good" means, the golden set freezes that definition into a fixed answer key, regression testing measures every change against it, and the production-to-triage feedback step keeps the answer key growing harder over time. Because you mostly add to the golden set and rarely remove from it, your reliability bar ratchets steadily upward.
This loop is only as good as the labeling consistency feeding it. A sloppy or ambiguous rubric poisons everything downstream — the golden set inherits the noise, and every score built on it inherits the doubt.
Judging "did this actually improve" under non-determinism is genuinely hard. That is why the statistical treatment of scores is not a footnote but load-bearing: without it, you cannot tell a real regression from run-to-run noise.
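One way to ask whether a drop is a real regression or noise is a permutation test on scores from repeated runs. The note does not prescribe a particular test, so this is a sketch of one option; the run scores are made up, and the seed and permutation count are arbitrary.

```python
import random
import statistics


def permutation_p_value(baseline_runs, candidate_runs, n_permutations=2000, seed=0):
    """One-sided p-value: how often would a drop this large appear by chance alone?"""
    rng = random.Random(seed)  # fixed seed so the check itself is reproducible
    observed = statistics.fmean(baseline_runs) - statistics.fmean(candidate_runs)
    pooled = list(baseline_runs) + list(candidate_runs)
    k = len(baseline_runs)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:k]) - statistics.fmean(pooled[k:])
        if diff >= observed - 1e-12:  # small slack for floating-point rounding
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p of zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up golden-set scores from five runs each.
baseline_runs = [0.80, 0.81, 0.79, 0.82, 0.80]
regressed_runs = [0.70, 0.71, 0.69, 0.72, 0.70]
unchanged_runs = [0.80, 0.81, 0.79, 0.82, 0.80]

p_regressed = permutation_p_value(baseline_runs, regressed_runs)
p_same = permutation_p_value(baseline_runs, unchanged_runs)
print(p_regressed < 0.05, p_same < 0.05)  # True False
```

A small p-value says the drop is unlikely to be shuffle-level noise; it does not say the drop is large enough to matter, which is still a judgment about the tolerance you set.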