AI Evals for PMs: How to Know Your AI Feature Works | PM Toolkit

Run the same prompt through the same model twice and you can get two different answers. That one property breaks most of the measurement habits PMs built on deterministic software, where a given input always produced the same output and a button either converted or did not. The discipline that replaces those habits is the eval: a repeatable test that scores AI outputs against a definition of "good." Engineers build the harness, the PM owns the definition, and a team without that definition is shipping on vibes.

Why your A/B testing instincts don't transfer directly

If you have internalized A/B testing, you already hold most of the statistical intuition evals require. Sampling error, confidence intervals, the temptation to stop early when the numbers look good: all of it carries over. What changes is the thing being measured.

A conversion funnel has one question per step: did the user click or not. An AI answer has several questions at once. Is it factually correct? Is the tone right for the audience? Did it arrive within the latency budget? Did it stay inside safety policy? A single output can pass two of those and fail the others, and a single conversion metric collapses the distinction. Your checkout test never had to decide whether a purchase was "rude."

The second shift is that there is no fixed variant. In a classic test, variant B is the same pixels for every user. In an AI feature, "variant B" is a prompt-and-model combination that produces a different output for every input, and sometimes a different output for the same input. You are comparing two probability distributions of behavior rather than two fixed screens, and you need enough graded samples from each to say anything about the difference.

A/B tests still matter for AI features. They remain the right tool for the question "does this change move retention or task completion in production." Evals answer the question that comes before it: "is this version good enough to put in front of users at all."

Three ways to grade an output

Every eval reduces to the same loop: take an input, generate an output, grade it. The grading method is the design decision, and there are three.
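The loop is small enough to sketch in a few lines. The version below is illustrative only, not a production harness: `run_eval`, `fake_generate`, and `contains_expected` are hypothetical names, and the stand-in "model" just uppercases its input so the example runs without any API calls.

```python
from typing import Callable


def run_eval(
    dataset: list[dict],
    generate: Callable[[str], str],
    grade: Callable[[dict, str], bool],
) -> float:
    """Run every example through the feature and return the pass rate."""
    if not dataset:
        raise ValueError("dataset is empty")
    passed = 0
    for example in dataset:
        output = generate(example["input"])  # the AI feature under test
        if grade(example, output):           # code check, judge, or human label
            passed += 1
    return passed / len(dataset)


# Stand-in model and grader (both hypothetical) so the sketch is runnable.
def fake_generate(prompt: str) -> str:
    return prompt.upper()


def contains_expected(example: dict, output: str) -> bool:
    return example["expected_substring"] in output


dataset = [
    {"input": "refund policy", "expected_substring": "REFUND"},
    {"input": "plan tiers", "expected_substring": "PRICING"},
]
pass_rate = run_eval(dataset, fake_generate, contains_expected)
print(f"pass rate: {pass_rate:.0%}")
```

Swapping the `grade` function is how the same loop serves all three grading methods.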

Grading method	Cost per output	Speed	Reproducible	Catches	Misses
Code-based checks (exact match, regex, schema validation, latency assertions)	Near zero	Milliseconds	Fully	Broken JSON, banned phrases, missing citations, blown latency budgets	Anything needing judgment: tone, relevance, subtle wrongness
LLM-as-judge (a model grades the output against a rubric)	Cents	Seconds	Mostly, with a pinned judge model and rubric	Relevance, instruction-following, faithfulness to a source document	Failure modes the judge model shares with the model being graded
Human review	Dollars, plus reviewer time	Hours to days	Low, since reviewers disagree	Domain correctness, real harm, the failure nobody predicted	Scale. Humans cannot grade every output of every run

Code-based checks are the floor. They are cheap enough to run on every commit, and Hamel Husain's widely shared essay on the topic argues most teams under-invest in exactly this unglamorous layer before reaching for anything fancier¹.
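As a rough illustration of that layer, the sketch below runs four checks on one raw output: latency, valid JSON, a banned-phrase scan, and citation presence. The banned phrases, required keys, and 2,000 ms budget are assumptions invented for the example, and so is the output schema. Substitute whatever your feature actually emits.

```python
import json
import re

# All four constants are hypothetical placeholders, not recommendations.
BANNED_PHRASES = ["guaranteed refund", "as an ai language model"]
REQUIRED_KEYS = {"answer", "citations"}
LATENCY_BUDGET_MS = 2000
CITATION_PATTERN = re.compile(r"^https?://")


def code_checks(raw_output: str, latency_ms: float) -> list[str]:
    """Return failure reasons; an empty list means every check passed."""
    failures = []
    if latency_ms > LATENCY_BUDGET_MS:
        failures.append("latency_budget_exceeded")
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return failures + ["invalid_json"]
    if not isinstance(payload, dict) or not REQUIRED_KEYS.issubset(payload):
        return failures + ["missing_required_keys"]
    answer = str(payload["answer"]).lower()
    if any(phrase in answer for phrase in BANNED_PHRASES):
        failures.append("banned_phrase")
    citations = payload["citations"]
    if not isinstance(citations, list) or not citations:
        failures.append("missing_citations")
    elif not all(isinstance(c, str) and CITATION_PATTERN.match(c) for c in citations):
        failures.append("malformed_citation")
    return failures


good = json.dumps({"answer": "Your plan is Pro.", "citations": ["https://example.com/plans"]})
print(code_checks(good, latency_ms=120))
print(code_checks("not json", latency_ms=120))
```

Returning reasons rather than a single boolean keeps a formatting miss distinguishable from a policy miss when the results are tallied later.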

LLM-as-judge is the workhorse for everything code cannot grade. The approach has real evidence behind it: Zheng et al. found GPT-4 judgments agreed with human raters more than 80 percent of the time on chat-quality comparisons, about the same rate at which humans agree with each other². The same paper documents the judge's weaknesses: position bias (preferring the first answer shown), verbosity bias (preferring longer answers), and self-preference when grading outputs from its own model family².

Calibrate the judge before you trust it. Take 50 outputs your team has hand-graded, run the judge on the same 50, and measure agreement. If the judge and your reviewers disagree on more than roughly one in five, fix the rubric before you automate anything. An uncalibrated judge gives you a precise pass rate that happens to be wrong.
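The calibration step is simple arithmetic, sketched below. The 50 labels are fabricated purely to make the example run, and the function name and the 0.80 cutoff (the "one in five" rule of thumb above) are illustrative assumptions rather than a standard.

```python
def agreement_rate(human_labels: list[bool], judge_labels: list[bool]) -> float:
    """Fraction of outputs where the judge's pass/fail matches the human's."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


# Made-up labels for 50 hand-graded outputs: True = pass, False = fail.
human = [True] * 40 + [False] * 10
judge = [True] * 36 + [False] * 4 + [False] * 7 + [True] * 3

rate = agreement_rate(human, judge)
MIN_AGREEMENT = 0.80  # assumption: disagreeing on more than 1 in 5 means fix the rubric
print(f"agreement: {rate:.0%}")
print("judge usable" if rate >= MIN_AGREEMENT else "fix the rubric first")

# The disagreements are the useful part: read them before rewriting the rubric.
disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
```

Reading the disagreeing examples one by one usually shows whether the rubric is ambiguous or the judge is simply wrong.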

Human review is the most expensive and the least dispensable. It is where new failure modes get discovered, and it is the ground truth the other two methods get calibrated against. Most working setups use all three in a stack: code checks on every commit, judge runs on every release candidate, humans on a weekly sample and on every incident.

Offline gates the release, online finds what you missed

An offline eval runs before release: a fixed dataset, a scoring method, a pass rate. It answers "did this prompt change make the product better or worse" without touching a user. This is your release gate, the AI equivalent of a regression suite.

An online eval watches live traffic after release: sampled outputs run through the judge, thumbs-up and thumbs-down signals, escalation rates, edit distance between the AI draft and what the user actually kept. It answers a question the offline set cannot, because the offline set only contains failures you already know about, and users keep finding new ones. Most production teams run both, and sample a slice of live traffic (5 to 15 percent is typical) for the judge to keep grading costs sane³.
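One common way to pick that slice is to hash a request identifier, so the same request is always in or out of the sample. This is a minimal sketch under assumptions: `should_sample`, the `req-…` ID format, and the 10 percent default are all illustrative, and your logging pipeline may already offer sampling of its own.

```python
import hashlib


def should_sample(request_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministically select roughly `sample_rate` of requests for judge grading."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


# Hypothetical request IDs; in production these come from your logs.
sampled = sum(should_sample(f"req-{i}") for i in range(10_000))
print(f"sampled {sampled} of 10000 requests")
```

Hashing instead of calling a random number generator makes the sample reproducible, which helps when someone asks why a specific request was or was not graded.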

The loop between the two is the part teams skip. A failure surfaces online, someone patches the prompt, and the example never enters the offline set. Six weeks later a different prompt change reintroduces the same failure and the release gate waves it through, because the gate never learned about it.

Make the loop a standing ritual: every online failure that gets triaged also gets added to the golden dataset, the same week. Your offline gate should always be guarding against last month's surprises, not just launch-day guesses.
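The ritual is easier to keep when adding an example is a single call. The sketch below assumes the golden set is a plain list of records. The field names, the `source` labels, and the dedupe-by-normalized-input rule are illustrative choices, not a required schema.

```python
import hashlib
from datetime import date
from typing import Optional


def add_failure(
    golden_set: list[dict],
    user_input: str,
    expected_behavior: str,
    source: str,
    added_on: Optional[date] = None,
) -> bool:
    """Append a triaged failure to the golden set; return False if it is already there."""
    # Normalize case and whitespace so trivially different copies dedupe.
    normalized = " ".join(user_input.lower().split())
    example_id = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]
    if any(example["id"] == example_id for example in golden_set):
        return False
    golden_set.append(
        {
            "id": example_id,
            "input": user_input,
            "expected_behavior": expected_behavior,
            "source": source,  # e.g. "support_ticket", "thumbs_down"
            "added_on": (added_on or date.today()).isoformat(),
        }
    )
    return True


golden_set: list[dict] = []
add_failure(
    golden_set,
    "Can I get a refund after 45 days?",
    "States the refund window and does not promise an exception",
    source="support_ticket",
)
print(len(golden_set))
```

Recording `source` and `added_on` makes it possible to check later whether the set is still being fed from production or has gone stale.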

If your product chains multiple model calls, the offline story gets more granular. Step-level and handoff evals for agent pipelines are covered in building agentic products.

Building a golden dataset

A golden dataset is the fixed set of inputs (and expected behaviors) your offline evals run against. Two decisions matter: what goes in it, and how big it is. Most of the attention goes to size, but the contents decide whether the set catches anything.

Start from real failures. Support tickets where the AI answer was wrong. Thumbs-down events with the output attached. The screenshots people post in your internal Slack channel. Synthetic happy-path examples confirm the model can do what it already does. Failure-sourced examples test the boundary, and the boundary is where the next regression lives. Practitioners keep landing on the same advice: define evals from the failure states you find, not from a list you brainstorm in a room, or every eval passes while customers stay unhappy⁴.

A useful first golden set is 30 examples: 10 from support escalations, 10 from thumbs-down events, 10 edge cases your engineers already worry about (empty input, hostile input, a question outside the product's scope). You can assemble it in an afternoon, and it will typically surface more regressions than 500 generated happy paths.

Coverage comes first, but size is not optional, because your sample-size intuition still applies. Suppose your assistant passes 90 percent of a 100-example set, an engineer changes the prompt, and the rate reads 87 percent. Regression or noise? On 100 examples, the 95 percent confidence interval around a 90 percent pass rate is roughly plus or minus 6 points, so 87 is statistically indistinguishable from 90. To reliably detect a true drop from 90 to 85 percent at the standard 95 percent confidence and 80 percent power, you need around 700 examples per version. It is the same binomial math you use to size an A/B test.

The practical resolution: grow the set over time via the online loop, and treat small-set comparisons as smoke tests rather than verdicts. A 30-example set can tell you a prompt change broke JSON formatting, but it cannot distinguish 87 from 90.
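To run your own scenario, the arithmetic behind those figures fits in a few lines of standard-library Python. This sketch assumes the normal approximation to the binomial, a two-sided test, and unpooled variances. Other textbook variants shift the answer by a few examples, and the function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def pass_rate_margin(pass_rate: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of the confidence interval around an observed pass rate."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(pass_rate * (1 - pass_rate) / n)


def examples_per_version(
    p_before: float, p_after: float, confidence: float = 0.95, power: float = 0.80
) -> int:
    """Examples needed in EACH version to detect a change from p_before to p_after."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_before * (1 - p_before) + p_after * (1 - p_after)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_before - p_after) ** 2)


margin = pass_rate_margin(0.90, 100)
needed = examples_per_version(0.90, 0.85)
print(f"margin on 100 examples: +/- {margin:.1%}")
print(f"examples per version to detect 90% -> 85%: {needed}")
```

For the 90-versus-85 case the result lands in the high 600s, consistent with the "around 700" figure. Halving the detectable drop roughly quadruples the required set size.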

The PM's actual job

You will not write the harness. Here is what is yours.

Define success in terms of the user's task. A model can score well on public benchmarks and still fail your users, because MMLU does not contain your refund policy. Write the rubric the judge uses, in plain language, anchored to what a user came to do: "the answer cites the correct plan tier," "the summary preserves every action item," "the reply never promises a timeline support cannot honor." If you cannot write that sentence, no eval can measure it for you.

Decide what pass rate ships, by severity. A flat threshold ("95 percent overall") hides the distinction between a formatting miss and a hallucinated refund commitment. Tier the failures. A workable starting policy: safety and policy violations gate at 100 percent (one failure blocks the release), task-correctness failures gate at a negotiated rate, style misses get tracked but never block. The thresholds are product decisions about acceptable harm, which is exactly why an engineer should not be setting them alone.
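A tiered policy like this is easy to encode once the thresholds are agreed. The sketch below is one possible shape, not a prescribed one: the `Result` record, the three severity labels, and the 90 percent correctness threshold are assumptions standing in for whatever your team negotiates.

```python
from dataclasses import dataclass


@dataclass
class Result:
    example_id: str
    severity: str  # "safety", "correctness", or "style"
    passed: bool


def release_gate(
    results: list[Result], correctness_threshold: float = 0.90
) -> tuple[bool, list[str]]:
    """Return (ship?, blocking reasons) under a severity-tiered policy."""
    reasons = []

    # Tier 1: a single safety or policy failure blocks the release.
    safety_failures = [r.example_id for r in results if r.severity == "safety" and not r.passed]
    if safety_failures:
        reasons.append(f"safety failures: {', '.join(safety_failures)}")

    # Tier 2: task correctness gates at the negotiated rate.
    correctness = [r for r in results if r.severity == "correctness"]
    if correctness:
        rate = sum(r.passed for r in correctness) / len(correctness)
        if rate < correctness_threshold:
            reasons.append(
                f"correctness pass rate {rate:.1%} below {correctness_threshold:.0%}"
            )

    # Tier 3: style misses are tracked elsewhere and never block.
    return (not reasons, reasons)


results = [
    Result("s1", "safety", True),
    Result("c1", "correctness", True),
    Result("c2", "correctness", False),
    Result("st1", "style", False),
]
ship, reasons = release_gate(results)
print(ship, reasons)
```

Because the gate returns its reasons, a blocked release states which tier failed rather than reporting one blended number.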

Connect eval scores to product metrics. An offline pass rate that climbs quarter after quarter while retention and task completion sit flat means the eval is measuring something users do not feel. Treat the eval suite itself as a hypothesis ("outputs that pass this rubric produce more completed tasks") and check it the way you would any other hypothesis. When the correlation breaks, rewrite the rubric.

Put one line in every AI feature PRD: "This ships when the golden set passes at X percent, with zero severity-1 failures, measured on dataset version Y." That sentence forces the threshold conversation before launch week, when it is still cheap to have.

FAQ

What is an AI eval? A repeatable test that scores an AI feature's outputs against a defined standard of quality. Inputs come from a fixed dataset, outputs get graded by code, by a judge model, or by humans, and the resulting pass rate gates releases the way a regression suite gates deterministic software.

Can I just A/B test my AI feature instead of building evals? No, and you also cannot replace A/B tests with evals. An A/B test measures whether a shipped version moves a product metric. An eval measures whether a version is safe and correct enough to ship at all, across failure modes too rare for any live test to surface. You need both: evals first, to gate the release, then A/B tests to measure its effect in production.

Do PMs write evals themselves? PMs write the rubric and own the thresholds; engineers build the harness and the pipelines. The dividing line: anything that defines "good" or decides "ship" is PM work. Anything that executes the grading at scale is engineering work.

How big should a golden dataset be? Start around 30 failure-sourced examples and grow it with every triaged production failure. Once you start comparing versions, size becomes a statistical question: distinguishing a 90 percent pass rate from 85 percent takes roughly 700 examples per version, which is why small sets are smoke tests rather than verdicts.

Is LLM-as-judge reliable enough to trust? Within limits. Published results show strong judges agreeing with human raters more than 80 percent of the time, comparable to human-to-human agreement². Judges also carry documented biases toward longer answers, first-position answers, and their own model family. Calibrate against a hand-graded sample before trusting the number, and keep a weekly human review in the loop.

Sources

1. Your AI Product Needs Evals (Hamel Husain)
2. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023, arXiv:2306.05685). The over-80-percent agreement figure, matching human-to-human agreement, and the position, verbosity, and self-enhancement bias findings all come from this paper.
3. AI Evals for Product Managers: A Beginner's Guide (Amplitude)
4. AI Evals for Product Managers (Productboard)