A/B Test & Experiment Designer
Design rigorous experiments with a hypothesis, sample size, and analysis plan
- Hypothesis written If/Then/Because with the behavior theory you're testing.
- One decision metric, stated MDE, and power-based sample size and duration.
- Randomization plan, sample ratio mismatch (SRM) checks, and guardrails to avoid breaking things.
- Ramp and kill criteria agreed up front; rollback takes minutes, not hours.
- Pre-agreed analysis: peeking rules, segments, and how you decide ship/hold.
- Underpowered tests that “almost” moved: weeks lost, nothing learned.
- Peeking daily and pivoting mid‑run; novelty effects and regression to the mean bite later.
- Mixed variants (too many changes) so you can't tell what worked.
- No SRM checks or bot filters: your split is 55/45 and nobody notices.
- Measuring the wrong thing: clicks up, revenue down. Guardrails would help.
How do I write a hypothesis that isn't fluff?
If [we remove one step from onboarding], then [activation rate increases by 12%] because [drop‑offs cluster at the second form and users say it feels like paperwork]. If/Then/Because forces you to show your reasoning, not just hope.
What's a reasonable minimum detectable effect (MDE)?
Tie it to business impact and traffic reality. If you need a 1% uplift to matter but your traffic can only power a 6% effect, either change the surface, extend the duration, or don't run it. I learned this the hard way: beautiful tests that taught us nothing.
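To see what "traffic can only power 6%" means in numbers, here is a minimal sketch of the usual normal-approximation sample size for a two-sided two-proportion test. The function name, the 20% baseline in the usage lines, and the defaults (alpha 0.05, power 0.80) are assumptions for illustration, not values from your experiment.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect a relative lift.

    baseline      -- control conversion rate, e.g. 0.20
    relative_mde  -- smallest relative lift worth detecting, e.g. 0.06 for 6%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical 20% baseline: compare a 6% relative MDE with a 1% one.
n_six_percent = sample_size_per_arm(0.20, 0.06)
n_one_percent = sample_size_per_arm(0.20, 0.01)
print(n_six_percent, n_one_percent)
```

Required sample size grows roughly with the inverse square of the MDE, so shrinking the MDE from 6% to 1% costs about 36 times the traffic. That is usually the moment to change the surface or skip the test.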
Can I peek at results?
You can monitor health daily, but don't move the goalposts. Define a peeking schedule and stick to it (e.g., day 3 health, day 7 interim, final at planned N). Otherwise you're just noise‑chasing.
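One simple way to make a peeking schedule enforceable is to split the significance budget across the planned efficacy looks. The sketch below uses a Bonferroni split, which is conservative and is a stand-in here, not a full group-sequential design such as O'Brien-Fleming. It assumes two efficacy looks (the day 7 interim and the final), with the day 3 check limited to health metrics. Function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_c, n_c, conv_t, n_t):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0:  # no conversions (or all conversions) in both arms
        return 1.0
    z = (conv_t / n_t - conv_c / n_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def significant_at_look(p_value, planned_looks, alpha=0.05):
    """Bonferroni rule: each planned look gets alpha / planned_looks."""
    return p_value < alpha / planned_looks


# Interim look with made-up counts: 1000 users per arm.
p = two_proportion_p_value(conv_c=100, n_c=1000, conv_t=150, n_t=1000)
print(significant_at_look(p, planned_looks=2))
```

The point of the design is that the number of looks is fixed before launch. Adding an extra look mid-run means the per-look threshold was wrong from the start.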
How do I avoid SRM and other traps?
Automate SRM (sample ratio mismatch) checks, filter bots, and randomize by hashing a stable key (the user ID, not the session ID). Also: pause overlapping experiments on the same surface unless you have designed the test to measure their interaction.
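Both ideas fit in a few lines. This is a minimal sketch, assuming a string user ID, SHA-256 as the hash, and a chi-square goodness-of-fit test with one degree of freedom for the SRM check. The function names and the 0.001 alert threshold are illustrative assumptions; pick your own threshold before launch.

```python
import hashlib
from math import erfc, sqrt


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministic assignment: the same user always lands in the same arm."""
    # Salting with the experiment name keeps splits independent across tests.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def srm_p_value(control_n: int, treatment_n: int, treatment_share: float = 0.5) -> float:
    """Chi-square test of observed arm counts against the intended split."""
    total = control_n + treatment_n
    expected_t = total * treatment_share
    expected_c = total - expected_t
    chi2 = ((control_n - expected_c) ** 2 / expected_c
            + (treatment_n - expected_t) ** 2 / expected_t)
    return erfc(sqrt(chi2 / 2))  # survival function of chi-square, 1 df


# A 55/45 split on 10,000 users that was meant to be 50/50.
if srm_p_value(5500, 4500) < 0.001:  # assumed alert threshold
    print("SRM detected: investigate before reading any metric")
```

Hashing the user ID rather than the session ID means a returning user never flips arms, which is one of the common causes of a skewed split.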
My exec wants to ramp early because it "looks good". Help?
Pre‑commit to kill and ship criteria. Put the table in the doc before launch. When pressure hits, point to the rules everyone signed off on. It saves relationships, and your data.
What if results are flat?
Extract learning anyway. Check exposure, novelty, and segmentation. Sometimes the variant helps one sub‑segment and hurts another. Treat segment effects you find after the fact as hypotheses, not results, then ship a follow‑up with a bolder change or a narrower audience.
Which metrics should be guardrails vs. secondary?
Guardrails protect the business (latency, error rate, refund rate, support tickets). Secondary metrics explain the story (CTR, time on task). If guardrails burn, you stop. No interpretive dance.
How long should we run an experiment?
At least one full business cycle (weekly patterns matter) and until your powered sample size is reached. Stop‑early rules must be defined a priori; “feels done” is not a rule.
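Both conditions can be combined into one planning number: take the days needed to reach the powered sample, then round up to whole business cycles. A minimal sketch, assuming a 7-day cycle and a steady count of eligible users per day (both are assumptions, as are the function and parameter names):

```python
from math import ceil


def planned_duration_days(n_per_arm, arms, eligible_users_per_day, cycle_days=7):
    """Days to run: enough traffic for the powered sample, in whole cycles."""
    total_needed = n_per_arm * arms
    raw_days = ceil(total_needed / eligible_users_per_day)
    cycles = max(1, ceil(raw_days / cycle_days))  # never less than one cycle
    return cycles * cycle_days


# 10,000 users per arm, two arms, 2,000 eligible users per day:
# 10 days of traffic, rounded up to two full weeks.
print(planned_duration_days(10_000, 2, 2_000))  # prints 14
```

Rounding up to whole weeks keeps every weekday equally represented in both arms, so a strong Monday or a weak Saturday cannot tilt the result.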
Do I need an A/A test?
Run one when the surface, targeting, or tracking changed. It catches randomization bugs before you bet a month on a broken split. I wish I didn't know this from experience.
How do I make the decision obvious at the end?
Put a decision table in the doc: win = ship plan; lose = learnings + next hypothesis; inconclusive = extend or redesign. Add owners and dates so it doesn't die in a slide.
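If it helps to make the table unambiguous, you can write it as a rule before launch. This is a sketch under assumptions: the decision is driven by a confidence interval on the lift in the decision metric (`ci_low`, `ci_high`), positive lift is good, and any guardrail breach overrides everything else. The function name and return strings are hypothetical.

```python
def decide(ci_low: float, ci_high: float, guardrail_breached: bool) -> str:
    """Map the pre-agreed outcomes to an action."""
    if guardrail_breached:
        return "stop: guardrail breached, roll back"
    if ci_low > 0:  # whole interval above zero
        return "win: ship plan"
    if ci_high < 0:  # whole interval below zero
        return "lose: write up learnings and the next hypothesis"
    return "inconclusive: extend or redesign"


print(decide(ci_low=0.01, ci_high=0.05, guardrail_breached=False))
# prints "win: ship plan"
```

Writing the rule down first means the end-of-test meeting is about owners and dates, not about what the numbers mean.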
When to Use
Feature validation and optimization
Pro Tips
- Be specific with your variable inputs for better results
- Review and iterate on the AI output as needed
- Enable web search for the most current information
Expected Output
Comprehensive experiment design document