What is a reasonable minimum detectable effect (MDE)?

Pick the smallest change that matters to the business and that your traffic can power in a sane timeframe. If the business needs a 1% lift to matter but your traffic can only detect 6%, change the surface, extend the duration, or skip the test.
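
To see how far apart "1% matters" and "6% is powered" really are, here is a minimal sketch of the usual normal-approximation sample size for a two-sided test on two conversion rates. The function name, the 10% baseline, and the 5% significance / 80% power defaults are illustrative assumptions, not values from any specific tool.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant (two-sided two-proportion z-test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)           # rate we hope the variant reaches
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical 10% baseline conversion rate.
n_for_6_percent = sample_size_per_variant(0.10, 0.06)
n_for_1_percent = sample_size_per_variant(0.10, 0.01)
print(n_for_6_percent, n_for_1_percent)
```

Required sample size grows roughly with 1/MDE², so shrinking the MDE from 6% to 1% multiplies the traffic you need by well over an order of magnitude.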

How long should we run an A/B test?

At least one full business cycle (cover weekly patterns) and until the powered sample size is reached. Define stop-early and peeking rules a priori to avoid noise-chasing.
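
As a rough sketch, both conditions can be combined into one planned runtime: enough days to reach the powered sample, rounded up to whole business cycles. The parameter names and the 7-day cycle default are assumptions for illustration.

```python
import math


def planned_duration_days(n_per_variant: int, num_variants: int,
                          daily_eligible_users: int, cycle_days: int = 7) -> int:
    """Days needed to reach the powered sample, rounded up to full cycles."""
    total_needed = n_per_variant * num_variants
    raw_days = math.ceil(total_needed / daily_eligible_users)
    # Never run less than one full cycle, even if traffic fills the sample sooner.
    full_cycles = max(1, math.ceil(raw_days / cycle_days))
    return full_cycles * cycle_days


# Hypothetical numbers: 40,000 users per variant, 2 variants, 10,000 eligible users/day.
print(planned_duration_days(40_000, 2, 10_000))
```

If the result is longer than you are willing to wait, that is the signal to revisit the MDE or the surface rather than to stop early.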

How do we avoid SRM and randomization issues?

Hash-randomize on a stable key (user, not session), filter bots, and automate Sample Ratio Mismatch (SRM) checks. Run an A/A test whenever surfaces, targeting, or tracking have changed.
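
One way to hash-randomize on a stable key is sketched below. The function name and the choice to salt the hash with an experiment name (so different experiments get independent splits) are assumptions, not a prescribed scheme.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants: tuple = ("control", "treatment")) -> str:
    """Deterministically map a stable user ID to a variant."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    # Use the first 8 bytes of the SHA-256 digest as a large integer bucket.
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return variants[bucket % len(variants)]


# The same user always lands in the same arm of a given experiment.
print(assign_variant("user-123", "onboarding-step-removal"))
```

Because the assignment depends only on the user ID and experiment name, a returning user sees the same experience across sessions and devices that share the ID.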

What are common experiment design mistakes to avoid?

Underpowered tests, peeking and reacting mid-run, combining multiple changes per variant, no SRM checks, overlapping experiments on the same surface, and measuring the wrong outcome.

Do we need an A/A test?

Run an A/A when you change randomization, targeting, or tracking. It validates pipelines and split balance before you spend weeks on a broken test.

How should we handle novelty effects?

Set minimum runtime to cover initial excitement (or confusion). Track leading indicators separately, and check stability over time before calling a win.
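
A simple stability check, as a sketch: compare the average daily lift in the early window against the rest of the run. The function name, the 7-day early window, and the sample lifts are hypothetical; a large positive gap suggests the early win was partly novelty.

```python
from statistics import mean


def novelty_gap(daily_lifts: list, early_days: int = 7) -> float:
    """Average lift in the first `early_days` minus average lift afterwards."""
    if len(daily_lifts) <= early_days:
        raise ValueError("need data beyond the early window to compare")
    early = mean(daily_lifts[:early_days])
    late = mean(daily_lifts[early_days:])
    return early - late


# Hypothetical run: 10% lift in week one, settling to 2% in week two.
print(novelty_gap([0.10] * 7 + [0.02] * 7))
```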

A/B Test & Experiment Designer

Q: How do I write a real hypothesis, not fluff?

Use If/Then/Because. Example: If we remove one onboarding step, then activation increases by 12% because drop-offs cluster at the second form and users describe it as “paperwork”. Force yourself to state the behavior theory.

Q: Can I peek at results?

Monitor health daily but don't move the goalposts. Pre-define interim looks (e.g., day 3 health, day 7 interim) and a final analysis point. Changing rules mid-run biases outcomes.

Q: How do we make the final decision obvious?

Add a decision table upfront: win = ship plan; lose = learnings + next hypothesis; inconclusive = extend or redesign. Include owners and dates so actions happen.

Design rigorous experiments with hypothesis, sample size, and analysis plan

analysis | Popular | intermediate | Hypothesis Testing | Statistical Analysis | Experimentation | 1200-1600 words


You are a Senior Product Manager designing a rigorous A/B test for [Feature or Change to Test] measuring [Primary Success Metric].

Structure your experiment design:

## 1. HYPOTHESIS FORMATION

### Primary Hypothesis
**If** [specific change]
**Then** [expected outcome with specific metric movement]
**Because** [underlying behavioral/psychological reasoning]

### Null Hypothesis
State what you're testing against (no effect or current state)

### Success Criteria
- Minimum detectable effect: [X% improvement needed]
- Statistical significance: [typically 95% confidence]
- Practical significance: [Business impact threshold]

## 2. EXPERIMENT DESIGN

### Test Architecture
- **Control (A)**: [Current experience description]
- **Variant(s) (B, C...)**: [Changed experience description]
- **Key differences**: [Bullet list of what's different]

### Target Audience
- **Inclusion criteria**: [Who's eligible]
- **Exclusion criteria**: [Who to exclude and why]
- **Segmentation plans**: [Subgroups to analyze separately]

### Randomization Strategy
- Method: [User ID hash, session-based, etc.]
- Split: [50/50, 80/20, etc. with justification]
- Stratification: [If needed for balance]

## 3. SAMPLE SIZE & DURATION

### Statistical Power Calculation
- Baseline rate: [Current metric value]
- Minimum detectable effect: [Smallest meaningful change]
- Power: [Typically 80%]
- Significance level: [Typically 5%]
- **Required sample size**: [N per variant]
- **Estimated duration**: [Days/weeks based on traffic]

### Duration Considerations
- Weekly/seasonal patterns to account for
- Minimum runtime to avoid novelty effects
- Maximum runtime to avoid opportunity cost

## 4. METRICS FRAMEWORK

### Primary Metrics
- **Decision metric**: [The one metric that determines success]
- **Target movement**: [X% with confidence interval]

### Secondary Metrics
- Supporting metrics: [Related positive indicators]
- Counter metrics: [What might get worse]
- Guardrail metrics: [What must not degrade]

### Leading Indicators
- Early signals: [Metrics that move first]
- Diagnostic metrics: [Help explain why]

## 5. RISK MITIGATION

### Potential Risks
- **Technical risks**: [Performance, bugs, edge cases]
- **User experience risks**: [Confusion, frustration]
- **Business risks**: [Revenue impact, brand perception]

### Mitigation Strategies
- Ramp plan: [Start with X% of traffic]
- Kill criteria: [When to stop early]
- Rollback plan: [How to revert quickly]
- Communication plan: [If issues arise]

## 6. ANALYSIS PLAN

### Pre-Test Analysis
- A/A test validation: [Ensure randomization works]
- Power analysis confirmation
- Baseline metric stability check

### During Test Monitoring
- Daily health checks: [What to monitor]
- Peeking strategy: [When/how to check results]
- SRM (Sample Ratio Mismatch) detection

### Post-Test Analysis
- Primary analysis: [ITT, per-protocol, etc.]
- Segment analysis: [User cohorts to examine]
- Novelty effect assessment
- Long-term impact projection

## 7. DECISION FRAMEWORK

### Success Scenario
- Ship criteria: [Specific conditions]
- Rollout plan: [Phased or full]
- Follow-up experiments: [What to test next]

### Failure Scenario
- Learning extraction: [What we learned]
- Alternative approaches: [Other solutions to try]
- Hypothesis refinement: [How to adjust]

### Inconclusive Scenario
- Extension criteria: [When to run longer]
- Additional data needs: [What would help decide]
- Default decision: [Ship or don't ship]

## 8. IMPLEMENTATION CHECKLIST

Pre-Launch:
- [ ] Hypothesis documented and reviewed
- [ ] Sample size calculation verified
- [ ] Tracking implementation tested
- [ ] Randomization validated with A/A test
- [ ] Rollback plan documented

During Test:
- [ ] Daily monitoring dashboard live
- [ ] SRM checks automated
- [ ] Stakeholder updates scheduled

Post-Test:
- [ ] Results analysis peer-reviewed
- [ ] Learnings documented
- [ ] Next steps communicated

Include specific numbers, clear success criteria, and confidence levels throughout.

## 🔍 Web Search Enhancement

**Leverage current web data to strengthen this analysis:**

1. **Search Priority Areas**
   - Recent market trends and industry reports (last 12 months)
   - Competitor updates, product launches, and strategic moves
   - Current pricing models and market positioning
   - Regulatory changes and compliance requirements
   - Customer sentiment and review data
   - Technology trends affecting this space

2. **Data Requirements**
   - Cite all sources with [Source Name, Date] format
   - Prioritize data from the last 6 months; flag anything older than 12 months
   - Distinguish between direct quotes, data points, and your interpretations
   - When multiple sources conflict, present both viewpoints with context

3. **Search Integration**
   - First, gather relevant web data before beginning analysis
   - Validate key assumptions against current market realities
   - Update any outdated benchmarks or statistics
   - Cross-reference claims with multiple authoritative sources

4. **Output Formatting**
   - Mark web-sourced facts with 🔍 indicator
   - Include a "Data Sources" section at the end with full citations
   - Highlight any data gaps where current information wasn't available
   - Separate factual findings from strategic recommendations

**Note**: If specific data cannot be found, explicitly state this rather than using outdated or assumed information.

## Important Guidelines

### Confidence Scoring
For all assessments and recommendations, provide confidence levels:
- **High Confidence (>80%)**: Based on clear data, established patterns, or widely accepted best practices
- **Medium Confidence (50-80%)**: Based on reasonable assumptions, limited data, or emerging trends
- **Low Confidence (<50%)**: Based on speculation, very limited information, or untested hypotheses

### Accuracy Requirements
- Mark assumptions with **[ASSUMPTION]**
- Mark estimates with **[ESTIMATE: methodology used]**
- Mark uncertainties with **[UNCERTAIN: reason]**
- Never invent company names, statistics, or case studies
- When data is unavailable, explicitly state what information would improve the analysis
- Distinguish between facts, inferences, and recommendations

### Source Attribution
- General knowledge: "Based on industry standards..."
- Inferences: "This suggests that..."
- Speculation: "One possibility is..."
- Best practices: "Common approaches include..."

What Makes a Good Experiment Design

• Hypothesis written If/Then/Because with the behavior theory you're testing.
• One decision metric, stated MDE, and power-based sample size and duration.
• Randomization plan, SRM checks, and guardrails to avoid breaking things.
• Ramp and kill criteria agreed up front; rollback takes minutes, not hours.
• Pre-agreed analysis: peeking rules, segments, and how you decide ship/hold.

Common Experiment Design Mistakes

• Underpowered tests that “almost” moved—weeks lost, nothing learned.
• Peeking daily and pivoting mid‑run; novelty effects and regression to the mean bite later.
• Mixed variants (too many changes) so you can't tell what worked.
• No SRM or bot filters—your split is 55/45 and nobody notices.
• Measuring the wrong thing: clicks up, revenue down. Guardrails would help.

Questions PMs Actually Ask (Experiment Design)

How do I write a hypothesis that isn't fluff?

If [we remove one step from onboarding], then [activation rate increases by 12%] because [drop‑offs cluster at the second form and users say it feels like paperwork]. If/Then/Because forces you to show your reasoning, not just hope.

What's a reasonable minimum detectable effect (MDE)?

Tie it to business impact and traffic reality. If you need a 1% uplift to matter but traffic can only power a 6% effect, either change the surface, extend duration, or don't run it. I learned this the hard way: beautiful zero‑learning tests.

Can I peek at results?

You can monitor health daily, but don't move the goalposts. Define a peeking schedule and stick to it (e.g., day 3 health, day 7 interim, final at planned N). Otherwise you're just noise‑chasing.
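
A peeking schedule can be written down as data before launch. The sketch below splits the significance level evenly across the looks that test the decision metric (a Bonferroni split, which is conservative; group-sequential boundaries are a common alternative). The class, the function names, and day 14 as the final look are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Look:
    day: int
    purpose: str                 # "health", "interim", or "final"
    tests_decision_metric: bool  # health looks do not spend any alpha


SCHEDULE = (
    Look(3, "health", False),
    Look(7, "interim", True),
    Look(14, "final", True),     # assumed day on which planned N is reached
)


def threshold_per_look(schedule=SCHEDULE, alpha: float = 0.05) -> float:
    """Split alpha evenly across looks that test the decision metric."""
    decision_looks = sum(look.tests_decision_metric for look in schedule)
    return alpha / decision_looks


def may_call_result(day: int, p_value: float, schedule=SCHEDULE,
                    alpha: float = 0.05) -> bool:
    """Only a pre-planned decision look with a small enough p-value counts."""
    look = next((l for l in schedule if l.day == day), None)
    if look is None or not look.tests_decision_metric:
        return False
    return p_value < threshold_per_look(schedule, alpha)


print(may_call_result(5, 0.0001))  # unplanned peek, never a decision
print(may_call_result(7, 0.01))    # planned interim look
```

An unplanned look on day 5 cannot trigger a decision no matter how good the number looks, which is the whole point of writing the schedule down first.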

How do I avoid SRM and other traps?

Automate SRM checks, filter bots, and hash randomize on a stable key (user, not session). Also: pause overlapping experiments on the same surface unless you have an interaction design.
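
An automated SRM check for a two-arm test can be as small as a chi-square goodness-of-fit test with one degree of freedom. This is a sketch using only the standard library; the function name is an assumption, and the strict 0.001 alert threshold is a convention you should confirm for your own platform.

```python
import math


def srm_p_value(control_n: int, treatment_n: int,
                expected_control_share: float = 0.5) -> float:
    """p-value that the observed split matches the intended split (two arms)."""
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


SRM_ALERT_THRESHOLD = 0.001  # assumed alerting threshold

# A 55/45 split on 10,000 users that was meant to be 50/50.
p = srm_p_value(5_500, 4_500)
print(p, p < SRM_ALERT_THRESHOLD)
```

A p-value below the threshold means the split itself is suspect, so investigate assignment, bot filtering, and logging before reading any metric.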

My exec wants to ramp early because it "looks good". Help?

Pre‑commit to kill and ship criteria. Put the table in the doc before launch. When pressure hits, point to the rules we all signed. It saves relationships—and your data.

What if results are flat?

Extract learning anyway. Check exposure, novelty, and segmentation. Sometimes the variant helps a sub‑segment or hurts another. Ship a follow‑up with a bolder change or a narrower audience.

Which metrics should be guardrails vs. secondary?

Guardrails protect the business (latency, error rate, refund rate, support tickets). Secondary metrics explain the story (CTR, time on task). If guardrails burn, you stop. No interpretive dance.
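
A stop rule for guardrails can be sketched as a list of pre-agreed tolerances. The class, field names, and thresholds below are hypothetical, every guardrail is assumed to be "lower is better", and the check compares point estimates only, so a real version would also account for noise.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    name: str
    control_value: float
    treatment_value: float
    max_relative_increase: float  # e.g. 0.05 allows up to +5% over control


def breached_guardrails(guardrails: list) -> list:
    """Return names of guardrails where treatment degraded past its tolerance."""
    breached = []
    for g in guardrails:
        if g.control_value <= 0:
            raise ValueError(f"{g.name}: control value must be positive")
        relative_change = (g.treatment_value - g.control_value) / g.control_value
        if relative_change > g.max_relative_increase:
            breached.append(g.name)
    return breached


checks = [
    Guardrail("latency_ms", 200.0, 230.0, 0.10),    # +15%, over the 10% tolerance
    Guardrail("error_rate", 0.0100, 0.0101, 0.05),  # about +1%, within tolerance
]
print(breached_guardrails(checks))
```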

How long should we run an experiment?

At least one full business cycle (weekly patterns matter) and until your powered sample size is reached. Stop‑early rules must be defined a priori; “feels done” is not a rule.

Do I need an A/A test?

Run one when the surface, targeting, or tracking changed. It catches randomization bugs before you bet a month on a broken split. I wish I didn't know this from experience.
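
To know what a healthy A/A looks like, it helps to simulate one. In the sketch below both arms share the same true rate, so about 5% of runs should come out "significant" at a 5% level. A real A/A runs through your actual assignment and tracking; this simulation, with its assumed names and parameters, only shows the expected baseline.

```python
import math
import random
from statistics import NormalDist


def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_false_positive_rate(n_per_arm: int = 1000, base_rate: float = 0.10,
                           runs: int = 400, alpha: float = 0.05,
                           seed: int = 0) -> float:
    """Share of simulated A/A tests that look 'significant' by chance."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        conv_a = sum(rng.random() < base_rate for _ in range(n_per_arm))
        conv_b = sum(rng.random() < base_rate for _ in range(n_per_arm))
        if p_value(conv_a, n_per_arm, conv_b, n_per_arm) < alpha:
            hits += 1
    return hits / runs


print(aa_false_positive_rate())
```

If your real A/A tests flag differences far more often than the significance level, suspect the split or the tracking before suspecting the statistics.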

How do I make the decision obvious at the end?

Put a decision table in the doc: win = ship plan; lose = learnings + next hypothesis; inconclusive = extend or redesign. Add owners and dates so it doesn't die in a slide.
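
The decision table can be written as data plus one classification rule. The sketch below assumes the decision metric is "higher is better" and classifies on a confidence interval for the lift; the names, the rule, and the placeholder owners and dates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    plan: str
    owner: str
    due: str


DECISION_TABLE = {
    "win": Action("ship plan", "owner TBD", "date TBD"),
    "lose": Action("learnings + next hypothesis", "owner TBD", "date TBD"),
    "inconclusive": Action("extend or redesign", "owner TBD", "date TBD"),
}


def classify(ci_low: float, ci_high: float) -> str:
    """Classify the result from the confidence interval of the lift."""
    if ci_low > 0:
        return "win"           # whole interval above zero
    if ci_high < 0:
        return "lose"          # whole interval below zero
    return "inconclusive"      # interval straddles zero


outcome = classify(-0.01, 0.03)
print(outcome, DECISION_TABLE[outcome].plan)
```

Filling in real owners and dates before launch is what keeps the outcome from dying in a slide.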

How to Use This Prompt