Complete Guide

How to Run an A/B Test

Follow these 6 steps to run a statistically valid A/B test. Covers hypothesis formation, sample size, test design, peeking bias, and post-analysis for product managers.

1
Define Your Hypothesis and Success Metric

A well-formed hypothesis is the difference between a test that produces learning and one that produces confusion. Every A/B test starts with a falsifiable hypothesis: what you are changing, why you expect it to move the metric, and how you will measure it.

Use the structure: "We believe that [change] will cause [outcome] because [reason], measured by [metric]."
Example: "We believe that moving the CTA button above the fold will increase free trial sign-up rate because users will see the primary action before scrolling, measured by 7-day free trial conversion rate."
Choose one primary metric. Secondary metrics are monitored but do not determine the winner.
Primary metrics for product tests: conversion rate, activation rate, retention (Day 7, Day 30), revenue per user, task completion rate.
Avoid vanity metrics like page views or time on page unless they directly tie to your business outcome.
Specify the minimum detectable effect (MDE): the smallest improvement worth deploying. If a 1% lift would not justify the engineering cost but a 5% lift would, set your MDE at 5%. The MDE also determines your required sample size.

Formula

Hypothesis: If [change], then [metric] will [increase/decrease] by [X]% because [reason]

Pro tip: Write your hypothesis before you look at any data. Post-hoc hypothesis formation (writing it after seeing early results) is a form of p-hacking that invalidates the test. Document the hypothesis, MDE, and expected direction in a shared experiment log before any code is shipped.
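
One way to make this pre-registration concrete is a structured log entry written before launch. The sketch below is only an illustration of what such a record could hold; the class name `ExperimentLogEntry` and every field name are hypothetical and not tied to any particular tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after the entry is written
class ExperimentLogEntry:
    experiment_id: str
    change: str              # what is being changed
    reason: str              # why the change should move the metric
    primary_metric: str      # the single metric that decides the winner
    expected_direction: str  # "increase" or "decrease"
    mde_relative: float      # smallest relative lift worth deploying, e.g. 0.05
    logged_on: date          # should predate any exposure data

    def hypothesis(self) -> str:
        """Render the entry in the 'If [change], then [metric] will ...' template."""
        return (
            f"If {self.change}, then {self.primary_metric} will "
            f"{self.expected_direction} by at least {self.mde_relative:.0%} "
            f"because {self.reason}"
        )


entry = ExperimentLogEntry(
    experiment_id="cta-above-fold",
    change="we move the CTA button above the fold",
    reason="users will see the primary action before scrolling",
    primary_metric="7-day free trial conversion rate",
    expected_direction="increase",
    mde_relative=0.05,
    logged_on=date(2024, 1, 15),  # placeholder date for illustration
)
print(entry.hypothesis())
```

Freezing the record is a design choice that mirrors the rule above: once the hypothesis is logged, it should not be quietly rewritten after results come in.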

2
Calculate Required Sample Size

Running a test without calculating the required sample size first is the most common A/B testing mistake. Under-powered tests produce unreliable results: they miss real improvements (false negatives), and the results that do reach significance are more likely to be false positives or to overstate the effect.

Set your statistical power at 80%. That means an 80% chance of detecting a real effect at your specified MDE.
Set your significance level (alpha) at 5% (0.05). That means a 5% probability of declaring a winner when there is no real difference (a false positive).
You need your current baseline conversion rate (from historical data) and your minimum detectable effect (MDE).
Sample size formula inputs: baseline conversion rate (p), desired MDE as an absolute difference (E), Z-score for 80% power (0.84), Z-score for 95% confidence, two-sided (1.96).
Example: Baseline conversion rate = 3%, MDE = 0.5 percentage points (a 17% relative lift), required sample size per variant ≈ 18,000 users. Depending on whether the test is one- or two-sided and how the variance is approximated, calculators report roughly 16,000 to 20,000.
At 1,000 daily users split 50/50, each variant gets 500 users per day, so this test takes about 36 days to reach 18,000 per variant. Longer than most teams expect.

Formula

Sample Size per Variant = 2 x (Z_power + Z_alpha)^2 x p x (1-p) / (MDE)^2
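
A minimal Python sketch of the sample size calculation, assuming a two-sided test on two equal-sized groups and the normal approximation. Under those assumptions the calculation carries a factor of 2 for the sampling noise in both groups and uses Z-scores of about 1.96 (alpha = 0.05, two-sided) and 0.84 (80% power). The function and parameter names are illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline: float, mde_abs: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant for a two-sided test of two proportions.

    baseline: current conversion rate, e.g. 0.03
    mde_abs:  minimum detectable effect as an absolute difference, e.g. 0.005
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance = baseline * (1 - baseline)           # Bernoulli variance at the baseline rate
    # The factor 2 accounts for sampling noise in both groups.
    return ceil(2 * (z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


n = sample_size_per_variant(baseline=0.03, mde_abs=0.005)
print(n)  # a little over 18,000 for the 3% baseline / 0.5 point example
```

Because the MDE is squared in the denominator, halving it to 0.25 percentage points roughly quadruples the result. Dedicated calculators may differ by several percent because they use slightly different variance approximations.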

Pro tip: Use the PM Toolkit A/B Test Pre-Test Planner to get your exact sample size and test duration. Most teams underestimate required sample sizes because they set MDE too optimistically. If the required duration exceeds 60-90 days, reconsider whether the test is feasible or whether to raise your MDE threshold.

3
Design Control and Variant with a Single Variable

A/B testing depends on isolation: change exactly one variable between control and variant so that any observed difference can be causally attributed to that change. Multi-variable changes undermine causal inference.

Control (A): The current experience, unchanged. This is your baseline.
Variant (B): The current experience with exactly one modification. A different button color, headline copy, CTA text, layout, or feature toggle.
Resist the temptation to "while we are at it" add multiple changes. If you change the button color AND the headline AND add a trust badge, you cannot know which change drove the result.
For major redesigns where isolation is impractical, use multivariate testing (MVT). Note that MVT requires far larger sample sizes.
Ensure technical implementation is correct: the randomization is at the user level (not session level), both variants load at the same speed, and the experiment is logged for every user who is exposed.
Check for SRM (Sample Ratio Mismatch) before analysis: if your target was 50/50 split but you got 55/45, something is wrong with your randomization.
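
Whether a split like 55/45 signals a problem depends on how many users are behind it, so the SRM check is usually run as a chi-square goodness-of-fit test. The sketch below uses only the standard library and relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal. The function name and the 0.001 alert threshold in the usage lines are assumptions for illustration, not a fixed standard.

```python
from math import erfc, sqrt


def srm_p_value(n_control: int, n_variant: int,
                expected_share_control: float = 0.5) -> float:
    """p-value of a chi-square test that observed counts match the planned split."""
    total = n_control + n_variant
    exp_control = total * expected_share_control
    exp_variant = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_variant - exp_variant) ** 2 / exp_variant)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


SRM_ALERT_THRESHOLD = 0.001  # assumed threshold; strict, to avoid false alarms

print(srm_p_value(5500, 4500) < SRM_ALERT_THRESHOLD)  # 55/45 on 10,000 users
print(srm_p_value(55, 45) < SRM_ALERT_THRESHOLD)      # 55/45 on 100 users
```

The same 55/45 ratio is strong evidence of a randomization bug at 10,000 users but entirely compatible with chance at 100 users.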

Formula

Variant = Control + [Single Variable Change]. All else must be identical.

Pro tip: Run an A/A test (two identical control groups) for one week before launching a major test. If the A/A test shows a statistically significant difference, suspect a bug in your randomization or logging and investigate before running any real tests. A single significant A/A result can still be chance (about 1 in 20 at alpha = 0.05), so it is repeated A/A tests that establish your false positive rate in practice.

4
Randomize Users and Launch the Test

Proper randomization is what turns a comparison into a causal experiment. Poor randomization (the kind that allows systematic differences between groups) is one of the most common sources of invalid A/B test results in product teams.

Assign users to variants using a hash of their user ID, not session IDs or cookies. User-level assignment ensures a returning visitor always sees the same variant.
Use a 50/50 split for maximum statistical power. Unequal splits (e.g., 90/10) require significantly larger sample sizes to detect the same effect.
Implement a holdout group: a small percentage (5-10%) that sees neither variant. This is your experiment-free baseline for long-term effects.
Log every exposure: user ID, variant assigned, timestamp, and session context. This data is needed for post-analysis and debugging.
Do not start counting results until the variant is fully deployed. Users who saw the control before the variant launched should be excluded.
Use the PM Toolkit A/B Test suite to track exposure counts and monitor for sample ratio mismatches as the test runs.
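
The cost of an unequal split can be worked out directly. Assuming both groups have the same outcome variance, the variance of the measured difference is proportional to 1/n_control + 1/n_variant, which is smallest when the groups are equal. The sketch below (function name is illustrative) returns how much more total traffic a given split needs to match the power of a 50/50 split.

```python
def total_sample_inflation(variant_share: float) -> float:
    """Total sample needed relative to a 50/50 split, for the same power.

    Assumes equal outcome variance in both groups, so the variance of the
    difference is proportional to 1/n_control + 1/n_variant.
    """
    if not 0 < variant_share < 1:
        raise ValueError("variant_share must be strictly between 0 and 1")
    return 1 / (4 * variant_share * (1 - variant_share))


print(round(total_sample_inflation(0.5), 2))  # 1.0: the 50/50 reference point
print(round(total_sample_inflation(0.1), 2))  # 2.78: a 90/10 split needs about 2.8x the traffic
```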

Formula

Variant Assignment = hash(user_id + experiment_id) mod 2 (deterministic and user-stable)
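
A minimal sketch of hash-based assignment in Python. The choice of SHA-256, the `experiment_id:user_id` key format, and the function name are assumptions for illustration; any stable, well-mixed hash would serve the same purpose.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, n_variants: int = 2) -> int:
    """Return a stable variant index (0 = control, 1 = variant) for this user."""
    # Including the experiment ID means the same user can land in different
    # buckets in different experiments.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer, then reduce to a bucket index.
    return int.from_bytes(digest[:8], "big") % n_variants


print(assign_variant("user-123", "cta-above-fold"))
```

Python's built-in `hash()` is deliberately avoided here: for strings it is salted per process by default, so it would not give a returning user the same variant across servers or restarts.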

Pro tip: Network effects can contaminate your control group in social or collaborative products. If user behavior influences other users (e.g., a collaboration feature), standard A/B testing is not valid. Use cluster randomization (randomize by team or organization, not by individual user) to preserve the integrity of the experiment.

5
Monitor the Test Without Peeking

Peeking (checking results and stopping a test early when it looks significant) is one of the worst A/B testing mistakes. It inflates your false positive rate, producing confident decisions based on noise rather than signal.

Peeking bias explained: If there is no real difference and you check results at 30 interim points, stopping the first time p < 0.05, your actual false positive rate is roughly 25-30%, not 5%. You will call a winner that is not real more than 1 in 4 times.
Set a predetermined end date based on your sample size calculation. Do not stop until you have reached this date AND the required sample size.
Monitor guardrail metrics (latency, error rate, revenue per user, core engagement), not your primary test metric. If guardrails degrade significantly, stopping is justified.
If your business genuinely requires early stopping (time-sensitive launches, safety issues), use sequential testing methods like SPRT (Sequential Probability Ratio Test) or mSPRT, which control false positive rate while allowing valid early stopping.
Run the test for at least one full business cycle (typically 7-14 days) to capture weekday/weekend behavioral differences.
Document the pre-committed end date in your experiment log. Discipline here is what separates a learning culture from a team that runs tests for show.
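
The inflation from peeking is easy to reproduce with a small simulation. The sketch below is a simplified model, not a replay of any real experiment: it assumes no true effect, equal-sized batches of data between looks, and a z-statistic that behaves like a standardized random walk. The function name and defaults are illustrative.

```python
import random
from math import sqrt


def false_positive_rate(n_looks: int, n_sims: int = 2000, seed: int = 0,
                        z_crit: float = 1.96) -> float:
    """Share of simulated A/A tests that cross |z| > z_crit at any look.

    Each look adds one batch, modeled as an independent standard normal
    increment to a running sum, so the z-statistic at look k is sum / sqrt(k).
    """
    rng = random.Random(seed)
    false_wins = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)
            if abs(running_sum / sqrt(look)) > z_crit:
                false_wins += 1  # stopped and declared a winner that is not real
                break
    return false_wins / n_sims


for looks in (1, 5, 30):
    print(looks, false_positive_rate(looks))
```

With a single look at the end, the rate stays near the nominal 5%. Each additional chance to stop pushes it higher, which is why the end date is fixed in advance.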

Formula

Test Duration (days) = Sample Size Required per Variant / Daily Traffic per Variant. Run until BOTH criteria are met: the required sample size and at least one full business cycle.
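
A short sketch of the duration calculation that enforces both stopping criteria at once. The function name is illustrative, and the 14-day default for the business cycle is an assumption taken from the upper end of the 7-14 day range; set it to match your own cycle.

```python
from math import ceil


def planned_duration_days(sample_per_variant: int, daily_users: int,
                          n_variants: int = 2, min_days: int = 14) -> int:
    """Days to run: long enough for the sample size AND a full business cycle."""
    daily_per_variant = daily_users / n_variants  # assumes an even traffic split
    days_for_sample = ceil(sample_per_variant / daily_per_variant)
    return max(days_for_sample, min_days)


print(planned_duration_days(18000, 1000))  # 36: sample size is the binding constraint
print(planned_duration_days(2000, 1000))   # 14: business cycle is the binding constraint
```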

Pro tip: Use dashboard alerting on guardrail metrics only during the test. Hide the primary metric from daily dashboards for experiment owners during the run to reduce the temptation to peek. Some teams adopt a strict "two-key" policy where two people must agree before a test is stopped early. Apply the same rigor you would to changing any high-stakes business process.

6
Analyze Results with Statistical Significance Testing

Post-test analysis is where you determine whether observed differences reflect a real effect or random chance. Statistical significance testing replaces gut instinct with probability-based evidence.

Calculate the p-value: the probability of observing your result (or more extreme) if the null hypothesis (no difference) were true. A p-value below 0.05 means you reject the null hypothesis.
Calculate the 95% confidence interval for the effect: a range produced by a procedure that captures the true effect size in 95% of repeated experiments. If the CI does not include zero, the result is statistically significant at the 5% level.
Check practical significance alongside statistical significance: a 0.1% lift with p = 0.001 is statistically significant but may not be worth deploying if the implementation cost is high.
Analyze secondary metrics and guardrails: did the winning variant improve your primary metric without degrading other important signals?
Segment your results: break down by device type, user tenure, geography, and plan tier. An aggregate win can mask a loss in a specific segment, and vice versa.
Document and share: write up the experiment hypothesis, method, results, and decision in your experiment log regardless of outcome. Failed tests are as valuable as wins. They prevent re-running experiments that do not work.

Formula

p-value < 0.05 AND confidence interval excludes 0 → Statistically significant result
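
For conversion-rate metrics, both quantities can be computed from raw counts with a two-proportion z-test. The sketch below is one common textbook formulation, assuming large samples and independent users; it uses a pooled standard error for the p-value and an unpooled one for the interval, so the two can disagree in borderline cases. The function name and the example counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(conv_a: int, n_a: int, conv_b: int, n_b: int,
                        alpha: float = 0.05):
    """Return (absolute lift, two-sided p-value, confidence interval) for B minus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # p-value: pooled standard error under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval: unpooled standard error around the observed lift.
    se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_diff, diff + z_crit * se_diff)
    return diff, p_value, ci


# Illustrative counts: 3.0% vs 3.5% conversion on 18,000 users per variant.
diff, p_value, ci = two_proportion_test(540, 18000, 630, 18000)
print(f"lift={diff:.4f}  p={p_value:.4f}  CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

The interval is in absolute terms (percentage points as a fraction), which makes it easy to compare against the MDE when judging practical significance.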

Pro tip: Use the PM Toolkit A/B Test Post-Analysis tool to calculate p-values, confidence intervals, and effect sizes from your raw counts. A statistically significant result is not a deployment mandate. Factor in implementation complexity, technical debt, downstream effects, and strategic alignment before shipping a variant. Statistical significance is one input to a business decision, not the decision itself.

Frequently Asked Questions

How long should an A/B test run?

An A/B test should run until it reaches both its pre-calculated sample size and at least one full business cycle (7-14 days minimum to capture weekday/weekend behavioral variation). Never stop a test early because it looks significant. That is peeking bias and it inflates your false positive rate. Use the PM Toolkit A/B Test Pre-Test Planner to calculate your exact required duration based on your traffic and minimum detectable effect.

What is statistical significance and why does it matter?

Statistical significance (typically p < 0.05 at 95% confidence) means that if there were truly no difference between your control and variant, the probability of observing a result as extreme as yours by random chance alone is less than 5%. It does not mean your result is correct with 95% probability. It means you have enough evidence to reject the null hypothesis of no difference. Always combine statistical significance with practical significance (is the effect size worth acting on?) before making a deployment decision.

What is the minimum sample size for an A/B test?

There is no universal minimum. Required sample size depends on your baseline conversion rate, desired minimum detectable effect (MDE), statistical power (typically 80%), and significance level (typically 5%). A test with a 3% baseline conversion rate detecting a 0.5 percentage point lift requires roughly 16,000 to 20,000 users per variant, depending on whether the test is one- or two-sided and how the variance is approximated. Sample size scales with the inverse square of the MDE, so halving the MDE roughly quadruples the required sample. Use the PM Toolkit sample size calculator to get an exact number for your specific situation.

Can I run multiple A/B tests at the same time?

Yes, with caution. Simultaneous tests are valid as long as the tested changes are on different parts of the user flow and are unlikely to interact. If two tests could affect the same user behavior or metric, run them sequentially or use a factorial design. When running multiple tests, keep strict user-level assignment consistency: a user in experiment A should have a stable, random assignment in experiment B independent of their experiment A assignment.

What is peeking bias in A/B testing?

Peeking bias occurs when you check A/B test results before the pre-planned end date and stop the test early when it reaches significance. Because standard fixed-horizon significance tests assume a single analysis at a fixed sample size, checking results at multiple interim points inflates the false positive rate. A test with 20 interim checks has an effective false positive rate of roughly 25%, not 5%. To avoid peeking bias, commit to an end date before launching, monitor only guardrail metrics during the test, and use sequential testing methods (SPRT, mSPRT) if early stopping is a business requirement.