Follow these 6 steps to run a statistically valid A/B test. Covers hypothesis formation, sample size, test design, peeking bias, and post-analysis for product managers.
A well-formed hypothesis is the difference between a test that produces learning and one that produces confusion. Every A/B test starts with a falsifiable hypothesis: what you are changing, why you expect it to move the metric, and how you will measure it.
Formula
Hypothesis: If [change], then [metric] will [increase/decrease] by [X]% because [reason]
Pro tip: Write your hypothesis before you look at any data. Post-hoc hypothesis formation (writing it after seeing early results) is a form of p-hacking that invalidates the test. Document the hypothesis, the minimum detectable effect (MDE), and the expected direction in a shared experiment log before any code is shipped.
Running a test without calculating the required sample size first is the most common A/B testing mistake. Under-powered tests produce unreliable results: false positives (declaring a winner that is not real) and false negatives (missing a real improvement).
Formula
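Sample Size per Variant = 2 x (Z_alpha/2 + Z_power)^2 x p x (1-p) / (MDE)^2
Here p is the baseline conversion rate and MDE is the absolute difference you want to detect (for example, 0.005 for half a percentage point). The factor of 2 is there because both control and variant contribute sampling variance.
Pro tip: Use the PM Toolkit A/B Test Pre-Test Planner to get your exact sample size and test duration. Most teams underestimate required sample sizes because they set MDE too optimistically. If the required duration exceeds 60-90 days, reconsider whether the test is feasible or whether to raise your MDE threshold.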
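To make the arithmetic concrete, here is a minimal Python sketch of the common two-sample approximation for conversion rates, which carries a factor of 2 because both groups contribute variance. The function name and the default values (two-sided 5% significance, 80% power) are assumptions for illustration. This is not the PM Toolkit's implementation.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided test on conversion rate.

    baseline: control conversion rate, e.g. 0.03 for 3%
    mde_abs:  absolute minimum detectable effect, e.g. 0.005 for 0.5 points
    """
    # Critical values of the standard normal distribution.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)

    # Variance of a single Bernoulli outcome at the baseline rate.
    variance = baseline * (1 - baseline)

    # Factor of 2: both control and variant contribute sampling variance.
    n = 2 * (z_alpha + z_power) ** 2 * variance / mde_abs ** 2
    return math.ceil(n)


if __name__ == "__main__":
    print(sample_size_per_variant(0.03, 0.005))
```

With a 3% baseline and a 0.5 percentage point MDE, this approximation lands at roughly 18,000 users per variant. Halving the MDE roughly quadruples the requirement, which is why an optimistic MDE is so costly.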
A/B testing depends on isolation: change exactly one variable between control and variant so that any observed difference can be causally attributed to that change. Multi-variable changes undermine causal inference.
Formula
Variant = Control + [Single Variable Change]. All else must be identical.
Pro tip: Run an A/A test (two identical control groups) for one week before launching a major test. If the A/A test shows a statistically significant difference, suspect a bug in your randomization or logging. About 5% of A/A tests will cross p < 0.05 by chance alone, so investigate and rerun before running any real tests. A/A tests also establish your expected false positive rate in practice.
Proper randomization is what makes a comparison a causal experiment. Poor randomization (the kind that allows systematic differences between groups) is a common source of invalid A/B test results in product teams.
Formula
Variant Assignment = hash(user_id + experiment_id) mod 2 (deterministic and user-stable)
Pro tip: Network effects can contaminate your control group in social or collaborative products. If user behavior influences other users (e.g., a collaboration feature), standard A/B testing is not valid. Use cluster randomization (randomize by team or organization, not by individual user) to preserve the integrity of the experiment.
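One way to sketch hash-based assignment in Python is shown below. The function name, the choice of SHA-256, and the key format are assumptions for illustration. The sketch avoids Python's built-in hash(), which is salted per process for strings and so would not keep a user in the same group across restarts.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, num_variants: int = 2) -> int:
    """Return a stable bucket index (0 = control, 1 = variant) for a user."""
    # Including the experiment ID means the same user is bucketed
    # independently in each experiment.
    key = f"{experiment_id}:{user_id}".encode("utf-8")

    # A cryptographic hash is deterministic across machines and restarts.
    digest = hashlib.sha256(key).digest()

    # Interpret the first 8 bytes as an integer, then reduce to a bucket.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket % num_variants


if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign_variant(uid, "checkout-button-test"))
```

For cluster randomization, the same function can be called with a team or organization ID in place of the user ID, so that everyone in a cluster shares one assignment.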
Peeking (checking results and stopping a test early when it looks significant) is one of the worst A/B testing mistakes. It inflates your false positive rate, producing confident decisions based on noise rather than signal.
Formula
Test Duration (days) = Sample Size Required per Variant / Daily Traffic per Variant. Run until BOTH criteria are met: the pre-calculated sample size is reached and at least one full business cycle has elapsed.
Pro tip: Use dashboard alerting on guardrail metrics only during the test. Hide the primary metric from daily dashboards for experiment owners during the run to reduce the temptation to peek. Some teams adopt a strict "two-key" policy where two people must agree before a test is stopped early. Apply the same rigor you would to changing any high-stakes business process.
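The duration rule can be sketched in a few lines of Python. The function name is hypothetical, and the 7-day default for the business-cycle floor is an assumption you should adjust to your own cycle.

```python
import math


def required_duration_days(sample_size_per_variant, daily_traffic_per_variant, min_days=7):
    """Days to run: long enough for the sample size AND a full business cycle."""
    # Days needed to accumulate the required sample in each variant.
    days_for_sample = math.ceil(sample_size_per_variant / daily_traffic_per_variant)

    # Both criteria must hold, so take the larger of the two.
    return max(days_for_sample, min_days)


if __name__ == "__main__":
    # Hypothetical traffic: 1,000 users per variant per day.
    print(required_duration_days(18000, 1000))  # limited by sample size
    print(required_duration_days(2000, 1000))   # limited by the business cycle
```

The second call shows why the floor matters: a high-traffic product could reach its sample size in two days and still miss the weekday/weekend variation.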
Post-test analysis is where you determine whether observed differences reflect a real effect or random chance. Statistical significance testing replaces gut instinct with probability-based evidence.
Formula
p-value < 0.05 AND confidence interval excludes 0 → Statistically significant result
Pro tip: Use the PM Toolkit A/B Test Post-Analysis tool to calculate p-values, confidence intervals, and effect sizes from your raw counts. A statistically significant result is not a deployment mandate. Factor in implementation complexity, technical debt, downstream effects, and strategic alignment before shipping a variant. Statistical significance is one input to a business decision, not the decision itself.
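For readers who want to see what such a calculation involves, here is a minimal sketch of a two-proportion z-test in Python. It assumes a normal approximation, which is reasonable for large samples. The function name, the returned keys, and the example counts are illustrative, and this is not a description of how the PM Toolkit computes its results.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Compare conversion rates of control (a) and variant (b) from raw counts."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    # The p-value uses the pooled rate, because the null hypothesis says
    # both groups share a single conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # The confidence interval uses the unpooled standard error.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    significant = p_value < alpha and not (ci[0] <= 0 <= ci[1])
    return {"diff": diff, "p_value": p_value, "ci": ci, "significant": significant}


if __name__ == "__main__":
    # Hypothetical counts: 600/20,000 in control, 700/20,000 in variant.
    print(two_proportion_test(600, 20000, 700, 20000))
```

The sketch does not guard against empty groups or a pooled rate of exactly 0 or 1, so add input validation before using it on real data.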
An A/B test should run until it reaches both its pre-calculated sample size and at least one full business cycle (7-14 days minimum to capture weekday/weekend behavioral variation). Never stop a test early because it looks significant. That is peeking bias and it inflates your false positive rate. Use the PM Toolkit A/B Test Pre-Test Planner to calculate your exact required duration based on your traffic and minimum detectable effect.
Statistical significance (typically p < 0.05 at 95% confidence) means that if there were truly no difference between your control and variant, the probability of observing a result at least as extreme as yours by random chance alone is less than 5%. It does not mean your result is correct with 95% probability. It means you have enough evidence to reject the null hypothesis of no difference. Always combine statistical significance with practical significance (is the effect size worth acting on?) before making a deployment decision.
There is no universal minimum sample size for an A/B test. Required sample size depends on your baseline conversion rate, desired minimum detectable effect (MDE), statistical power (typically 80%), and significance level (typically 5%). A test with a 3% baseline conversion rate detecting a 0.5 percentage point lift requires roughly 18,000 to 20,000 users per variant, depending on the variance approximation used. Required sample size grows with the inverse square of the MDE, so halving the MDE roughly quadruples the sample you need. Use the PM Toolkit sample size calculator to get an exact number for your specific situation.
You can run multiple A/B tests at the same time, with caution. Simultaneous tests are valid as long as the tested changes are on different parts of the user flow and are unlikely to interact. If two tests could affect the same user behavior or metric, run them sequentially or use a factorial design. When running multiple tests, keep strict user-level assignment consistency: a user in experiment A should have a stable, random assignment in experiment B that is independent of their experiment A assignment.
Peeking bias occurs when you check A/B test results before the pre-planned end date and stop the test early when it reaches significance. Because standard statistical tests assume a fixed sample size, checking results at multiple interim points inflates the false positive rate. A test with 20 interim checks has an effective false positive rate of 20-40%, not 5%. To avoid peeking bias, commit to an end date before launching, monitor only guardrail metrics during the test, and use sequential testing methods such as the sequential probability ratio test (SPRT) or its mixture variant (mSPRT) if early stopping is a business requirement.
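You can see the inflation for yourself with a small simulation. The sketch below simulates tests in which there is no true difference, computes a z-statistic at each of 20 equally spaced looks, and compares stopping at the first significant look against checking only at the end. The function name, trial count, and seed are assumptions chosen for illustration.

```python
import math
import random


def false_positive_rates(num_trials=2000, num_looks=20, z_crit=1.96, seed=0):
    """Simulate A/A tests; return (rate with peeking, rate with one final check)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0

    for _ in range(num_trials):
        cumulative = 0.0
        crossed_at_any_look = False
        z = 0.0
        for look in range(1, num_looks + 1):
            # Each batch of new data adds an independent standard normal
            # increment to the running difference (there is no true effect).
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(look)
            if abs(z) > z_crit:
                crossed_at_any_look = True  # a peeker would stop here

        peeking_hits += crossed_at_any_look
        fixed_hits += abs(z) > z_crit  # only the final look counts

    return peeking_hits / num_trials, fixed_hits / num_trials


if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"stop at first significant look: {peeking:.1%}")
    print(f"single check at planned end:    {fixed:.1%}")
```

The single final check should stay near the nominal 5%, while the stop-when-significant policy lands several times higher, even though nothing about the underlying data has changed.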