The single biggest mistake in A/B testing is shipping because p < 0.05. Statistical significance asks if the effect is real. Practical significance asks if it is worth it.
Last updated: 2026-04-01
Statistical significance is the claim that an observed difference is unlikely to have arisen by chance alone. It is reported via a p-value or confidence interval, with a conventional threshold of p < 0.05.
Best as a guard against acting on noise. A statistically significant result tells you the effect is probably real.
Practical significance is a business judgment about whether the size of the effect is worth shipping, given costs, risk, and opportunity cost. It is often expressed as a minimum effect of interest (MEI) or a minimum detectable effect (MDE).
Best as a sanity check. A practically significant result tells you the effect is worth acting on, not just measurable.
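z = (p1 - p2) / sqrt(p_pool × (1 - p_pool) × (1/n1 + 1/n2))

Here p1 and p2 are the observed conversion rates of the two groups, n1 and n2 are their sample sizes, and p_pool is the combined conversion rate across both groups. Compare z against the critical value for your alpha, or compare the resulting p-value against alpha directly. Default alpha = 0.05.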
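As a rough illustration, the z statistic above can be computed with nothing but the Python standard library. This is a minimal sketch, not a replacement for a statistics package: the function name `two_proportion_z_test` and the conversion counts are made up for the example, and the p-value is two-sided.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1 = conv_a / n_a
    p2 = conv_b / n_b
    # Pooled rate: all conversions divided by all users
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal: P(|Z| >= |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical data: 5.5% vs 5.0% conversion with 10,000 users per arm
z, p_value = two_proportion_z_test(conv_a=550, n_a=10_000, conv_b=500, n_b=10_000)
alpha = 0.05
print(f"z = {z:.3f}, p = {p_value:.4f}, significant = {p_value < alpha}")
```

With these made-up numbers, a visible half-point difference still does not clear alpha = 0.05, which is the kind of noise the statistical gate exists to catch.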
No formula. Set a minimum effect of interest (MEI) before the test starts. The MEI drives the MDE, which drives the sample size. Detecting half the effect requires roughly four times the sample.
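The "half the effect, four times the sample" rule can be checked with a standard normal-approximation sample size formula. The sketch below assumes a two-sided test, 80% power, a 5% baseline conversion rate, and an absolute MDE. All of those values, and the function name `sample_size_per_arm`, are illustrative choices rather than anything prescribed here.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per arm needed to detect an absolute lift of mde_abs."""
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Assumed 5% baseline; MDE of 1.0 point vs 0.5 point (absolute)
n_full = sample_size_per_arm(baseline=0.05, mde_abs=0.010)
n_half = sample_size_per_arm(baseline=0.05, mde_abs=0.005)
print(n_full, n_half, round(n_half / n_full, 2))
```

The ratio comes out slightly under 4 rather than exactly 4, because the variance term shifts a little with the target rate. The sample size scales with 1 / MDE², which is where the rule of thumb comes from.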
| Criteria | Statistical Significance | Practical Significance |
|---|---|---|
| Question answered | Is the effect real? | Is the effect worth it? |
| Type | Statistical claim | Business judgment |
| Reported as | p-value, confidence interval | Minimum effect of interest, MDE |
| With small samples | Critical. Easy to mistake noise for signal | Less critical. The number rarely beats MEI anyway |
| With huge samples | Risky. Everything becomes "significant" | Critical. Tiny effects pass p < 0.05 without mattering |
| When set | After the test, computed from data | Before the test, agreed by the team |
| The trap | Shipping anything with p < 0.05 | Setting MEI so high the test always under-powers |
| Healthy practice | Always check both gates | Always check both gates |
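One way to picture "always check both gates" is a small decision helper that only recommends shipping when the result passes the statistical gate and the observed lift meets the MEI. This is a simplified sketch under stated assumptions: the function name `two_gate_decision`, the absolute `mei_abs` parameter, and the traffic numbers are hypothetical, and the practical gate here compares only the point estimate against the MEI. A stricter team might compare a confidence bound instead.

```python
import math
from statistics import NormalDist

def two_gate_decision(conv_c, n_c, conv_t, n_t, mei_abs, alpha=0.05):
    """Check statistical and practical significance for treatment vs control."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    diff = p_t - p_c
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    statistically_significant = p_value < alpha   # gate 1: is it noise?
    practically_significant = diff >= mei_abs     # gate 2: is it big enough?
    return {
        "diff": diff,
        "p_value": p_value,
        "statistically_significant": statistically_significant,
        "practically_significant": practically_significant,
        "ship": statistically_significant and practically_significant,
    }

# Hypothetical huge-sample test: 5.00% vs 5.10% with 2,000,000 users per arm,
# and an MEI of 0.5 points (absolute) agreed before the test.
result = two_gate_decision(
    conv_c=100_000, n_c=2_000_000,
    conv_t=102_000, n_t=2_000_000,
    mei_abs=0.005,
)
print(result)
```

This reproduces the "huge samples" row of the table: the tiny lift passes p < 0.05 comfortably but falls short of the MEI, so the helper does not recommend shipping.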
The minimum detectable effect (MDE) is the smallest true effect your test can reliably detect, given your sample size, alpha, and power. Set the MDE before you run the test. Detecting half the effect requires roughly four times the sample size, so the MDE is the lever that controls test cost.
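Read in the other direction, the same relationship gives the MDE a fixed amount of traffic can support. The sketch below uses a common simplification (the baseline variance is applied to both arms) and assumes a two-sided test at 80% power. The function name `achievable_mde_abs`, the 5% baseline, and the traffic levels are illustrative.

```python
import math
from statistics import NormalDist

def achievable_mde_abs(baseline, n_per_arm, alpha=0.05, power=0.80):
    """Approximate smallest absolute lift detectable with n_per_arm users per arm."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Simplification: both arms are assumed to have the baseline variance
    return (z_alpha + z_beta) * math.sqrt(2 * baseline * (1 - baseline) / n_per_arm)

for n in (5_000, 20_000, 80_000):
    mde = achievable_mde_abs(baseline=0.05, n_per_arm=n)
    print(f"n per arm = {n:>6}: MDE about {mde:.4f} absolute ({mde / 0.05:.1%} relative)")
```

Each fourfold increase in traffic halves the detectable effect, which is why the MDE is effectively a budget decision.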
On its own, p < 0.05 is not enough to justify shipping. It only tells you the effect is probably real. It says nothing about whether the effect is large enough to matter. Always pair statistical significance with a check against your minimum effect of interest.
A result can look practically significant without being statistically significant, and that is often a sign your test is under-powered. A 10% lift you can see but can't statistically confirm usually means you didn't run long enough or didn't have enough traffic. Rerun the test with a larger sample.
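To see how an under-powered test produces this situation, the sketch below estimates the power to detect a true 10% relative lift at two traffic levels. It uses a normal approximation and a two-sided alpha of 0.05. The function name `approx_power`, the 5% baseline, and the sample sizes are assumptions made for the example.

```python
import math
from statistics import NormalDist

def approx_power(baseline, rel_lift, n_per_arm, alpha=0.05):
    """Approximate chance of reaching significance when the lift is real."""
    p1 = baseline
    p2 = baseline * (1 + rel_lift)
    se = math.sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of a significant result in the wrong direction
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)

power_small = approx_power(baseline=0.05, rel_lift=0.10, n_per_arm=5_000)
power_large = approx_power(baseline=0.05, rel_lift=0.10, n_per_arm=40_000)
print(f"5,000 per arm:  power about {power_small:.0%}")
print(f"40,000 per arm: power about {power_large:.0%}")
```

Under these assumptions, the smaller test would miss a real 10% lift most of the time, while the larger one would usually confirm it.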
A good MDE is the smallest lift that would change a real decision: ship/no-ship, build/don't build. For most growth tests, that's 3 to 5%. For pricing tests, 1 to 2%. For experimental redesigns, sometimes 10%+. The MDE should reflect the cost of acting, not the appetite of the team.
Raising the MDE shrinks the required sample because larger effects are easier to detect. Detecting a 10% lift requires far less sample than detecting a 1% lift, so raising the MDE is the most direct way to make a test feasible at your traffic level. The tradeoff: smaller real effects become invisible.