Statistical vs Practical Significance

The single biggest mistake in A/B testing is shipping solely because p < 0.05. Statistical significance asks whether the effect is real. Practical significance asks whether it is worth acting on.

Last updated: 2026-04-01

Overview

Statistical Significance
Is It Real?

A statistical claim that a difference at least as large as the one observed would be unlikely if there were no true effect. Reported via a p-value or confidence interval. Standard threshold: p < 0.05.

Best as a guard against acting on noise. A statistically significant result tells you the effect is probably real.
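A confidence interval carries the same information as the p-value in a form that is easier to compare against an effect size. The sketch below is a minimal illustration, assuming a Wald interval with an unpooled standard error; the function name and the conversion counts are made up for the example.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(x1, n1, x2, n2, confidence=0.95):
    """Wald interval for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Two-sided critical value, about 1.96 for 95% confidence
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1 - p2
    return diff - z_crit * se, diff + z_crit * se

# Hypothetical data: 560 of 10,000 convert in the variant, 500 of 10,000 in control
lo, hi = diff_confidence_interval(x1=560, n1=10_000, x2=500, n2=10_000)
print(f"95% CI for the difference: [{lo:.4f}, {hi:.4f}]")
print("Interval excludes zero:", lo > 0 or hi < 0)
```

If the interval contains zero, the result is not statistically significant at the matching alpha.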

Practical Significance
Is It Worth It?

A business judgment about whether the size of the effect is worth shipping, given costs, risk, and opportunity cost. Often expressed as a minimum effect of interest (MEI), which in turn sets the minimum detectable effect (MDE) the test is powered for.

Best as a sanity check. A practically significant result tells you the effect is worth acting on, not just measurable.

Formula comparison

Statistical Significance

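z = (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))

Here p1 and p2 are the observed conversion rates, n1 and n2 are the sample sizes, and p_pool = (x1 + x2) / (n1 + n2) is the pooled rate, where x1 and x2 are the conversion counts.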

Compare |z| against the critical value for your alpha (1.96 for a two-sided test at alpha = 0.05), or compare the resulting p-value against alpha directly. Default alpha = 0.05.
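As a rough sketch, the z formula above can be computed with the Python standard library alone. The function name and the counts are illustrative assumptions; the two-sided p-value comes from the standard normal distribution via math.erfc.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test. Returns (z, two-sided p-value)."""
    p1 = x1 / n1
    p2 = x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # P(|Z| >= |z|) for a standard normal Z equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical data: variant 560 / 10,000 vs control 500 / 10,000
z, p_value = two_proportion_z_test(x1=560, n1=10_000, x2=500, n2=10_000)
print(f"z = {z:.3f}, p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```

With these made-up numbers a 12% relative lift still falls just short of the 0.05 threshold, which shows how noisy conversion data is at this sample size.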

Practical Significance

No formula. Set a minimum effect of interest (MEI) before the test starts.

The MEI drives the MDE, which drives the sample size. Detecting half the effect requires roughly four times the sample.
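The four-times rule can be checked with the usual normal-approximation sample-size formula for two proportions. This is a sketch under stated assumptions: a two-sided test, equal arm sizes, an absolute MDE, and a hypothetical 5% baseline. The function name is illustrative.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p1 = baseline
    p2 = baseline + mde_abs
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

n_full = sample_size_per_arm(baseline=0.05, mde_abs=0.010)
n_half = sample_size_per_arm(baseline=0.05, mde_abs=0.005)
print(f"MDE 1.0 pt: {n_full} per arm")
print(f"MDE 0.5 pt: {n_half} per arm ({n_half / n_full:.2f}x)")
```

The ratio lands a little under four because the variance term shifts slightly with the target rate. The squared MDE in the denominator is what drives the rule.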

Side-by-side comparison

Criteria | Statistical Significance | Practical Significance
Question answered | Is the effect real? | Is the effect worth it?
Type | Statistical claim | Business judgment
Reported as | p-value, confidence interval | Minimum effect of interest, MDE
With small samples | Critical. Easy to mistake noise for signal | Less critical. The observed effect rarely beats the MEI anyway
With huge samples | Risky. Everything becomes "significant" | Critical. Tiny effects pass p < 0.05 without mattering
Set when | After the test, computed from data (alpha is fixed beforehand) | Before the test, agreed by the team
The trap | Shipping anything with p < 0.05 | Setting the MEI so high the test is always under-powered
Healthy practice | Always check both gates | Always check both gates

When to use each

Choose Statistical Significance when
  • Sample size is small. Small samples are noisy and easy to misread
  • The cost of a false positive is high (a launch that breaks something)
  • Stakeholders will scrutinize the result
  • You're under pressure to ship and need to defend the decision

Choose Practical Significance when
  • Sample size is huge. Tiny differences become "significant" but not meaningful
  • The change has costs beyond the test (engineering, support, risk)
  • You're prioritizing a roadmap of experiments. Small wins eat capacity
  • The effect size is below your minimum effect of interest
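To see why huge samples call for the practical gate, the sketch below holds a tiny observed difference fixed (5.05% vs 5.00%, hypothetical rates) and only grows the sample. It assumes equal arm sizes and the pooled z-test; the function name is illustrative.

```python
import math

def p_value_for_rates(p1, p2, n_per_arm):
    """Two-sided p-value of the pooled z-test for fixed observed rates."""
    p_pool = (p1 + p2) / 2  # valid because both arms have the same size
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n_per_arm)
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Same 0.05-point difference, increasing traffic
for n in (10_000, 100_000, 1_000_000, 10_000_000):
    p = p_value_for_rates(0.0505, 0.0500, n)
    print(f"n per arm = {n:>10,}  p = {p:.2g}  significant: {p < 0.05}")
```

The effect never changes, only the verdict does. Whether a 0.05-point lift is worth shipping is a question the p-value cannot answer.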

Pros and cons

Statistical Significance

Pros

  • Quantitative gate against noise
  • Standard. Stakeholders recognize p-values and confidence intervals
  • Pairs naturally with sample-size planning

Cons

  • Easy to misuse at very large samples (everything becomes significant)
  • Often treated as binary when the p-value is a continuous measure of evidence
  • Doesn't speak to whether the effect matters

Practical Significance

Pros

  • Forces the business question before the test
  • Filters out small wins that aren't worth the cost of shipping
  • Makes the test plan honest about the MDE

Cons

  • "Worth it" depends on the team and the moment
  • Easy to forget. Most A/B testing tools don't surface it
  • Can be gamed by setting an MDE so large the test is doomed to under-power

Frequently asked questions

What is the minimum detectable effect (MDE)?

The smallest true effect your test can reliably detect, given your sample size, alpha, and power. Set the MDE before you run the test. Detecting half the effect requires roughly four times the sample size, so the MDE is the lever that controls test cost.

Is p < 0.05 enough to ship?

No, not on its own. p < 0.05 only tells you the effect is probably real. It says nothing about whether the effect is large enough to matter. Always pair statistical significance with a check against your minimum effect of interest.
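One way to make the two gates explicit is a small decision helper. This is a sketch, not a standard API: the function name, the returned keys, and the example counts are assumptions, and the MEI is taken as an absolute difference in conversion rate.

```python
import math

def ship_decision(x_control, n_control, x_variant, n_variant, mei_abs, alpha=0.05):
    """Check the statistical gate (p < alpha) and the practical gate (lift >= MEI)."""
    p_c = x_control / n_control
    p_v = x_variant / n_variant
    diff = p_v - p_c
    p_pool = (x_control + x_variant) / (n_control + n_variant)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_variant))
    p_value = math.erfc(abs(diff / se) / math.sqrt(2))  # two-sided
    statistical = p_value < alpha
    practical = diff >= mei_abs
    return {
        "lift": diff,
        "p_value": p_value,
        "statistically_significant": statistical,
        "practically_significant": practical,
        "ship": statistical and practical,
    }

# Hypothetical test: 5.0% vs 5.1% on a million users per arm, MEI of 0.5 points
result = ship_decision(50_000, 1_000_000, 51_000, 1_000_000, mei_abs=0.005)
print(result)
```

Here the lift clears p < 0.05 comfortably but is a fifth of the agreed MEI, so the helper says not to ship. Comparing the point estimate to the MEI is the simplest version of the practical gate; a stricter variant could compare a confidence bound instead.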

Can a result be practically significant but not statistically significant?

Yes, and it's often a sign your test is under-powered. A 10% lift you can see but can't statistically confirm means you didn't run long enough or didn't have enough traffic. Rerun the test with a larger sample.
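Under-powering can be estimated before the rerun. The sketch below uses a normal approximation to power for a two-sided test and ignores the negligible chance of rejecting in the wrong direction. The 5% baseline, the sample sizes, and the function name are assumptions for illustration.

```python
import math
from statistics import NormalDist

def approx_power(p_control, p_variant, n_per_arm, alpha=0.05):
    """Approximate chance of reaching p < alpha if the true rates are as given."""
    se = math.sqrt(
        p_control * (1 - p_control) / n_per_arm
        + p_variant * (1 - p_variant) / n_per_arm
    )
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p_variant - p_control) / se - z_crit)

# True 10% relative lift on a 5% baseline (5.0% -> 5.5%)
for n in (5_000, 32_000):
    print(f"n per arm = {n:>6,}  power = {approx_power(0.05, 0.055, n):.2f}")
```

Under these assumptions, 5,000 users per arm would confirm a real 10% lift only about one time in five, while roughly 32,000 per arm reaches the conventional 80%.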

What's the right MDE for my test?

The smallest lift that would change a real decision: ship/no-ship, build/don't build. For most growth tests, that's 3 to 5%. For pricing tests, 1 to 2%. For experimental redesigns, sometimes 10%+. The MDE should reflect the cost of acting, not the appetite of the team.

Why does sample size shrink with a larger MDE?

Because larger effects are easier to detect. Detecting a 10% lift requires far less sample than detecting a 1% lift. Raising the MDE is the most direct way to make a test feasible at your traffic level. The tradeoff: smaller real effects become invisible.
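The same relationship can be read in reverse: given the traffic you have, what is the smallest lift the test can detect? The sketch below inverts the sample-size formula, assuming both arms have roughly the baseline variance (a common simplification), a two-sided alpha of 0.05, and 80% power. The function name and the 5% baseline are illustrative.

```python
import math
from statistics import NormalDist

def detectable_effect(baseline, n_per_arm, alpha=0.05, power=0.80):
    """Approximate absolute MDE for a given per-arm sample size."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Simplification: use the baseline variance for both arms
    se = math.sqrt(2 * baseline * (1 - baseline) / n_per_arm)
    return (z_alpha + z_beta) * se

baseline = 0.05
for n in (10_000, 100_000, 1_000_000):
    mde = detectable_effect(baseline, n)
    print(f"n per arm = {n:>9,}  MDE = {mde:.4f} abs ({mde / baseline:.1%} relative)")
```

If the MDE at your traffic level is larger than any lift you could plausibly achieve, the test is not feasible as designed, however the MEI is set.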