PK Systems
Marketing

A/B Test Significance Calculator

Two-proportion z-test for split tests, with p-value, lift and confidence verdict.


What this calculator tells you

An A/B test compares two versions of a page or experience to see which one performs better. The challenge is separating real differences from random noise: with small samples, almost any difference can look like a winner. The two-proportion z-test answers the question 'how likely is this difference to be just chance?' The result is a p-value — the probability that you'd see a gap this big or bigger if both versions were actually identical. If the p-value is below your chosen alpha (typically 0.05 for 95% confidence), you can call the result statistically significant.

How to use this calculator

Plug in raw counts from your test platform — visitors and conversions per arm — and pick a confidence level.

  1. Enter the control group's visitors and conversions.
  2. Enter the variant group's visitors and conversions.
  3. Choose 95% for standard product testing, 99% for high-risk decisions.
  4. Read the verdict — and don't peek at it before the test ends.
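Steps 3 and 4 amount to a single comparison: the p-value against alpha, where alpha is one minus the chosen confidence level. A minimal sketch (the `verdict` helper is illustrative, not the calculator's actual API):

```python
def verdict(p_value, confidence=0.95):
    """Steps 3-4 as code: compare the p-value to alpha = 1 - confidence.
    `verdict` is an illustrative helper name, not this tool's API."""
    alpha = 1 - confidence  # e.g. 0.05 at 95% confidence
    return "statistically significant" if p_value < alpha else "not significant"
```

The same p-value can pass at 95% confidence and fail at 99%: `verdict(0.03)` is significant, `verdict(0.03, confidence=0.99)` is not.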

Formulas

The two-proportion z-test pools both samples to estimate a common variance, then measures how many standard errors apart the two proportions are.

p_A = c_A ÷ n_A    p_B = c_B ÷ n_B

p_pool = (c_A + c_B) ÷ (n_A + n_B)

SE = √( p_pool × (1 − p_pool) × (1/n_A + 1/n_B) )

Z = (p_B − p_A) ÷ SE

Lift % = ( (p_B − p_A) ÷ p_A ) × 100

  • n_A, n_B — visitors in control and variant.
  • c_A, c_B — conversions in control and variant.
  • Z — standardized distance between the two rates; converted to a two-sided p-value via the standard normal CDF.
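The formulas above translate directly into code. This sketch uses only the standard library, with the two-sided p-value computed from the normal CDF via `math.erf`; the sample numbers are made up for illustration:

```python
import math

def two_proportion_z_test(c_a, n_a, c_b, n_b):
    """Pooled two-proportion z-test. Returns (z, two-sided p-value, lift %)."""
    p_a = c_a / n_a                      # control conversion rate
    p_b = c_b / n_b                      # variant conversion rate
    p_pool = (c_a + c_b) / (n_a + n_b)   # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    lift = (p_b - p_a) / p_a * 100
    return z, p_value, lift

# Hypothetical test: 120/2400 control vs 150/2400 variant.
z, p, lift = two_proportion_z_test(120, 2400, 150, 2400)
# z ≈ 1.88, p ≈ 0.06, lift = 25% — a 25% lift that still misses 95% confidence.
```

Note how a sizeable lift can still fall short of significance when the absolute conversion counts are modest.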

Confidence-level reference

Pick the confidence level that matches the risk of being wrong on this decision.

Confidence   Z threshold (two-sided)   Max p-value   When to use
90%          1.645                     0.10          Exploratory / directional reads
95%          1.960                     0.05          Default for most product tests
99%          2.576                     0.01          High-stakes UX or pricing changes

Two-sided test; assumes independent visitors and a binary outcome (converted / not converted).
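The thresholds in the table aren't arbitrary: each is the inverse normal CDF evaluated at 1 − alpha/2. A quick check with the standard library's `statistics.NormalDist`:

```python
from statistics import NormalDist

nd = NormalDist()
# Z threshold for a two-sided test at each confidence level.
thresholds = {conf: round(nd.inv_cdf(1 - (1 - conf) / 2), 3)
              for conf in (0.90, 0.95, 0.99)}
# {0.9: 1.645, 0.95: 1.96, 0.99: 2.576} — matching the table above.
```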

Frequently asked questions

How big does my test need to be?

Bigger is always better, but as a rule of thumb you want at least 100 conversions per arm before reading results. Anything smaller is too noisy regardless of the p-value.

What is a p-value?

The probability that the difference you see (or larger) would occur by chance if the two versions were equally good. A p-value of 0.03 means that if both versions were truly identical, a gap this large would show up only 3% of the time. It is not the probability that the variant is better, and not the probability that your result is "true."

Why shouldn't I peek before the end?

Repeatedly checking inflates false positives. Each peek is another chance for noise to cross the threshold. Decide on a sample size up front and only call the test when it's reached.
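The inflation is easy to demonstrate with a Monte Carlo sketch: simulate tests where both arms have the same true rate (so every "significant" result is noise), and compare the false-positive rate when you peek at ten interim looks versus reading only the final sample. All parameters here are illustrative, not this tool's internals:

```python
import random

def peeking_false_positive_rate(n_sims=400, n_per_arm=1000, peeks=10,
                                z_crit=1.96, p_true=0.05, seed=1):
    """Simulate A/A tests (identical arms) and return two false-positive
    rates: calling the test at ANY of `peeks` interim looks, vs reading
    only the final sample. Both should be ~5% without peeking."""
    rng = random.Random(seed)
    step = n_per_arm // peeks
    any_hits = final_hits = 0
    for _ in range(n_sims):
        c_a = c_b = 0
        crossed = False
        z = 0.0
        for k in range(1, peeks + 1):
            n = k * step  # cumulative visitors per arm so far
            c_a += sum(rng.random() < p_true for _ in range(step))
            c_b += sum(rng.random() < p_true for _ in range(step))
            p_pool = (c_a + c_b) / (2 * n)
            var = p_pool * (1 - p_pool) * (2 / n)
            z = (c_b - c_a) / n / var ** 0.5 if var > 0 else 0.0
            if abs(z) > z_crit:
                crossed = True  # a peek would have declared a winner
        any_hits += crossed
        final_hits += abs(z) > z_crit
    return any_hits / n_sims, final_hits / n_sims

any_rate, final_rate = peeking_false_positive_rate()
```

With ten peeks, the any-peek rate typically lands several times above the nominal 5%, while the read-once rate stays near it.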

Is 95% confidence enough?

For most product changes, yes. For pricing, checkout, or anything that affects revenue at scale, jump to 99%. The cost of a wrong call is much higher there.

My result is significant but the lift is tiny — should I ship?

Significant ≠ meaningful. With huge samples, a 0.2% lift can be significant yet not worth the engineering or risk to ship. Compare lift to the cost of the change.

What if my test never reaches significance?

Either the effect is too small to matter, or you need more traffic. Decide on a minimum detectable effect (MDE) before launching; if you can't reach it in a reasonable window, the test is unanswerable at your scale.
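Deciding the MDE up front lets you compute the traffic you'd need before launching. This is the standard normal-approximation power formula, sketched under default assumptions of a two-sided test at 95% confidence and 80% power; your platform's calculator may use a slightly different (e.g. continuity-corrected) variant:

```python
import math
from statistics import NormalDist

def required_n_per_arm(p_base, mde_rel, confidence=0.95, power=0.80):
    """Rough visitors per arm needed to detect a relative lift of `mde_rel`
    over baseline conversion rate `p_base` (normal approximation)."""
    p1, p2 = p_base, p_base * (1 + mde_rel)
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - (1 - confidence) / 2)  # two-sided alpha
    z_beta = nd.inv_cdf(power)                      # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

For a 5% baseline, detecting a 10% relative lift takes on the order of 30,000 visitors per arm; a 30% relative lift needs only a few thousand. If that first number exceeds your realistic traffic window, the test is unanswerable at your scale.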