Statistical Significance Calculator
Enter your control and variant data to find out if your A/B test results are real — not luck. Get instant p-values, confidence intervals, and relative uplift.
Haven't run your test yet? Calculate how many visitors you need before you start.
A/B Test Power Calculator →
Control
Variant
Test Results
p-value
Relative uplift
Control rate
Variant rate
Confidence interval
The true difference in conversion rate is likely within this range.
How to Run an A/B Test
Five steps to a statistically valid experiment — from hypothesis to launch decision.
Define your hypothesis
Start with a specific, testable claim. Not 'test red vs. blue' but 'changing our CTA from Submit to Get Free Audit will increase form submissions by at least 10%.' A strong hypothesis defines what you're measuring and what you expect to see. It keeps the test honest and stops you changing the goal mid-experiment.
Calculate your required sample size
Before you start, work out how many conversions you need per group. This depends on your baseline conversion rate, the minimum lift you care about detecting (5%? 20%?), and your confidence level (95% is standard). Skipping this step is the most common cause of inconclusive tests — you either stop too early or run far longer than necessary. Use our A/B Test Power Calculator to get the exact number before your test starts.
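If you'd like to sanity-check the number, the standard closed-form approximation for a two-proportion z-test is easy to compute yourself. A minimal Python sketch (the function name and the 80% power default are illustrative, not necessarily what our calculator uses):

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_group(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per group for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)      # variant rate implied by the lift
    z_alpha = norm.ppf(1 - alpha / 2)        # 1.96 at 95% confidence
    z_beta = norm.ppf(power)                 # 0.84 at 80% power
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# 5% baseline, aiming to detect a 20% relative lift
print(sample_size_per_group(0.05, 0.20))     # ~8,158 visitors per group under these defaults
```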
Assign traffic randomly
Split your audience 50/50 between control and variant using a proper randomisation method. Avoid manual splits like 'Monday traffic to control, Wednesday to variant' — day-of-week behaviour differences will corrupt your results. Most A/B testing tools handle this automatically.
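For the curious: rather than a literal coin flip, testing tools typically bucket each visitor with a deterministic hash of their ID, so a returning visitor always lands in the same group. A rough Python sketch of that idea (the experiment name is a made-up example):

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "cta-copy-test") -> str:
    """Deterministically bucket a user 50/50 based on a hash of their ID."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100           # stable value in 0..99
    return "control" if bucket < 50 else "variant"

print(assign_variant("user-12345"))  # the same user always gets the same arm
```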
Run the test for the full duration
Set the end date before you start, then don't look at results until you hit your sample size. Every time you check and consider stopping, you inflate your false-positive rate. A 15% lift after one week can easily vanish by week four. Patience here is not optional — it's the method.
Analyse results with this calculator
Once you've hit your sample size, enter your numbers above. If p < 0.05, your result is statistically significant at 95% confidence — launch the winner. If p > 0.05, you don't have enough evidence yet. Either extend the test or call it a draw.
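Under the hood, a significance check like this is typically a two-proportion z-test on the pooled conversion rate. Here is a minimal Python sketch of that standard test; it isn't necessarily the exact implementation behind this calculator:

```python
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))            # two-sided p-value
    return z, p_value

z, p = two_proportion_z_test(500, 10_000, 580, 10_000)  # 5.0% vs 5.8%
print(f"z = {z:.2f}, p = {p:.4f}")           # p ≈ 0.012: significant
```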
Common A/B Testing Pitfalls
Most bad test results come from one of these six mistakes — not from the variant being bad.
Peeking at results
Checking results daily and stopping when you like what you see pushes your real false-positive rate from 5% to 25% or higher. Decide your sample size upfront. Then don't look until you hit it.
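You don't have to take the inflation on faith. Simulate many A/A tests (where there is no real difference), peek at fixed checkpoints, and count how often at least one peek crosses p < 0.05. A rough Python sketch, with illustrative checkpoint counts:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
rate, n_total, peeks, trials = 0.05, 20_000, 10, 2_000
false_positives = 0

for _ in range(trials):                      # A/A tests: no real difference
    a = rng.random(n_total) < rate           # control conversions
    b = rng.random(n_total) < rate           # "variant" is identical
    for n in np.linspace(n_total / peeks, n_total, peeks, dtype=int):
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = (pooled * (1 - pooled) * 2 / n) ** 0.5
        z = (b[:n].mean() - a[:n].mean()) / se if se > 0 else 0.0
        if 2 * norm.sf(abs(z)) < 0.05:       # looks "significant" -> stop early
            false_positives += 1
            break

print(f"False-positive rate with peeking: {false_positives / trials:.1%}")
```

With ten peeks this typically lands around 20%, not the 5% you planned for.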
Stopping too early
A 20% lift after two weeks looks convincing. It may be noise. Early data is the most volatile. The effect tends to stabilise — or disappear — as more data comes in. Run the full duration you planned.
Sample size too small
50 conversions per group isn't enough to detect anything meaningful. Underpowered tests either miss real winners or declare noise as signal. Calculate sample size before you start, not after you're disappointed with the results.
Testing too many variants
Running 5 variants against one control pushes your real false-positive rate to ~23%. Test one change at a time. If you must test multiple variants, apply a Bonferroni correction — divide your p-value threshold by the number of comparisons.
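The correction itself is trivial to apply; what matters is remembering to apply it. A quick Python sketch:

```python
def bonferroni_threshold(alpha: float, comparisons: int) -> float:
    """Per-comparison threshold that keeps the overall false-positive rate at alpha."""
    return alpha / comparisons

# 5 variants against one control = 5 comparisons
print(bonferroni_threshold(0.05, 5))  # 0.01: each variant must hit p < 0.01
# Without correction, the family-wise error is 1 - 0.95**5 ≈ 0.226 (~23%)
```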
Ignoring seasonality
Testing across Black Friday, a major email send, or a competitor promotion contaminates your data. Run full weeks (Monday–Sunday) to average out day-of-week effects. Avoid starting tests during known traffic anomalies.
Confusing statistical with practical significance
p = 0.03 on a 0.2% lift is significant but not meaningful. Always define your minimum detectable effect based on business impact before you run the test — not after you see the results.
Frequently Asked Questions
What is a p-value?
A p-value is the probability of seeing a difference at least as large as yours if the control and variant actually perform identically. A p-value of 0.05 means that, if there were no real difference, a result this extreme would show up only 5% of the time. The conventional cutoff is p < 0.05, which means you're willing to accept a 5% false-positive rate. Lower p-values indicate stronger evidence.
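If you want to see that 5% false-positive rate in action, simulate A/A tests where nothing actually differs: roughly 5% will still come out 'significant'. A quick Python sketch with illustrative sample sizes:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n, rate, trials, hits = 10_000, 0.05, 5_000, 0

for _ in range(trials):                  # A/A tests with no true difference
    a = rng.binomial(n, rate)            # control conversions
    b = rng.binomial(n, rate)            # identical "variant"
    pooled = (a + b) / (2 * n)
    se = (pooled * (1 - pooled) * 2 / n) ** 0.5
    z = (b / n - a / n) / se
    if 2 * norm.sf(abs(z)) < 0.05:
        hits += 1

print(f"{hits / trials:.1%} 'significant' by luck alone")  # close to 5%
```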
What's the difference between a p-value and a confidence interval?
A p-value answers a yes/no question: is this result statistically significant? A confidence interval answers a different question: how large is the effect? For example, 'the true lift is between 4% and 18%.' Confidence intervals are often more useful for business decisions because they show magnitude, not just significance.
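For reference, the interval reported for the absolute difference can be approximated with a standard Wald interval: the observed difference plus or minus 1.96 standard errors. A minimal Python sketch (not necessarily the exact method this calculator uses):

```python
from scipy.stats import norm

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the absolute difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = norm.ppf(1 - (1 - confidence) / 2)   # 1.96 for 95%
    diff = p_b - p_a
    return diff - z * se, diff + z * se

low, high = diff_confidence_interval(500, 10_000, 580, 10_000)
print(f"95% CI for the lift: {low:+.2%} to {high:+.2%}")  # +0.17% to +1.43%
```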
How long should I run an A/B test?
Until you hit your pre-calculated sample size — and for at least 1–2 full weeks to smooth out day-of-week variation. Calculate the required sample size before you start, based on your baseline conversion rate and the minimum lift you want to detect. Never set your test duration based on how quickly you see 'significant' results.
Can I stop the test early if results are clearly winning?
No. Stopping early when results look strong is the most common testing mistake — it inflates your false-positive rate from 5% to 25% or higher. Decide your sample size upfront and honour it. If you're tempted to stop early, have a teammate lock the test end date in your testing tool before you start.
What's the difference between statistical and practical significance?
Statistical significance means the result is probably not luck. Practical significance means it is large enough to act on. A 0.1% lift can be statistically significant with a huge sample but not worth the engineering effort to launch. Always define your minimum detectable effect based on business impact — not just whether you can detect something.
What sample size do I need?
It depends on your baseline conversion rate and the lift you want to detect. A rough guide: a 5% baseline with a 20% relative lift target requires ~1,900 conversions per group at 95% confidence. A 1% baseline with the same target requires ~9,500 per group. Use our A/B Test Power Calculator to get the exact number for your test.
My p-value is 0.06 — is that close enough?
No. 0.06 is not statistically significant at the 95% threshold. You have two options: collect more data and retest if your original sample size was too small, or accept the test as inconclusive. Do not lower your threshold to 0.10 to force significance — doing so doubles your false-positive rate.
How do I know if my result is a fluke?
That is exactly what this calculator tells you. If p < 0.05, a result this large would be rare under pure chance, so you can treat it as real at the 95% confidence level. If p > 0.05, you cannot rule out luck: run the test longer, increase your sample size, or call it a draw and move on to the next hypothesis.
Planning your next test?
Calculate the sample size you need before you start — so you know exactly when to stop.
Work with Jarrah
Ready to scale your winners?
We run paid media and CRO programs built on rigorous testing — not hunches.