Every marketer these days is "data-driven." And nothing says data-driven quite like A/B testing — the noble practice of showing two versions of something to different people and seeing which one wins.
In theory, it's beautiful. In practice, it's a crime scene.
I spent years as an actuary before moving into growth marketing analytics, which means I've had the unique pleasure of watching people apply vibes-based statistics to decisions involving significant budget. This article is the result of that experience — a kind of field guide to what good A/B testing looks like, written partly as genuine best practice, and partly as therapy.
The Hall of Shame: Unbelievable Mistakes Seen in the Wild
Let's start with the horror stories, because frankly they deserve to go first.
1. Removing the Control Entirely
Picture the scene. A team is excited about two new homepage designs. They're both bold departures from the current page. Someone suggests A/B testing them. Everyone nods. The test runs. A wins. Champagne.
What nobody did was keep the original page as a control.
They tested variant A against variant B, declared a winner, shipped it, and only noticed months later that overall conversion had dropped compared to the historical baseline. Both variants were worse than the original. They'd just been comparing two bad options and calling the least bad one a success.
Key takeaway
The control group is not optional. It is the entire point. Without a baseline, you're not running an experiment — you're just picking favourites with extra steps.
2. Triggering the Test on Site Arrival When the Change Is on the Pricing Page
This one is subtle enough that people don't always see it coming, which makes it worse.
The scenario: you're testing a new pricing page layout. The test gets implemented by triggering on session start — meaning every visitor to the site gets assigned to a variant the moment they land, regardless of whether they ever reach the pricing page.
The result: the vast majority of your "experiment" population never sees the change you're testing. They're assigned to control or treatment, they browse the blog, they leave, and they get counted as non-converting in both groups. The signal you're looking for is buried under a mountain of irrelevant noise, your effective sample size is a fraction of what you think it is, and your results are essentially meaningless.
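To make the dilution concrete, here's a back-of-envelope sketch in Python. Every number in it is hypothetical, purely for illustration:

```python
# Back-of-envelope: what triggering at site arrival does to your signal.
# All numbers are invented for illustration.

f = 0.10            # fraction of sessions that ever reach the pricing page
p_control = 0.050   # conversion rate among pricing-page viewers, control
p_treat = 0.055     # the variant genuinely lifts that rate by 10%
q = 0.020           # conversion rate of everyone else (other routes to purchase)

# Measured at the point of exposure: a clean 10% relative lift.
lift_exposed = p_treat / p_control - 1

# Measured site-wide: the 90% who never saw the page convert identically
# in both arms, burying the same absolute difference under unrelated conversions.
cvr_control = f * p_control + (1 - f) * q
cvr_treat = f * p_treat + (1 - f) * q
lift_sitewide = cvr_treat / cvr_control - 1

print(f"Relative lift at exposure: {lift_exposed:.1%}")   # 10.0%
print(f"Relative lift site-wide:   {lift_sitewide:.1%}")  # ~2.2%
```

Same change, same affected users. But the site-wide framing turns a 10% lift into a 2% blip sitting on top of conversions your variant could never have influenced, and you need several times more traffic to detect it.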
Key takeaway
Trigger your test at the point of exposure. If the change is on the pricing page, trigger on pricing page arrival. Measure from the moment the user could actually be affected — not from when they walked in the front door.
3. Using Churn as the Success Metric for a New User Page View Test
This one deserves a special kind of recognition.
A team tests a redesigned onboarding flow for new users. Solid idea. The test is set to run for two weeks. Then someone picks the primary success metric: churn — defined as users who cancel within 90 days.
A two-week test. A 90-day outcome metric.
Not only will the test end long before most of your subjects have had a chance to churn (or not), but by the time you have enough data, the test has been over for months, the team has moved on to three other initiatives, and any causal link between the onboarding change and the churn outcome is essentially impossible to establish. You've also got survivorship effects, seasonal variation, product changes in the interim, and about fifteen other confounders having a party in your data.
Key takeaway
Pick a success metric that you can actually measure within the window of your test. Proxy metrics exist for a reason. If you're testing onboarding, measure activation rate, time-to-first-value, or early engagement — things that move within days and correlate with downstream retention. Churn is a business outcome, not a test metric.
4. Stopping the Test the Moment You See a Positive Result
Known in the statistics literature as "optional stopping" or the "peeking problem," this is perhaps the most widespread sin in A/B testing, and it's committed by people who genuinely think they're being efficient.
It goes like this: the test launches Monday. By Wednesday, variant B is showing a 15% lift. Someone checks the dashboard, sees p=0.04, declares victory, and stops the test. Ship it.
What's actually happened is a collision of two phenomena. First, early data is noisy — small samples produce extreme values by chance, and what looks like a strong signal on day two often regresses toward the mean by day ten. Second, every time you peek at a running test and make a stopping decision, you're effectively running multiple statistical tests, which inflates your false positive rate well beyond the 5% you think you're operating at. Check daily for two weeks and your actual false positive rate can exceed 30%.
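If you want to watch this happen, here's a toy simulation: an A/A test with no real difference between arms, peeked at daily for two weeks. All parameters are invented for illustration.

```python
# Simulate the peeking problem: two identical variants (no true effect),
# a two-proportion z-test run at every daily peek.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n_tests, days, daily_n, p = 10_000, 14, 500, 0.05  # visitors per arm per day

stopped_early = 0       # "significant" at any daily peek
significant_at_end = 0  # significant at the planned end only

for _ in range(n_tests):
    a = rng.binomial(daily_n, p, size=days).cumsum()  # cumulative conversions, arm A
    b = rng.binomial(daily_n, p, size=days).cumsum()  # cumulative conversions, arm B
    n = daily_n * np.arange(1, days + 1)              # cumulative visitors per arm
    pooled = (a + b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (b - a) / n / se
    pvals = 2 * norm.sf(np.abs(z))                    # two-sided p at each peek
    stopped_early += (pvals < 0.05).any()
    significant_at_end += pvals[-1] < 0.05

print(f"Stop at the first p < 0.05 peek: {stopped_early / n_tests:.1%}")
print(f"Single look on day 14:           {significant_at_end / n_tests:.1%}")
```

The second number lands around the advertised 5%; the first comes out several times higher, in the ballpark the paragraph above describes.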
Key takeaway
Run your test for the duration you planned. Commit to it. If the result looks amazing on day three, it will still look amazing on day fourteen — and if it doesn't, you just saved yourself from shipping a false positive.
5. Lowering Your Significance Threshold to Get a Winner
And now we arrive at what might be the most creative form of self-deception in the entire field.
The test has run. The results are in. p=0.12. Not significant at the conventional 95% threshold. Sad. But then someone says — and I have heard this said, out loud, in a professional context — "what if we use a 70% confidence threshold instead?"
Let's be very clear about what a 70% confidence threshold means: you are accepting a 30% false positive rate. If there were genuinely no difference between your variants, you'd still crown a winner nearly one time in three. For perspective, your chance of correctly calling two coin flips in a row is 25%; your tolerance for being wrong is worse than that. You have not found a winner. You have found a result that might be a winner, or might be random variation, and you have decided you're fine with not knowing which.
The significance threshold is not a dial you turn until the result you want appears. It is a pre-specified decision rule that you set before the test runs, based on the level of false positive risk you're willing to accept. In most marketing contexts, 95% (p<0.05) is the standard. In higher-stakes or lower-volume contexts, some practitioners use 90%. Below that, you're essentially guessing with extra steps.
Key takeaway
If your test isn't significant, the correct response is to either run it longer (if you're underpowered), accept that there's no detectable effect, or go back and rethink the hypothesis. Not to renegotiate the rules of statistics.
The five deadly sins of A/B testing
1. No control group
2. Wrong trigger point
3. Mismatched metric window
4. Peeking & stopping early
5. Lowering the bar
The Merely Bad Mistakes (Seen Constantly)
Not Calculating Sample Size Before You Start
Most A/B tests in the wild are either stopped too early or run too long, and the reason is almost always the same: nobody calculated how large the sample needed to be before launching.
A power calculation tells you, upfront, how many visitors you need in each group to have a reasonable chance of detecting an effect of a given size. It takes four inputs: your baseline conversion rate, the minimum effect size you care about detecting, your significance threshold (usually 95%), and your desired statistical power (typically 80%). Without this calculation, you're flying blind — you have no idea whether your test is sensitive enough to find a real effect even if one exists.
Before you launch your next test, run the numbers. The A/B Test Sample Size Calculator will do the maths for you — plug in your baseline and minimum detectable effect and it'll tell you exactly how large your sample needs to be.
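If you'd rather script it, here's a minimal sketch using statsmodels. The baseline and minimum detectable effect below are placeholders; substitute your own.

```python
# Per-group sample size for a two-proportion test at 95% significance, 80% power.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                 # current conversion rate (placeholder)
mde_relative = 0.10             # smallest relative lift worth detecting (placeholder)
target = baseline * (1 + mde_relative)

effect = proportion_effectsize(target, baseline)  # Cohen's h
n_per_group = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,                 # significance threshold
    power=0.80,                 # desired power
    alternative="two-sided",
)
print(f"Required sample size: {n_per_group:,.0f} per group")
```

Fair warning: for small baselines and modest lifts the answer is routinely in the tens of thousands per group, which is exactly the information most teams would rather not have before launching.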
Running Tests Only on Partial Traffic Cycles
A test that runs Monday to Friday is not a representative sample of your audience. A test that runs during a promotional period is not representative of normal behaviour. A test that launches immediately after a major email campaign inflates your traffic with a self-selected, highly engaged segment that looks nothing like your typical visitor.
Run tests for complete business cycles — at minimum one full week, ideally two to four. This accounts for day-of-week effects, captures your normal traffic mix, and gives you results that will actually hold when you ship to 100% of traffic.
Testing Too Many Variants Without Correction
More variants means more comparisons. More comparisons means more opportunities for a false positive to sneak through. If you test five variants against a control, the probability that at least one of them looks significant by chance is substantially higher than 5% — even if none of them actually work.
If you're running multi-variant tests, either correct for multiple comparisons (Bonferroni is the simple version), pre-specify which variant is your primary one, or accept that your exploratory results need replication before you act on them.
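The arithmetic, treating the comparisons as independent (a simplification, since variants sharing one control are correlated, but the direction of the problem is the same):

```python
# Family-wise error rate for k comparisons at alpha = 0.05,
# and the Bonferroni correction that restores it.
alpha, k = 0.05, 5

fwer = 1 - (1 - alpha) ** k   # chance of at least one false positive: ~22.6%
bonferroni = alpha / k        # corrected per-comparison threshold: 0.01

print(f"5 variants, no correction: {fwer:.1%} chance of a spurious 'winner'")
print(f"Bonferroni threshold per comparison: p < {bonferroni}")
```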
Confusing Statistical Significance with Practical Significance
p<0.05 tells you the result is unlikely to be noise. It tells you nothing about whether the result is worth acting on.
A 0.2% lift in conversion rate can be statistically significant at p=0.001 if your sample size is large enough. Whether that lift justifies the development cost, the design debt, and the ongoing maintenance of a changed page is an entirely separate question that statistics cannot answer for you. Always ask: even if this is real, does it matter?
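A back-of-envelope check is all this takes. Every figure below is invented for illustration:

```python
# Does a statistically significant 0.2% relative lift actually pay its way?
annual_visitors = 2_000_000      # hypothetical traffic
baseline_cvr = 0.02              # hypothetical conversion rate
avg_order_value = 60.0           # hypothetical revenue per conversion
relative_lift = 0.002            # the significant-but-tiny result

extra_conversions = annual_visitors * baseline_cvr * relative_lift
extra_revenue = extra_conversions * avg_order_value   # $4,800 per year
cost_to_ship = 15_000.0          # hypothetical dev time, design debt, upkeep

print(f"Extra revenue per year: ${extra_revenue:,.0f}")
print(f"Cost to ship and carry: ${cost_to_ship:,.0f}")
```

Under these made-up numbers the "winner" loses money for three years. p-values don't know about your engineering budget.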
Best Practices: What Good A/B Testing Actually Looks Like
Write Your Hypothesis Before You Touch Anything
A well-formed hypothesis has three parts: what you're changing, why you expect it to have an effect, and what metric you expect to move. "We're testing a new button colour" is not a hypothesis. "We believe changing the CTA from grey to high-contrast orange will increase click-through rate because the current button has low visual salience relative to the page background" is a hypothesis.
Writing it down forces you to think about causality, commits you to a primary metric before the data is visible, and gives you something to learn from regardless of the outcome.
Calculate Your Required Sample Size First
Before a single visitor is assigned to a variant, you should know how many you need. Use your historical baseline conversion rate, decide on the minimum lift you'd actually care about (be honest — a 1% improvement on a 2% conversion rate might not be worth shipping), set your desired power at 80%, and run the calculation.
The A/B Test Sample Size Calculator handles this in seconds. If the required sample size is larger than you can realistically collect in a reasonable timeframe, that's important information — it means either your test is underpowered, your minimum detectable effect is unrealistically small, or the hypothesis isn't worth testing right now.
Run for Full Business Cycles and Don't Stop Early
Commit to the sample size you calculated, let the test run for at least one full week (preferably two or more), and do not make stopping decisions based on interim results. Set the end date before you launch. Stick to it.
Check for Sample Ratio Mismatch
Before you trust any results, verify that the traffic was split as intended. If you specified a 50/50 split and ten thousand visitors came out 52/48, something is wrong — a tracking bug, a bot filter applying unevenly, a caching issue. (At small sample sizes a 52/48 split is unremarkable; at scale it's a red flag, which is exactly what the test below formalises.) Run a chi-squared test on your observed assignment counts. If the split is significantly different from intended, your results are compromised and you need to investigate before drawing conclusions.
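The check itself is two lines of scipy. Counts below are invented; note that SRM checks are often run at a stricter threshold than 0.05, since you'll perform one on every test.

```python
# Sample ratio mismatch check: was a 50/50 split actually delivered?
from scipy.stats import chisquare

observed = [5_200, 4_800]        # visitors actually assigned to A and B
expected = [5_000, 5_000]        # what a 50/50 split should have produced

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared = {stat:.1f}, p = {p_value:.5f}")  # p well below 0.001 here
if p_value < 0.001:
    print("Sample ratio mismatch: investigate before trusting any results.")
```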
Pre-Specify Everything
Primary metric. Secondary metrics. Minimum detectable effect. Sample size. Test duration. Significance threshold. All of it, before the test launches. This is what researchers call pre-registration, and it's the difference between a legitimate experiment and a post-hoc story you told yourself using data.
Once results are in, use the Statistical Significance Calculator to evaluate them against your pre-specified threshold — not to hunt for the threshold that makes your preferred result look significant.
Pre-registration checklist
- Hypothesis: what you're changing and why you expect it to work
- Primary metric: the single number that defines success
- Secondary metrics: supporting signals you'll monitor
- Minimum detectable effect: the smallest lift worth shipping
- Required sample size: calculated from your baseline and MDE
- Test duration: at least one full business cycle
- Significance threshold: set at 95% (p<0.05) before launch
The Maths Bit (You Can't Skip This One)
What a P-Value Actually Means
We've already seen p-values invoked (and abused) throughout this article. Here's the precise definition: the p-value tells you the probability of observing a result at least as extreme as yours, assuming the null hypothesis is true — i.e., assuming there is no real effect.
It does not tell you the probability that your hypothesis is true. It does not tell you the probability that the result occurred by chance. p=0.05 means: if there were genuinely no difference between your variants, you'd still see a result this large about 5% of the time due to random variation. That's it. It's a statement about the data under a hypothetical world, not a statement about your specific test.
Statistical Power: The Thing Most Marketers Ignore
Power is the probability that your test will detect a real effect, given that one actually exists. At 80% power — the conventional minimum — you have a 20% chance of running a valid test on a real improvement and concluding there's no effect. That's not a great number, but it's the baseline.
Low-powered tests are dangerous not because they produce false positives, but because they produce false negatives — you make a real improvement to your site, the test says "no significant effect," and you roll it back. Or worse, you keep peeking until random noise produces a positive result, and you ship something that doesn't actually work.
Power is determined by three things: sample size (bigger is better), effect size (larger effects are easier to detect), and significance threshold (the stricter the threshold, the more sample you need to hold power constant). The only one you typically control is sample size, which is why calculating it upfront matters so much.
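Which makes the inverse calculation worth running too: given the sample you can realistically collect, what power do you actually have? A sketch with placeholder numbers:

```python
# Achieved power at various per-group sample sizes, for a 5% baseline
# and a +10% relative lift (both placeholders).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, lift = 0.05, 0.10
effect = proportion_effectsize(baseline * (1 + lift), baseline)

for n_per_group in (5_000, 15_000, 31_000):
    power = NormalIndPower().solve_power(
        effect_size=effect, nobs1=n_per_group,
        alpha=0.05, alternative="two-sided",
    )
    print(f"n = {n_per_group:>6,} per group -> power = {power:.0%}")
```

If the answer comes back well below 80%, you know before launching that the test will miss a real effect more often than the convention tolerates.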
One-Tailed vs Two-Tailed Tests
A two-tailed test asks: "Is variant B different from control, in either direction?" A one-tailed test asks: "Is variant B better than control?"
The temptation to use one-tailed tests is real — they require a smaller sample size to reach significance, and you're usually only interested in improvements, not degradations. The problem is that one-tailed tests are only appropriate when it is genuinely impossible, or theoretically incoherent, for the effect to go in the other direction. In practice, almost any change to a website could make things worse, which means the two-tailed test is almost always the right choice.
Using a one-tailed test because it's more likely to reach significance faster is not a methodological choice — it's a way of lowering your effective significance threshold without admitting you've done it. Use two-tailed tests. Be honest with yourself. If your change makes things worse, you want to know.
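To see the temptation in numbers, run both framings on the same invented data:

```python
# One test, two framings. Counts are illustrative.
from statsmodels.stats.proportion import proportions_ztest

conversions = [530, 480]         # variant B, control A
visitors = [10_000, 10_000]

_, p_two = proportions_ztest(conversions, visitors, alternative="two-sided")
_, p_one = proportions_ztest(conversions, visitors, alternative="larger")

print(f"Two-sided p = {p_two:.3f}")  # ~0.107: not significant
print(f"One-sided p = {p_one:.3f}")  # ~0.053: suddenly 'nearly there'
```

When the observed effect points in the favoured direction, the one-sided p-value is exactly half the two-sided one. That halving is the entire appeal, and it is sin number five wearing a lab coat.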
A Note of Grudging Optimism
Here's the thing: good A/B testing is not actually hard. It requires discipline more than it requires expertise. Define your hypothesis. Calculate your sample size. Run the test for long enough. Don't peek. Don't adjust the rules when you don't like the results. Evaluate against pre-specified criteria.
That's it. That's the whole practice. The fact that so many tests in the wild fail to meet even this basic bar is not a testament to the difficulty of statistics — it's a testament to the difficulty of slowing down in a culture that rewards shipping fast and declaring wins.
The marketers who do this well are not necessarily the ones with the strongest statistical backgrounds. They're the ones who've internalised that a bad test is worse than no test — because a bad test gives you false confidence, and false confidence at scale is expensive.
Run fewer tests. Run them properly. The data will actually mean something.
Start here
Need to calculate whether your test has enough statistical power before you launch? Use the A/B Test Sample Size Calculator. Already have results and want to know if they're significant? Try the Statistical Significance Calculator.
Browse all tools →