The small sample trap in A/B testing

Averages lie | Mustapha Hadid

Suppose you ran an A/B test on the signup page of your app. You wanted to ship a new version, but you weren't sure it was actually better, so you decided to A/B-test it. You ran the experiment and got these numbers:

Baseline conversion: 3%
New version conversion: 3.5%

Things look good. The conversion rate is higher, so the new version must be better, right? Well, not necessarily.

Let's say there are three realities of this experiment: three companies, the exact same experiment and result, but a different number of visitors in each. The first had 100 visits per version (the original and the new signup page), 200 visits in total. The second had 1,000 per version. The third had 10,000 per version.

Which reality has the highest chance that the new version is actually better? By intuition, most would agree it should be the one with 10,000 visitors per version. By the end of this post, we'll quantify the probability that the new version is actually better in each of these realities. More importantly, we'll explore why averages can deceive us into conclusions the data don't warrant, potentially leaving us worse off than before.

A quick intuition first

Before any math, let's start with an intuition. Say you flip a coin 4 times and get 3 heads; that's 75% heads. Would you assume the coin is biased? Absolutely not: 3 out of 4 happens all the time with a fair coin. Now say you flip the same coin 4,000 times and get 3,000 heads. Same 75%, but how do you feel this time? The coin can't possibly be fair, right?

That's the law of large numbers: as you take more samples (flip more coins), the sample average (the observed proportion of heads) converges to the true average (50% for a fair coin). The more samples, the closer the observed average (the one we measure) gets to the real one.
Hence, with 10 samples, the proportion of heads could be 70%; with 100 samples, maybe 55%; but as we increase the number of samples further, it should approach ~50%.

Back to the example: when we flipped the coin 4 times, the proportion of heads was 75%, and that was totally plausible. For the 4,000-flip case, it's almost impossible to get that proportion with a fair coin; the probability is so low that we can reasonably conclude the coin is not fair, and we would almost certainly be right.

You should see now why the average alone tells very little. The average plus the sample size reveals whether the observed average is signal or noise.

Same thing with A/B tests. We're going to walk through why 3% (original) vs 3.5% (experiment) could mean different things depending on the number of visitors. It could mean the new version is better, but it could also mean the experiment is just measuring noise, and the actual effect cannot be determined from the data: maybe better, maybe neutral, maybe even detrimental. Without a sufficient sample size, you can never be sure enough of the true effect.

NOTE: Since we're dealing with a binary dataset (each visitor either converts or doesn't, 1 or 0), proportion = average. I'll use both terms interchangeably. Keep in mind that the concepts we're applying here to proportions (conversion rates) apply to averages in general.
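To put numbers on the coin intuition above, here is a minimal sketch in Python (my choice of language, not something the post prescribes). It computes the exact probability that a fair coin gives at least a given number of heads, using the binomial distribution. The function name `prob_at_least` is made up for this illustration.

```python
from math import comb


def prob_at_least(heads: int, flips: int) -> float:
    """P(X >= heads) for X ~ Binomial(flips, 0.5), i.e. a fair coin.

    Counts the favourable outcomes exactly with integers, then divides
    by the total number of equally likely outcomes (2 ** flips).
    """
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favourable / 2**flips


# At least 3 heads in 4 flips: quite common for a fair coin.
print(prob_at_least(3, 4))

# At least 3,000 heads in 4,000 flips: astronomically unlikely.
print(prob_at_least(3000, 4000))
```

The first call gives 0.3125 (5 of the 16 equally likely outcomes), so a 75% result from 4 flips is unremarkable. The second is smaller than 1e-200, which is why the same 75% from 4,000 flips is convincing evidence against a fair coin.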

The standard error

The new "3.5%" rate we observed in the experiment is the conversion rate of the visitors who happened to land in the experiment. The thing you actually care about, the rate you'd see if all your future visitors used this version, is unobservable. You're trying to estimate it from a sample. If you ran the same A/B test twice, on two different groups of visitors, you'd get slightly different numbers each time. So the question is: how can you tell whether the difference in conversion rate is due to the change you made and not just the expected noise in the measurement?

The standard error of a sample average/proportion tells you, roughly, how much that proportion would change if you re-ran the experiment, purely by chance: same signup page, same number of visitors, just different people. For proportions, like the conversion rate, the formula is:

$$ SE = \sqrt{\frac{p(1-p)}{n}} $$

where n is the number of visitors and p is the observed proportion; you can also think of p as the probability that any random visitor converts. Let's compute it for the baseline conversion rate of reality 1 (100 visitors):

$$ SE = \sqrt{\frac{0.03 \cdot 0.97}{100}} \approx 0.0171 $$

That is about 1.71 percentage points. The SE is the standard deviation of the sample proportion across repeated experiments. What this tells you, essentially, is that if you repeat the measurement for the original signup page with exactly 100 visitors, the result is more likely than not to fall within 3% ± SE (1.29% to 4.71%). Any difference between those repeated measurements is driven by noise; absolute noise, zero signal.
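As a quick check of the formula above, the following sketch computes the standard error of the 3% baseline for the three visitor counts used in this post (100, 1,000, and 10,000 per version). The helper name `standard_error` is an assumption made for this example.

```python
from math import sqrt


def standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion p measured on n visitors."""
    return sqrt(p * (1 - p) / n)


baseline = 0.03  # observed baseline conversion rate from the post

for visitors in (100, 1_000, 10_000):
    se = standard_error(baseline, visitors)
    low, high = baseline - se, baseline + se
    print(f"n={visitors:>6}: SE = {se:.2%}, p ± SE = {low:.2%} to {high:.2%}")
```

The SE comes out to roughly 1.71, 0.54, and 0.17 percentage points respectively. Because n sits under a square root, 100 times more visitors shrinks the noise by a factor of 10, not 100.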

As shown by the graph, the range p ± SE covers ~68.2% of outcomes (this figure comes from the normal approximation to the sampling distribution). Simply put, if you repeated the measurement 100 times, about 68 of the 100 conversion rates would fall within that range; the remaining 32 or so would fall outside it. If...
