The small sample trap in A/B testing


Averages lie | Mustapha Hadid

Suppose you ran an A/B test on the signup page of your app. You wanted to ship a new version, but you weren't sure if it was actually better, so you decided to A/B-test it. You ran the experiment and got these numbers:

Baseline conversion: 3%
New version conversion: 3.5%

Things look good. The conversion rate is higher, so the new version must be better, right? Well, not necessarily.

Let's say there are three realities of this experiment. Three companies, the exact same experiment and result, but a different number of visitors in each. The first had 100 visits per version - the original and the new signup page; 200 visits in total. The second had 1,000 per version. The third had 10,000 per version.

Which reality has the highest chance that the new version is actually better? By intuition, most would agree it should be the 10,000-visitor one. By the end of this post, we'll quantify the probability that the new version is actually better in each of these realities. More importantly, we'll explore why averages can deceive us into conclusions the data don't warrant, potentially leaving us worse off.

A quick intuition first

Before any math, let's start with an intuition. You flip a coin 4 times and get 3 heads; the proportion of heads is 75%. Would you assume the coin is biased? Absolutely not: 3 out of 4 happens all the time with a fair coin. Now you flip the same coin 4,000 times and get 3,000 heads. Same 75%, but how do you feel this time? The coin can't possibly be fair, right?

That's the law of large numbers: as you take more samples (flip more coins), the sample average (the observed proportion of heads) converges to the true average (50% for a fair coin). The more samples you take, the closer the observed average (the one we measure) gets to the real one. Hence, with 10 samples the proportion of heads could be 60%; with 100 samples, maybe 55%; but as we keep increasing the sample size, it should approach ~50%.

Back to the example: when we flipped the coin 4 times, the proportion of heads was 75%, and that was totally plausible. In the 4,000-flip case, it's almost impossible to get that proportion with a fair coin; the probability is so low that we can reasonably conclude the coin is not fair, and we would almost certainly be right.

You should see now why the average alone tells very little. The average plus the sample size reveals whether the observed average is signal or noise.

Same thing with A/B tests. We're going to walk through why 3% (original) vs 3.5% (experiment) could mean different things depending on the number of visitors. It could mean the new version is better, but it could also mean the whole experiment is just measuring noise, and the actual effect is unknown from the data: maybe better, neutral, or even detrimental. Without a sufficient sample size, you can never be sure enough of the true effect.

NOTE: Since we're dealing with a binary dataset, proportion = average. I'll use both terms interchangeably. Keep in mind that the concepts we're applying here to proportions (conversion rates) apply to averages in general.
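To put a number on the coin intuition from this section, here is a minimal sketch that computes the exact chance of seeing at least 75% heads from a fair coin. It assumes a fair coin only, and the helper name `prob_at_least` is made up for illustration. Exact integer arithmetic is used because the 4,000-flip probability is too small for a naive floating-point product.

```python
from math import comb


def prob_at_least(k: int, n: int) -> float:
    """Probability of k or more heads in n flips of a FAIR coin.

    Counts the favorable outcomes exactly with integers, then divides
    by the 2**n equally likely outcomes.
    """
    favorable = sum(comb(n, i) for i in range(k, n + 1))
    return favorable / 2**n


small = prob_at_least(3, 4)        # 3 or more heads in 4 flips
large = prob_at_least(3000, 4000)  # 3,000 or more heads in 4,000 flips

print(f"4 flips:     {small:.4f}")
print(f"4,000 flips: {large:.3e}")
```

With 4 flips the result is 0.3125, so 75% heads is unremarkable. With 4,000 flips the same 75% has a probability that is vanishingly small, which is why the sample size changes the conclusion even though the average does not.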

The standard error

The new "3.5%" rate we observed from the experiment is the conversion rate of the visitors who happened to land in the experiment. The thing you care about, the rate you'd see if all your future visitors used this version, lies in the future and is unobservable. You're trying to guess it from a sample.

If you ran the same A/B test twice, on two different groups of visitors, you'd get slightly different numbers each time. So the question is how to tell whether the difference in conversion rate is due to the change being tested or just the expected noise in the measurement.

The standard error of a sample average/proportion tells you, roughly, how much that proportion would change if you re-ran the experiment, just by pure chance. Same signup page with the same number of visitors, just different people.

For proportions, like the conversion rate, the formula is:

$$ SE = \sqrt{\frac{p(1-p)}{n}} $$

Where n is the number of visitors and p is the observed proportion; you can also think of p as the probability that any random visitor would convert.

Let's compute it for the baseline conversion rate of reality 1 (100 visitors):

$$ SE = \sqrt{\frac{0.03 \cdot 0.97}{100}} \approx 0.0171 $$

That is about 1.71 percentage points. The SE is the standard deviation of the sample average across repeated experiments.

What this tells you, essentially, is that if you just repeat the measurement for the original signup page with exactly 100 visitors, the result is more likely than not to fall within 3% ± SE (1.29% to 4.71%). The difference is driven by noise; pure noise, zero signal.
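The formula above can be sketched in a few lines of Python to see how the noise band shrinks across the three realities. The function name `standard_error` and the loop are illustrative assumptions, not code from the original post; only the 3% baseline and the three sample sizes come from the text.

```python
from math import sqrt


def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p observed over n visitors."""
    return sqrt(p * (1 - p) / n)


baseline = 0.03  # baseline conversion rate from the example

for n in (100, 1_000, 10_000):
    se = standard_error(baseline, n)
    low, high = baseline - se, baseline + se
    print(f"n={n:>6}: SE = {se:.4f}  ->  {low:.2%} to {high:.2%}")
```

Because n sits under a square root, 100 times more visitors only narrows the band by a factor of 10.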

As shown by the graph, the range p ± SE covers ~68.2% of outcomes. Simply put, if you repeated the measurement 100 times, about 68 of the 100 conversion rates would fall within that range; the remaining 32 would fall outside it. If...
