The F-Test: Detecting A/B Test Interactions and Conditional Treatment Effects


The partial F-test: Solving for A/B test interactions and conditional treatment effects

16 min read • Matt Gershoff • Experimentation, Platform

Moving beyond the standard t-test in experimentation

When it comes to advanced A/B testing, product and data teams often need to measure complex A/B test interactions and conditional treatment effects (CATE) in order to understand how different users actually respond. Yet, somewhat surprisingly, almost all the analysis in experimentation platforms, even those with extra complexities such as CUPED, is ultimately reduced to some form of univariate test, most often the basic t-test. The t-test is most useful when we have a single hypothesis we want to test: for example, whether some treatment performs better than a control, or whether two simple A/B tests are interacting with one another. However, there are many cases where what is needed is a multivariate test, that is, a way to test more than one null hypothesis at the same time.

Is there such a test? Yes! It is the partial F-test for nested linear regression models. The partial F-test is perhaps one of the most useful, underappreciated, and underused approaches in experimentation. It is such a flexible framework that all of the following can be recast as partial F-tests of nested regression models:

A/B t-tests

omnibus ANOVA/ANCOVA tests

conditional treatment effects (CATE)

and even interaction effects between different concurrent A/B tests

So, if you want to be able to answer questions like “Do my A/B test results differ by customer?” and/or “Are my A/B tests interacting with each other?”, then read on!

TL;DR: The 5 steps of a partial F-test

If you already know about multivariate regression and nested models, and would just like a TL;DR version, here are the basic steps for the (homoscedastic) partial F-test:

1. Compare two regression models:
   a. A more complex ‘full’ regression model;
   b. A nested, simpler ‘partial’ regression model.

2. See how well each ‘predicts’ the outcome data based on the residual sum of squares.

3. Compare the models’ results using an F-statistic.

4. Find the upper-tail area of the F-distribution beyond the F-statistic (one minus the cumulative F-distribution evaluated at the F-statistic) to calculate the associated p-value.
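The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own implementation: it assumes NumPy and SciPy are available, the helper names `rss_and_df` and `partial_f_test` are made up for this sketch, and the toy data is hypothetical. The F-statistic is the standard one for nested models: the drop in residual sum of squares per restriction, divided by the full model's residual mean square.

```python
import numpy as np
from scipy import stats


def rss_and_df(X, y):
    """Fit OLS by least squares; return (residual sum of squares, residual df)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid), len(y) - int(np.linalg.matrix_rank(X))


def partial_f_test(X_full, X_partial, y):
    """Homoscedastic partial F-test; X_partial must be nested in X_full."""
    rss_full, df_full = rss_and_df(X_full, y)        # steps 1a and 2
    rss_part, df_part = rss_and_df(X_partial, y)     # steps 1b and 2
    q = df_part - df_full                            # number of restrictions tested
    f_stat = ((rss_part - rss_full) / q) / (rss_full / df_full)   # step 3
    p_value = float(stats.f.sf(f_stat, q, df_full))  # step 4: upper-tail area
    return f_stat, p_value


# Hypothetical toy data: 3 control and 3 treated subjects.
y = np.array([1.0, 2.0, 3.0, 5.0, 6.0, 7.0])
treat = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X_partial = np.ones((len(y), 1))                     # intercept only
X_full = np.column_stack([np.ones(len(y)), treat])   # intercept + treatment

f_stat, p_value = partial_f_test(X_full, X_partial, y)
print(f_stat, p_value)
```

The same function would apply to any pair of nested design matrices, for example a full model with treatment-by-segment columns against a partial model without them.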

If that was clear, great, jog on. If not, there are a few key concepts to cover before getting into the details of running the F-test. First, we need to review the linear model and nesting.

Understanding the linear model in A/B testing

By linear model, we mean a model that can be solved with OLS regression. These are models in the form of Y = Xβ where X is the design matrix (the data about the experiment, such as test assignment and any possible covariates) and β are the estimated weights (the things we want to learn from the experiment).

A simple A/B test in this framework can be modeled as Y = β0 + β1 * Treat, where ‘Treat’ is a binary indicator (dummy) variable encoding, with a value of ‘1’ for participants who received the treatment and ‘0’ for those in the control group.

The regression weight for treatment β1 is the estimate for the average treatment effect (ATE). The ATE is the difference in the estimated value of the outcome measure, Y, between those in the control and treatment groups.
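As a small sketch of this setup, assuming NumPy is available and using hypothetical toy numbers rather than real experiment data, we can build the design matrix for Y = β0 + β1 * Treat, solve it by least squares, and check that β1 matches the difference in group means.

```python
import numpy as np

# Hypothetical toy data, not from the article: 3 control and 3 treated subjects.
treat = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
y = np.array([1.0, 2.0, 3.0, 5.0, 6.0, 7.0])

# Design matrix X: a column of ones (the intercept) and the Treat dummy.
X = np.column_stack([np.ones_like(treat), treat])

# OLS estimate of the weights in Y = X beta.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
beta0, beta1 = beta

# The ATE computed directly as a difference in means.
diff_in_means = y[treat == 1].mean() - y[treat == 0].mean()

print(beta0, beta1, diff_in_means)
```

Here β0 comes out as the control mean and β1 as the treatment mean minus the control mean, which is why the regression weight can be read as the ATE.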

The standard t-test approach evaluates whether the null hypothesis H0 : β1 = 0 should be rejected by seeing how far β1 / St_err(β1) is from zero. When its absolute value is beyond a critical value (e.g., 1.96 for a two-sided test at the 5% level in large samples), we reject H0 (the null); otherwise we ‘fail to reject’ H0.

Let’s look at a very simple A/B test example where we have a revenue measure, and subjects have been assigned to treatment or control.

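| Revenue | Treatment |
|---|---|
| $ 4.90 | 0 |
| $ 8.52 | 0 |
| $ 6.34 | 0 |
| $ 5.37 | 0 |
| $ 2.00 | 0 |
| $ 3.93 | 0 |
| $ 8.93 | 1 |
| $ 6.15 | 1 |
| $ 10.06 | 1 |
| $ 6.62 | 1 |
| $ 8.11 | 1 |
| $ 7.95 | 1 |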

Using Excel (yeah, yeah, insert a joke about Excel, but feel free to use whatever deterministic stats software you like), I first calculated the mean values for the treatment and control groups. Then I regressed revenue onto treatment to get the following results:

The mean value for each test arm and the average Treatment effect (ATE)

| Test arm | Means |
|---|---|
| Control | $ 5.18 |
| Treatment | $ 7.97 |
| ATE | $ 2.79 |
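For readers who would rather not use Excel, here is a minimal sketch of the same calculation using only Python's standard library. The variable names are illustrative; the revenue values are the twelve rows from the data table, as printed (rounded to cents).

```python
from statistics import mean

# Revenue values from the article's data table, split by test arm.
control = [4.90, 8.52, 6.34, 5.37, 2.00, 3.93]      # Treatment = 0
treatment = [8.93, 6.15, 10.06, 6.62, 8.11, 7.95]   # Treatment = 1

control_mean = mean(control)
treatment_mean = mean(treatment)
ate = treatment_mean - control_mean                 # difference in means

print(f"Control ${control_mean:.2f}  Treatment ${treatment_mean:.2f}  ATE ${ate:.2f}")
```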

The regression ANOVA table

| ANOVA | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 23.421 | 23.421 | 6.747 | 0.027 |
| Residual | 10 | 34.715 | 3.471 | | |
| Total | 11 | 58.135 | | | |
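The sums of squares in this table can be read as a comparison of two nested models: an intercept-only model that predicts the grand mean for everyone, and the treatment model that predicts each arm's own mean. The sketch below, using only the standard library and an illustrative helper `sum_sq`, recomputes them from the revenue values as printed. Because those values are rounded to cents, the results agree with the table only to about two decimal places.

```python
from statistics import mean

control = [4.90, 8.52, 6.34, 5.37, 2.00, 3.93]
treatment = [8.93, 6.15, 10.06, 6.62, 8.11, 7.95]
revenue = control + treatment


def sum_sq(values, center):
    """Sum of squared deviations of values from center."""
    return sum((v - center) ** 2 for v in values)


# Intercept-only model: every prediction is the grand mean ('Total' row).
ss_total = sum_sq(revenue, mean(revenue))

# Intercept + Treatment model: each arm is predicted by its own mean ('Residual' row).
ss_residual = sum_sq(control, mean(control)) + sum_sq(treatment, mean(treatment))

# Improvement from adding the Treatment term ('Regression' row).
ss_regression = ss_total - ss_residual

df_regression = 1
df_residual = len(revenue) - 2
f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)

print(f"SS regression {ss_regression:.3f}, SS residual {ss_residual:.3f}, "
      f"SS total {ss_total:.3f}, F {f_stat:.3f}")
```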

And the Coefficients (regression weights) table

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | $ 5.18 | 0.761 | 6.804 | 0.000 |
| Treatment | $ 2.79 | 1.076 | 2.597 | 0.027 |
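The coefficients table can be reproduced approximately with a short regression sketch. This assumes NumPy and SciPy are available and uses the usual homoscedastic OLS standard errors; it is one way to get these numbers, not necessarily how Excel computes them internally. Since the printed revenue values are rounded to cents, expect agreement with the table to roughly two decimal places. The last lines compare the treatment t-statistic with the F-statistic from the nested-model comparison.

```python
import numpy as np
from scipy import stats

revenue = np.array([4.90, 8.52, 6.34, 5.37, 2.00, 3.93,
                    8.93, 6.15, 10.06, 6.62, 8.11, 7.95])
treat = np.array([0.0] * 6 + [1.0] * 6)

X = np.column_stack([np.ones_like(treat), treat])   # intercept + Treatment dummy
n, k = X.shape

beta, *_ = np.linalg.lstsq(X, revenue, rcond=None)
resid = revenue - X @ beta
rss_full = resid @ resid
mse = rss_full / (n - k)                            # residual mean square, df = 10

# Homoscedastic standard errors: sqrt of the diagonal of mse * (X'X)^-1.
std_err = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))
t_stats = beta / std_err
p_values = 2 * stats.t.sf(np.abs(t_stats), n - k)   # two-sided p-values

# F-statistic for the Treatment term versus the intercept-only model.
rss_partial = np.sum((revenue - revenue.mean()) ** 2)
f_stat = (rss_partial - rss_full) / mse
f_p_value = stats.f.sf(f_stat, 1, n - k)

for name, b, se, t, p in zip(["Intercept", "Treatment"], beta, std_err, t_stats, p_values):
    print(f"{name:<10} {b:6.2f} {se:6.3f} {t:6.3f} {p:6.3f}")
print(f"t^2 = {t_stats[1] ** 2:.3f}, F = {f_stat:.3f}, F p-value = {f_p_value:.3f}")
```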

A few things to notice:

The regression coefficient for the Treatment indicator variable is the same as the ATE calculated from the difference in means.

The t-statistic from the coefficients table, 2.597, is the square root of the F-statistic in the ANOVA table: sqrt(6.747) = 2.597.

And the P-values for the t-test and the F-test are exactly the same! In other words, we could (with a tiny bit of fiddling for dealing with signs in one-tailed...
