A/B Testing
Resources
- udemy | Bayesian Machine Learning in Python: A/B Testing
- udemy | A/B Testing and Experimentation for Beginners
- udemy | Coding for A/B testing: Run more AB tests, find more winners
- Selecting the best artwork for videos through A/B testing
Introduction
The idea is to test for a behavioural change between two groups. If the confidence intervals of the treatment and control group metrics do not overlap, we can conclude that the treatment group is better than the control group. Alternatively, if the confidence interval of the cumulative treatment effect does not overlap with 0, we can conclude that the treatment has a significant effect. We could use the following methods:
- t-test
- Regression discontinuity: point comparison, only captures the immediate effect
- Difference-in-differences: aggregate the effects
- Bayesian Structural Time Series (Causal Impact): aggregate the effects, isolate some latent factors, e.g. seasonality, trend, etc.
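For the simplest case, a t-test can be run directly on a per-user metric. A minimal sketch on simulated data (the metric, means, and sample sizes are hypothetical), using Welch's t-test so equal variances are not assumed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-user metric (e.g. session length in minutes)
control = rng.normal(loc=10.0, scale=2.0, size=500)
treatment = rng.normal(loc=10.6, scale=2.0, size=500)

# Welch's t-test: does not assume equal variances between groups
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value here only says the difference is unlikely under the null; whether a 0.6-minute lift matters is a separate, practical question.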
In Bayesian A/B testing, we can use the posterior distributions to calculate the probability that the treatment group is better than the control group: evaluate the comparison on each pair of samples drawn from the posteriors and take the average, e.g. `(blue_button_conversion_rate_samples > red).mean()`.
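A minimal sketch of this posterior comparison for conversion rates, assuming a Beta(1, 1) prior and hypothetical click counts:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed data: clicks / impressions for each button colour
blue_clicks, blue_n = 120, 1000
red_clicks, red_n = 100, 1000

# Beta(1, 1) prior + binomial likelihood -> Beta posterior for each rate
blue_samples = rng.beta(1 + blue_clicks, 1 + blue_n - blue_clicks, size=100_000)
red_samples = rng.beta(1 + red_clicks, 1 + red_n - red_clicks, size=100_000)

# P(blue beats red): fraction of posterior draws where blue's rate is higher
p_blue_better = (blue_samples > red_samples).mean()
print(round(p_blue_better, 3))
```

The same pairwise-comparison trick works for any derived quantity, e.g. `(blue_samples - red_samples > 0.01).mean()` for the probability the lift exceeds one percentage point.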
References:
- Bayesian A/B Testing
- Introduction to Bayesian A/B Testing
- Microsoft Experimentation Platform Publications
Typical steps
- Define the metrics - could be retention, conversion, etc. or could be a proxy metric, e.g. adding 7 friends in first 10 days.
- Determine the sample size by specifying the false positive rate, true positive rate, baseline, and effect size - typically done with an online calculator
- False positive rate: Type I error rate, significance level, alpha, e.g. 0.05
- True positive rate: 1 - Type II error rate, 1 - FNR, power, recall, e.g. 0.8
- Baseline: the current value of the metric, e.g. 0.1 conversion rate
- Effect size: the minimum detectable effect, i.e. the smallest magnitude of difference that would be practically significant, e.g. 0.02
- Randomize the experiment correctly - avoid imbalance between the treatment and control groups; consider stratified sampling.
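The sample-size step above can be sketched with the standard formula for a two-sided two-proportion z-test; this stdlib-only version is an approximation (the function name and defaults are illustrative), and an online calculator may differ slightly:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(baseline, mde, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided two-proportion z-test."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power=0.8
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# 10% baseline conversion, 2-percentage-point minimum detectable effect
print(sample_size_per_group(0.1, 0.02))  # roughly 3.8k users per group
```

Note how the required n blows up as the effect size shrinks: halving the MDE roughly quadruples the sample size.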
Notes
- Do not peek at the data before the experiment ends.
- Do not stop the experiment early.
- Adjust for multiple comparisons - Bonferroni correction, Holm-Bonferroni correction, etc.
- Statistical significance does not imply practical significance.
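The Holm-Bonferroni correction mentioned above is a short step-down procedure; a minimal stdlib-only sketch (the function name and example p-values are illustrative):

```python
# Holm-Bonferroni step-down: compare the i-th smallest p-value against
# alpha / (m - i), stopping at the first test that fails.
def holm_bonferroni(p_values, alpha=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return rejected

# Three metrics tested in the same experiment (hypothetical p-values)
print(holm_bonferroni([0.03, 0.04, 0.005]))  # -> [False, False, True]
```

Holm-Bonferroni is uniformly more powerful than plain Bonferroni (which would compare every p-value against alpha / m) while still controlling the family-wise error rate.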
Readings
- Crash Course On A/B Testing For Product Managers
- Find the Key to Your App’s Growth Without an Army of Data Scientists - Facebook successfully used “adding 7 friends in first 10 days” as a proxy metric to predict user retention after 2 months.