Ben Lau statistics . machine learning . programming . optimization . research

A/B Testing

3 min read


Introduction

The idea is to test for a behavioural change between two groups. If the confidence intervals of the treatment-group and control-group metrics do not overlap, we can conclude that the treatment group is better than the control group (note the converse does not hold: overlapping intervals do not by themselves rule out a significant difference). Alternatively, if the confidence interval of the cumulative treatment effect does not overlap 0, we can conclude that the treatment has a significant effect. We could use the following methods:

  • t-test
  • Regression discontinuity: point comparison, only for immediate effect
  • Difference-in-differences: aggregate the effects
  • Bayesian Structural Time Series (Causal Impact): aggregate the effects, isolate some latent factors, e.g. seasonality, trend, etc.
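As a concrete illustration of the difference-in-differences method above, the estimate can be computed directly from group means. All numbers below are hypothetical illustration data:

```python
# Difference-in-differences: compare the pre/post change in the
# treatment group against the pre/post change in the control group.

def diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Treatment effect = treatment group's change minus control group's change."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Hypothetical mean conversion rates before/after the launch:
effect = diff_in_diff(treat_pre=0.10, treat_post=0.15,
                      ctrl_pre=0.10, ctrl_post=0.12)
print(effect)  # ~0.03: the extra lift attributable to the treatment
```

The control group's change (0.02) absorbs any shared trend, so only the remaining 0.03 is attributed to the treatment.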

In Bayesian A/B testing, we can use the posterior distributions to calculate the probability that the treatment group is better than the control group: evaluate the comparison on each pair of samples drawn from the two posteriors and take the average, e.g. (blue_button_conversion_rate_samples > red_button_conversion_rate_samples).mean().
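A minimal sketch of this posterior comparison, assuming Beta(1, 1) priors on two conversion rates; the conversion counts below are hypothetical:

```python
import random

random.seed(0)

# Hypothetical data: conversions / trials for each button colour.
blue_conv, blue_n = 120, 1000
red_conv, red_n = 100, 1000

# With a Beta(1, 1) prior, the posterior of a conversion rate is
# Beta(conversions + 1, failures + 1).
def posterior_samples(conv, n, size=100_000):
    return [random.betavariate(conv + 1, n - conv + 1) for _ in range(size)]

blue_samples = posterior_samples(blue_conv, blue_n)
red_samples = posterior_samples(red_conv, red_n)

# P(blue better than red) = fraction of sample pairs where blue > red.
p_blue_better = sum(b > r for b, r in zip(blue_samples, red_samples)) / len(blue_samples)
print(f"P(blue > red) = {p_blue_better:.3f}")
```

With these counts the probability lands around 0.9, i.e. suggestive but short of a conventional 95% threshold.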


Typical steps

  1. Define the metrics - could be retention, conversion, etc., or a proxy metric, e.g. adding 7 friends in the first 10 days.
  2. Determine the sample size by specifying the false positive rate, true positive rate, baseline, and effect size - typically done with an online calculator
    • False positive rate: Type I error rate, significance level, alpha, e.g. 0.05
    • True positive rate: 1 - Type II error rate, 1 - FNR, power, recall, e.g. 0.8
    • Baseline: the current value of the metric, e.g. 0.1 conversion rate
    • Effect size: the minimum detectable effect, i.e. the smallest difference that gives practical significance, e.g. 0.02
  3. Randomize the experiment correctly - avoid imbalance between the treatment and control groups, e.g. with stratified sampling.
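The sample-size step can be sketched with the standard two-proportion formula; this is a hand-rolled approximation of what the online calculators do, and exact numbers vary slightly between calculators:

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline, effect, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided two-proportion z-test."""
    p1, p2 = baseline, baseline + effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.84 for power = 0.8
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Baseline 10% conversion, minimum detectable effect of 2 points:
print(sample_size_per_group(0.10, 0.02))  # roughly 3,800-3,900 per group
```

Note how the required n scales with 1 / effect², which is why tiny minimum detectable effects demand very large experiments.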

Notes

  1. Do not peek at the data before the experiment ends.
  2. Do not stop the experiment early.
  3. Adjust for multiple comparisons - Bonferroni correction, Holm-Bonferroni correction, etc.
  4. Statistical significance does not imply practical significance.
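The Holm-Bonferroni correction mentioned in note 3 can be sketched as a step-down procedure over the experiment's p-values:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject/accept decision for each hypothesis.

    Step down through the p-values in ascending order, comparing the
    i-th smallest against alpha / (m - i); stop at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is also accepted
    return reject

# Four metrics tested in one experiment:
print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
# → [True, False, False, True]
```

Holm's procedure is uniformly more powerful than plain Bonferroni while still controlling the family-wise error rate.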

Readings
