Ben Lau statistics . machine learning . programming . optimization . research

A/B Testing

3 min read


Introduction

The idea is to test for a behavioural change between two groups. If the confidence intervals of the treatment-group and control-group metrics do not overlap, we can conclude that the treatment group is better than the control group (note the converse does not hold: overlapping intervals do not by themselves rule out a significant difference). Alternatively, if the confidence interval of the cumulative treatment effect does not overlap 0, we can conclude that the treatment has a significant effect. We could use the following methods:

  • t-test
  • Regression discontinuity: point comparison, only for immediate effect
  • Difference-in-differences: aggregate the effects
  • Bayesian Structural Time Series (Causal Impact): aggregate the effects, isolate some latent factors, e.g. seasonality, trend, etc.
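As a concrete illustration of the difference-in-differences method above, the estimate can be computed directly from group means. All numbers below are hypothetical illustration data:

```python
# Difference-in-differences: compare the pre/post change in the
# treatment group against the pre/post change in the control group.

def diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Treatment effect = treatment group's change minus control group's change."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Hypothetical mean conversion rates before/after the launch:
effect = diff_in_diff(treat_pre=0.10, treat_post=0.15,
                      ctrl_pre=0.10, ctrl_post=0.12)
print(effect)  # ~0.03: the extra lift attributable to the treatment
```

The control group's change (0.02) absorbs any shared trend, so only the remaining 0.03 is attributed to the treatment.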

In Bayesian A/B testing, we can use the posterior distributions to calculate the probability that the treatment group is better than the control group: evaluate the comparison on each pair of samples drawn from the two posteriors and take the average, e.g. (blue_button_conversion_rate_samples > red_button_conversion_rate_samples).mean().
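A minimal sketch of this posterior comparison, assuming Beta(1, 1) priors on two conversion rates; the conversion counts below are hypothetical:

```python
import random

random.seed(0)

# Hypothetical data: conversions / trials for each button colour.
blue_conv, blue_n = 120, 1000
red_conv, red_n = 100, 1000

# With a Beta(1, 1) prior, the posterior of a conversion rate is
# Beta(conversions + 1, failures + 1).
def posterior_samples(conv, n, size=100_000):
    return [random.betavariate(conv + 1, n - conv + 1) for _ in range(size)]

blue_samples = posterior_samples(blue_conv, blue_n)
red_samples = posterior_samples(red_conv, red_n)

# P(blue better than red) = fraction of sample pairs where blue > red.
p_blue_better = sum(b > r for b, r in zip(blue_samples, red_samples)) / len(blue_samples)
print(f"P(blue > red) = {p_blue_better:.3f}")
```

With these counts the probability lands around 0.9, i.e. suggestive but short of a conventional 95% threshold.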


Typical steps

  1. Define the metrics - could be retention, conversion, etc., or a proxy metric, e.g. adding 7 friends in the first 10 days.
  2. Determine the sample size by specifying the false positive rate, true positive rate, baseline, and effect size - typically done with an online calculator
    • False positive rate: Type I error rate, significance level, alpha, e.g. 0.05
    • True positive rate: 1 - Type II error rate, 1 - FNR, power, recall, e.g. 0.8
    • Baseline: the current value of the metric, e.g. 0.1 conversion rate
    • Effect size: the minimum detectable effect, i.e. the smallest difference that gives practical significance, e.g. 0.02
  3. Randomize the experiment correctly - avoid imbalance between the treatment and control groups, e.g. with stratified sampling.
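The sample-size step can be sketched with the standard two-proportion formula; this is a hand-rolled approximation of what the online calculators do, and exact numbers vary slightly between calculators:

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline, effect, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided two-proportion z-test."""
    p1, p2 = baseline, baseline + effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.84 for power = 0.8
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Baseline 10% conversion, minimum detectable effect of 2 points:
print(sample_size_per_group(0.10, 0.02))  # roughly 3,800-3,900 per group
```

Note how the required n scales with 1 / effect², which is why tiny minimum detectable effects demand very large experiments.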

Notes

  1. Do not peek at the data before the experiment ends.
  2. Do not stop the experiment early.
  3. Adjust for multiple comparisons - Bonferroni correction, Holm-Bonferroni correction, etc.
  4. Statistical significance does not imply practical significance.
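The Holm-Bonferroni correction mentioned in note 3 can be sketched as a step-down procedure over the experiment's p-values:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject/accept decision for each hypothesis.

    Step down through the p-values in ascending order, comparing the
    i-th smallest against alpha / (m - i); stop at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is also accepted
    return reject

# Four metrics tested in one experiment:
print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
# → [True, False, False, True]
```

Holm's procedure is uniformly more powerful than plain Bonferroni while still controlling the family-wise error rate.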

Readings
