General Guideline for Conducting A/B Testing

Supapitch Kittisarakul
3 min read · Apr 18, 2021

Knowledge Sharing from my experience of performing A/B Testing

Steps to perform A/B Testing

  • Ensure you have the data from the current environment before starting the experiment.
  • Choose one feature/factor to be tested: all other factors should stay the same in both variants. For instance, changing the UI by adding one page with property details, or changing the color of the headline on the company website
  • Identify the specific samples to be tested on: who is the target population? E.g. “landlord agents” should be one of the samples when the experiment changes the UI for property details
  • Identify the “Metrics” or “Overall Evaluation Criterion (OEC)” to be measured and compared between the two variants; this will typically depend on the selected feature/factor (from the previous bullet). Examples: “time duration”, “#sessions per user”, “conversion rate”, “income per day”, etc.

Be cautious about the downsides of any particular metric.

  • Also specify the Experimental Unit, which corresponds to the chosen metric, such as “user” (i.e. “renter”, “agent”), “session”, or “group of users”
  • State the hypotheses for this A/B test. For example, Null Hypothesis (H0): the means of the metric in the two groups are essentially the same. Alternative Hypothesis (Ha): the means of the metric are statistically significantly different
  • Calculate the Minimum Sample Size (MSS) required → this determines how long the experiment must run. We need the following variables to calculate the MSS in each group (see the sketch after the calculator links below):

Significance level (alpha): commonly use alpha = 0.05

Statistical Power (1-beta): commonly use 1-beta = 0.8

Variance of the two groups

Minimum Detectable Effect: delta = |mean1 - mean2| or |proportion1 - proportion2| (proportions expressed as decimals)

https://www.stat.ubc.ca/~rollin/stats/ssize/n2.html

https://www.calculator.net/sample-size-calculator.html
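A minimal sketch (in Python, using statsmodels) of how these variables combine into a per-group sample size; the baseline numbers below are hypothetical placeholders, not figures from a real experiment:

```python
from statsmodels.stats.power import NormalIndPower, TTestIndPower
from statsmodels.stats.proportion import proportion_effectsize

ALPHA = 0.05   # significance level
POWER = 0.80   # statistical power = 1 - beta

# --- Metric is a mean (e.g. time duration) ---
baseline_std = 4.0           # assumed pooled standard deviation of the metric
min_detectable_effect = 0.5  # delta = |mean1 - mean2| we want to detect
effect_size = min_detectable_effect / baseline_std  # standardized effect (Cohen's d)
n_mean = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=ALPHA, power=POWER, ratio=1.0
)

# --- Metric is a proportion (e.g. conversion rate) ---
p_control, p_target = 0.10, 0.12  # assumed baseline and hoped-for conversion rates
effect_size_prop = proportion_effectsize(p_target, p_control)
n_prop = NormalIndPower().solve_power(
    effect_size=effect_size_prop, alpha=ALPHA, power=POWER, ratio=1.0
)

print(f"Minimum sample size per group (mean metric): {n_mean:.0f}")
print(f"Minimum sample size per group (conversion rate): {n_prop:.0f}")
```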

  • Split the samples into two groups equally and randomly
  • Start the experiment
  • Verify whether there is a sample-ratio mismatch: compare the sample sizes of the control and target groups (they should be similar); a chi-square goodness-of-fit test can check this (see the sketch below)

https://www.gigacalculator.com/calculators/chi-square-calculator.php?test=goodnessoffit&data=15752+0.5%0D%0A15257+0.5

If the p-value >= 0.05 → fine. Otherwise, continue running the experiment until the proportions become similar.
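For example, a minimal sketch of this check in Python with scipy, using the same two observed counts as the calculator link above and an expected 50/50 split:

```python
from scipy.stats import chisquare

# Observed user counts in the control and target groups
# (the same numbers used in the calculator link above).
observed = [15752, 15257]
expected = [sum(observed) * 0.5, sum(observed) * 0.5]  # expected 50/50 split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p-value = {p_value:.4f}")
# p-value >= 0.05 -> no evidence of sample-ratio mismatch
```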

  • Analyze the data after running the experiment for at least the required amount of time. Compute the Confidence Interval (CI) of the difference (in the selected metric) between the control and target groups, using distinct formulas for equal vs. unequal variances and for an absolute difference (e.g. a mean as the metric) vs. a percentage difference (e.g. conversion rate as the metric). Alternatively, conduct a statistical t-test or z-test to identify whether the new feature causes any statistically significant difference (see the sketch after the reference link below)

https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_confidence_intervals/bs704_confidence_intervals5.html
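As a rough illustration, here is a Python sketch (scipy + statsmodels) of both cases: Welch's t-test with a ~95% CI of the difference for a mean-type metric, and a two-proportion z-test with a CI for a conversion-rate metric. All the numbers below are synthetic/hypothetical, just to show the shape of the calculation:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

rng = np.random.default_rng(42)

# --- Absolute difference: metric is a mean (e.g. time duration) ---
# Synthetic placeholder data standing in for the logged metric values.
control = rng.normal(loc=5.0, scale=2.0, size=1000)
target = rng.normal(loc=5.2, scale=2.0, size=1000)

t_stat, p_val = stats.ttest_ind(target, control, equal_var=False)  # Welch's t-test (unequal variances)
diff = target.mean() - control.mean()
se = np.sqrt(target.var(ddof=1) / len(target) + control.var(ddof=1) / len(control))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se  # ~95% CI of the difference
print(f"mean metric: diff={diff:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f}), p={p_val:.4f}")

# --- Percentage difference: metric is a conversion rate ---
conversions = np.array([1890, 1680])  # hypothetical conversions in target, control
visitors = np.array([15752, 15257])   # group sizes from the SRM example above
z_stat, p_prop = proportions_ztest(conversions, visitors)
ci_prop = confint_proportions_2indep(conversions[0], visitors[0],
                                     conversions[1], visitors[1])
print(f"conversion rate: z={z_stat:.2f}, p={p_prop:.4f}, 95% CI of diff={ci_prop}")
```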

Rules of thumb for running A/B Testing

  • Test one feature change at a time
  • Each experimental unit in the sample should be assigned to one group randomly, and that unit should encounter the same variant throughout the experiment (keep it consistent).
  • Ensure there is sufficient statistical power

Good Practice before running A/B Testing

  • Have the relevant data from the control group before running the experiment.
  • Conduct an A/A test: divide the sample into two groups and serve both of them the current variant. The metric values should not be statistically significantly different: a t-test or z-test should not be significant (i.e. unable to reject the null hypothesis), or the confidence interval of the difference in the metric should include zero. This verifies that the current variant itself does not cause any difference in the target population (see the sketch below).
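A tiny sketch of an A/A check on synthetic data; with both halves drawn from the same (hypothetical) distribution, the t-test should fail to reject the null hypothesis most of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical metric values collected under the current (unchanged) variant.
metric = rng.normal(loc=5.0, scale=2.0, size=2000)

# Randomly split the sample into two "A" groups, both served the same variant.
shuffled = rng.permutation(metric)
group_a1, group_a2 = shuffled[:1000], shuffled[1000:]

t_stat, p_value = stats.ttest_ind(group_a1, group_a2, equal_var=False)
print(f"A/A test p-value = {p_value:.3f}")  # expected >= 0.05 most of the time
```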

Warnings for special cases of A/B Testing

  • “Seasonality effect”: weekday vs. weekend behavior differs → run the experiment for full weeks. User engagement also differs during holidays
  • “Novelty effect”: user engagement may be too high (users are hyped about the new feature) or too low (users take time to learn the new feature)
  • “Cannibalization”: the metric value in the target group overcounts relative to the control group. For example, having rental price suggestions in the target group → users engage more in the target group, but when the feature is pushed to 100% in production, user engagement doesn’t increase as expected.
  • “Network effect violation”: when launching a new feature on a social network, some users have the feature while their friends don’t. Define a group of users in the same network (or demographic) as the experimental unit
  • “Bot/robot filtering”: don’t forget to filter non-user data out
  • “Click-Through Rate (CTR)”: be careful when using it; it tends to violate the standard i.i.d. assumptions and is hard to relate to real user behavior

Other limitations of A/B Testing in general

  • No obvious explanation of the causality: causality is inferred only by analyzing the relationship between the independent variable and the resulting dependent metric
  • “Long-term effect”: the experiment result describes only the short-term causality
