
How I Became Agoda’s A/B Testing Police


anthony@arithmox.ai
Plan of Attack

1. My Data Science Journey

2. Challenges to Effective A/B Testing


1. My Data Science Journey
BA/MPhil in Economics @ Cambridge

● Better described as Mathematical Economics.

● Specialised in economic theory and econometrics.

● Wrote two theses on unemployment persistence.

● Internship at Bank Indonesia in 2013.


MSc in Applied Statistics @ Oxford

● Students expected to be experts in R.


● Emphasised Bayesian statistics and
high-performance computing.
● Supervised by Yee Whye Teh for a dissertation on
Bayesian Neural Networks.
Algorithmic Trader @ Seamless

● Secretive trading startup in Cambridge.


● High emphasis on Python for ML.
● Competitive supervised-learning problems.
● Best practices: Agile, Clean Code, TDD.
Lead Data Scientist @ Agoda
● Automated bidding for MSE.
● Personalised gift cards.
● Experiment Platform:
→ Experimental design and guidelines.
→ Formulating A/B tests as KPIs.
→ Taught “Statistics for Product Owners.”
Co-Founder @ Arithmox

● Optimisation and automation specialists.


→ Emphasis on measurement and impact.
→ Start simple, then iterate!
→ North star: total financial impact on clients.

● Focus on manufacturing and agriculture.


2. Challenges to Effective A/B Testing
Recap: A/B Test Setup
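As a refresher, here is a minimal sketch of a standard two-sample setup (illustrative only, not Agoda's internal tooling): users are split between sides A and B, the observed lift is the difference in conversion rates, and significance comes from a two-proportion z-test. All numbers are made up.

```python
import numpy as np
from scipy.stats import norm

# Illustrative two-sample A/B setup; all numbers are made up.
n_a, conv_a = 10_000, 1_000        # clickers and conversions on side A
n_b, conv_b = 10_000, 1_080        # clickers and conversions on side B

p_a, p_b = conv_a / n_a, conv_b / n_b
lift = p_b - p_a                   # observed lift

# Two-proportion z-test with a pooled standard error.
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = lift / se
p_value = 2 * (1 - norm.cdf(abs(z)))

print(f"lift={lift:.4f}, z={z:.2f}, p={p_value:.3f}")
```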
What Do We A/B Test?

Test categories:

a. Feature changes
b. Bug fixes
c. Technical tasks
d. Negative experiments
Agoda’s Data-Driven Culture

● Test results are KPIs:


→ Failed tests are diagnosed.
→ Again: no win, no production!

● “In God we trust, all others bring data!”


→ The data is not wrong.
→ No gut-feeling decisions.
Effective Experimentation

Must contribute to these objectives:


✓ Accelerate growth.
✓ Learning for future experiments.
✓ Performance accountability.
✓ Automated stop-loss as a safety net (sketch below).
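The slides do not specify the stop-loss rule, but one common pattern is to kill a test early once we are confident its lift is below a harm threshold. A hypothetical sketch (the function name, threshold, and confidence level are all illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical stop-loss rule, not Agoda's actual implementation:
# kill the test if even the optimistic upper bound on the lift
# sits below a harm threshold of -1%.
def should_stop(conv_a, n_a, conv_b, n_b, harm=-0.01, alpha=0.05):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    upper = (p_b - p_a) + norm.ppf(1 - alpha) * se  # one-sided upper bound
    return upper < harm

print(should_stop(conv_a=1_000, n_a=10_000, conv_b=800, n_b=10_000))
```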
A/B Testing As An Organisation

In an organisation with X00 people and X000 tests/quarter:

→ The system must not be gameable.

→ Small improvements scale to many teams.


Let’s Simulate The Organisation!

Simulate 100,000 experiments with:

→ Normally distributed true lift with sd 0.01.


→ Stopping criterion of 10,000 clickers in A.
→ Simulate one run so we have an observed lift.
→ Greedy take rule (see the sketch below).
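A minimal sketch of this simulation, assuming a baseline conversion rate of 10% (the slides do not state one) and 10,000 clickers on side B as well as side A:

```python
import numpy as np

rng = np.random.default_rng(0)

BASELINE = 0.10                    # assumed baseline conversion rate
N_EXPERIMENTS = 100_000
N_CLICKERS = 10_000                # stopping criterion: 10,000 clickers in A
                                   # (assumed equal on side B)

# Normally distributed true lift with sd 0.01.
true_lift = rng.normal(0.0, 0.01, size=N_EXPERIMENTS)

# One simulated run per experiment, giving an observed lift.
conv_a = rng.binomial(N_CLICKERS, BASELINE, size=N_EXPERIMENTS)
conv_b = rng.binomial(N_CLICKERS, np.clip(BASELINE + true_lift, 0, 1),
                      size=N_EXPERIMENTS)
observed_lift = (conv_b - conv_a) / N_CLICKERS

# Greedy take rule: ship whenever the observed lift is positive.
taken = observed_lift > 0

print(f"taken: {taken.mean():.1%}")
print(f"mean observed lift of taken tests: {observed_lift[taken].mean():.4f}")
print(f"mean true lift of taken tests:     {true_lift[taken].mean():.4f}")
```

The gap between the observed and true lift of the taken tests is the take bias that the following slides address.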
Inevitable Decision Mistakes

● Two types of errors:


→ Should take, but didn’t (a missed win).
→ Shouldn’t take, but did (a false win; see the sketch below).

● Potential problems:
→ Bad practice: p-hacking.
→ Low-power tests.
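Continuing the simulation sketch above, both error types can be counted directly (the names missed_win and false_win are just illustrative):

```python
# Continuing from the simulation sketch above.
should_take = true_lift > 0
missed_win = should_take & ~taken      # should take, but didn't
false_win = ~should_take & taken       # shouldn't take, but did

print(f"missed wins: {missed_win.mean():.1%} of experiments")
print(f"false wins:  {false_win.mean():.1%} of experiments")
```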
P-Hacking: Take Wins, Drop Losses

Reference: A/B testing for conversion rate, revisited
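To see why taking wins and dropping losses inflates false wins, here is a minimal sketch in an A/A setting (zero true lift), peeking after every batch of clickers and stopping at the first "win"; the baseline, batch size, and peek count are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# A/A setting: the true lift is zero, so every "win" is a false positive.
BASELINE, BATCH, MAX_PEEKS, N_SIMS = 0.10, 1_000, 20, 2_000
false_wins = 0

for _ in range(N_SIMS):
    a = b = n = 0
    for _ in range(MAX_PEEKS):
        a += rng.binomial(BATCH, BASELINE)
        b += rng.binomial(BATCH, BASELINE)
        n += BATCH
        # Two-proportion z-test at this peek; stop at the first "win".
        p = (a + b) / (2 * n)
        se = np.sqrt(2 * p * (1 - p) / n)
        if (b - a) / n > 1.645 * se:   # one-sided 5% threshold
            false_wins += 1
            break

print(f"false-win rate with optional stopping: {false_wins / N_SIMS:.1%}")
```

With an honest fixed-horizon test this rate would be 5%; repeated peeking pushes it far higher.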


Even Worse: Bad Incentives

P-hacking works even better with short tests.

⇒ Causes bad incentives!


What Should EP Do?
● Count losing tests to KPIs?
● Enforce minimum run durations?
→ How long? Does it depend on the specific segments? (See the power sketch below.)
→ What about velocity?
● Reward long run durations?
→ By how much? How to do this from principles?
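One principled way to answer "how long?" is a power calculation: fix the smallest lift worth detecting and solve for the required sample size. A sketch, where min_clickers_per_side is a hypothetical helper and the significance and power levels are conventional defaults, not Agoda's:

```python
import math
from scipy.stats import norm

# Hypothetical minimum-sample-size helper for a two-proportion test.
def min_clickers_per_side(baseline, mde, alpha=0.05, power=0.8):
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p = baseline + mde / 2                 # pooled-rate approximation
    return math.ceil(2 * p * (1 - p) * (z_a + z_b) ** 2 / mde ** 2)

print(min_clickers_per_side(0.10, 0.01))   # detect a 1pp lift off a 10% base
```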
AirBnB’s Solution To Take Bias
The Effects of Variance Reduction
Reducing Variance with CUPED
Intuition for CUPED

The average user visits once every two months:

1. Suppose a user has been consistently visiting every day.
2. When he is allocated to side B, he visits only once a week.
3. Should we reward or penalise B? (See the CUPED sketch below.)
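CUPED answers this by adjusting the in-experiment metric with a pre-experiment covariate, so a user's unusually high baseline is not mistaken for a treatment effect. A minimal sketch with simulated data (the variable names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: pre-experiment visits (covariate x) correlate
# with in-experiment visits (metric y); treatment adds a small lift.
n = 10_000
x = rng.poisson(5, size=n).astype(float)      # pre-period visits
treat = rng.integers(0, 2, size=n)            # A/B assignment
y = 0.8 * x + rng.normal(0, 1, size=n) + 0.05 * treat

# CUPED adjustment: y_cuped = y - theta * (x - mean(x)).
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())

for name, metric in [("raw", y), ("CUPED", y_cuped)]:
    lift = metric[treat == 1].mean() - metric[treat == 0].mean()
    print(f"{name:>5}: lift={lift:.4f}, var={metric.var():.3f}")
```

The adjusted metric has the same expected lift but much lower variance, so the same test detects the same effect with fewer users.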
Conclusion

A/B Testing is Extremely Important

→ Provides accountability and promotes a data-driven culture.


→ It is an instrument for growth and learning.
→ Easy to do, hard to be effective at.
