AB Testing 101
What I wish I knew about AB testing when I started my career
So I should have known a lot about AB testing, right? Wrong! I was surprised
just how little I knew after I joined the AB testing platform company Eppo.
Below is a synthesis of what I wish I’d known about AB testing when I started
my career.
What is AB Testing?
First things first, let’s define what AB testing is (also known as split testing).
To quote Harvard Business Review, “A/B testing is a way to compare two
versions of something to figure out which performs better.”
[Image: an example A/B test of the "YourDelivery" home page. Source: https://vwo.com/blog/ab-testing-examples/]
For the rest of the article, I’ll be assuming we’re working on AB testing a web
or mobile product.
At its core, AB testing a product has two components:

- Feature flagging and randomization: how you determine who sees what variant (e.g., which experience or features)
- Statistics: the fancy math for determining whether a metric is better for one variant than another
Randomization is about “rolling the dice” and figuring out which variant the
user sees. It sounds simple but it’s actually complicated to do well. Let’s start
with the “naive” version of randomization to illustrate the complexity.
A user hits the home page and we have a test we’re running on the main
copy, with variants control and test. The backend has to determine which
one to render to the user. Here’s how the randomization works for a user
visiting the home page:
1. Look up the user's variant from the database. If it exists, use the recorded variant. If it doesn't exist, then…
2. Roll the dice to randomly pick a variant.
3. Save the variant to the database (so it can be looked up in step 1 the next time the user visits the home page).
4. Use the variant to determine what copy to render on the home page.
This flow has a subtle but serious flaw: a race condition. If a user opens your home page multiple times at the same time (say, from a Chrome session restore), then it's indeterminate which variant the user is going to see and have recorded. This is because the lookups from step 1 may all return nothing, so each page load rolls the dice and assigns a different variant. It's random which one actually wins the race to be saved to the database, and the user is potentially served different experiences.
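To make the race concrete, here's a minimal sketch of the naive flow in Python (the in-memory "database" and helper names are my own illustration, not from any particular system):

import random

def get_variant_naive(db, user_id, experiment, variants=("control", "test")):
    # Step 1: look up a previously recorded variant.
    variant = db.get((user_id, experiment))
    if variant is not None:
        return variant
    # Step 2: roll the dice.
    variant = random.choice(variants)
    # Steps 3-4: save and return. Two concurrent requests can both reach
    # this line holding different variants; whichever write lands last
    # "wins", and the user may already have been shown the other experience.
    db[(user_id, experiment)] = variant
    return variant

db = {}  # stand-in for a transactional database table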
After you’ve computed the variant, you log the result, but instead of writing
to a transactional database, which is blocking, you write the result to a data
firehose (such as AWS Kinesis) in a non-blocking way. Eventually the data
makes its way into a table in your data lake/warehouse for analysis (often
called the “assignments” table).
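The way to avoid the race condition, and the "hashing logic" I'm about to mention, is to make assignment deterministic: instead of rolling the dice and saving the result, you hash the user ID together with the experiment key and map the hash onto the traffic split. Every page load then computes the same variant with no database read or write. A minimal sketch (the names here are my own illustration):

import hashlib

def get_variant_hashed(user_id, experiment, variants=("control", "test")):
    # Deterministic: the same user + experiment always hashes to the same
    # bucket, so concurrent page loads can't disagree.
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)  # ~even split across variants
    return variants[bucket]

# The assignment event is still logged asynchronously (e.g., to Kinesis)
# so it lands in the warehouse "assignments" table for analysis.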
Ok, so why do I need a feature flagging tool? Can't I just implement this hashing logic myself? Yes, you could (and we did at Storyblocks back in the day), but there are some downsides, chief among them that a pure hashing scheme gives you no built-in way to opt specific users or groups into a particular variant.
So what do we do? Feature flagging! I won’t go into it in full detail here, but
feature flagging solves these issues for us, by combining the best of both
worlds: the ability to opt specific groups of users into a test and the ability to
randomize everyone else. There's a great Eppo blog post that describes what goes into building a global feature flagging service if you want to learn more.
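In miniature, flag evaluation looks something like this (a sketch reusing the hashing helper above; the override mechanism is my own illustration of the idea):

def evaluate_flag(user_id, experiment, overrides):
    # Targeted users (e.g., internal QA) are forced into a specific variant...
    if user_id in overrides:
        return overrides[user_id]
    # ...and everyone else is randomized deterministically.
    return get_variant_hashed(user_id, experiment)

variant = evaluate_flag("u123", "home_page_copy", {"qa_user": "test"})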
Metrics
Metrics are probably the easiest part of AB testing to understand. Each
business or product typically comes with its own set of metrics that define
user engagement, financial performance and anything else you can measure
that will help drive business strategy and decisions. For Storyblocks, a stock media site, those included 30-day revenue for a new signup (financial), downloads (engagement), search speed (performance), net promoter score (customer satisfaction), and many more.
The naive approach here is simply to join your assignments table to other tables in your database to compute metric values for each of the users in your experiment: picture one query per metric, each joining the assignments table to the relevant events table and aggregating per user. There are two big problems with this approach:

1. As your user base grows, the ad hoc joins to your assignments table become repetitive and expensive.
2. As the number of metrics grows, the SQL to compute them becomes hard to manage (just imagine having 50 or even 1,000 of these).
The fix is to compute assignments once and build your metrics on top of a shared event/fact layer. Let me explain the event/fact layer in more detail. A critical aspect of making
metrics easily reproducible and measurable is to base them on events or
“facts” that occur in the product or business. These should be immutable
and have a timestamp associated with them. At Storyblocks those facts
included subscription payments, downloads, page views, searches and the
like. The metric for 30-day revenue for a new signup is simply an operation
(sum) on top of a fact (subscription payments). Number of searches is simply
a count of the number of search events. And so on. A company like Eppo
makes these facts and other definitions a core part of your AB testing
infrastructure and also provides the capabilities for computing assignments
once and building out a fact/metric repository.
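As a toy illustration of facts versus metrics (the schema here is my own, not Storyblocks'):

from datetime import datetime, timedelta

# Immutable, timestamped facts
payments = [
    {"user_id": "u1", "ts": datetime(2024, 1, 3), "amount": 15.0},
    {"user_id": "u1", "ts": datetime(2024, 3, 9), "amount": 15.0},
]
signups = {"u1": datetime(2024, 1, 1)}

def revenue_30d(user_id):
    # Metric = an operation (sum) over a fact (subscription payments)
    cutoff = signups[user_id] + timedelta(days=30)
    return sum(p["amount"] for p in payments
               if p["user_id"] == user_id and p["ts"] < cutoff)

print(revenue_30d("u1"))  # 15.0 -- only the payment within 30 days counts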
Statistics
Ok, statistics. This is the hardest part for someone new to AB testing to
understand. You’ve probably heard that we want a p-value to be less than
0.05 for a given metric difference to be statistically significant but you might
not know much else. So I’m going to start with the naive approach that you
can find in a statistics 101 textbook. Then I’ll show what’s wrong with the
naive approach. Finally, I’ll explain the approach you should be taking.
There will also be a bonus section at the end.
Let’s assume we’re running the home page test for YourDelivery shown
above, with two variants control (blue) and test (red) with an even 50/50 split
between them. Let’s also assume we’re only looking at one metric, revenue.
Every user that visits the home page will be assigned to one of the variants
and then we can compute the revenue metric for each user. How do we
determine if there’s a statistically significant difference between test and
control? The naive approach is simply to use a Student t-test to check if
there’s a statistical difference. You compute the mean and standard deviation
for test and control, plug them into the t-statistic formula, compare that
value to a critical value you look up, and voila, you know if your metric, in
this case revenue, is statistically different between the groups.
Let's dive into the details. The formula for the classic t-statistic is as follows:

$$t = \frac{\bar{x}_T - \bar{x}_C}{\sqrt{\frac{s_T^2}{n_T} + \frac{s_C^2}{n_C}}}$$

where $\bar{x}$, $s$, and $n$ are the sample mean, sample standard deviation, and sample size for the test ($T$) and control ($C$) groups.
To look up the critical value for a given significance level (typically 5%), you
need to know the degrees of freedom. However for large sample sizes that
we typically have when we’re AB testing, the t-distribution converges to the
normal distribution, so we can just use that to look up the critical value. The parameters of that normal distribution under the null hypothesis (i.e., there is no difference between the groups) are a mean of zero and a standard deviation equal to the standard error from the t-statistic's denominator:

$$\sigma_{null} = \sqrt{\frac{s_T^2}{n_T} + \frac{s_C^2}{n_C}}$$
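For illustration, here's the naive test in a few lines of Python (my own sketch, using the large-sample normal approximation for the p-value):

from math import sqrt
from statistics import NormalDist, mean, stdev

def naive_t_test(control, test):
    # t-statistic from the formula above
    se = sqrt(stdev(test) ** 2 / len(test) + stdev(control) ** 2 / len(control))
    t_stat = (mean(test) - mean(control)) / se
    # two-sided p-value under the normal approximation
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))

# significant at the 5% level if naive_t_test(control, test) < 0.05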
The naive approach seems sound, right? After all, it's following textbook statistics. However there are some major downsides, the biggest being the "peeking problem": the classic t-test only guarantees its stated false-positive rate if you look at the results exactly once (i.e., at a fixed sample size). More details below.
So if you're running a test for 2+ weeks and checking results daily, then to get a true 5% false-positive rate you need to tighten your significance threshold to 1% or less. That's a pretty big change, representing at least a full standard deviation of difference from the naive approach.
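You can see the peeking problem for yourself with a quick simulation (my own sketch): run an A/A test where there is truly no difference, "check" the results every day at a 5% threshold, and count how often you falsely declare significance.

import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def p_value(a, b):
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return 2 * (1 - NormalDist().cdf(abs(mean(a) - mean(b)) / se))

false_positives = 0
for _ in range(1_000):
    a, b = [], []
    for day in range(14):  # 2-week test, peeking daily
        a += [random.gauss(0, 1) for _ in range(100)]
        b += [random.gauss(0, 1) for _ in range(100)]
        if p_value(a, b) < 0.05:  # both groups share one distribution,
            false_positives += 1  # so any "significance" is a false positive
            break

print(false_positives / 1_000)  # well above the nominal 5%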
Ok, now that we know some pitfalls of the naive approach, let's outline key aspects of the way we should approach the statistics for our AB testing (I'll include more info about each below the list in separate sections):

- Report relative lifts instead of raw differences in metric means
- Report confidence intervals instead of bare p-values
- Use sequential statistics so it's safe to peek at results
- Reduce variance with techniques like CUPED
- Consider Bayesian statistics as an alternative to the frequentist approach
The rationale behind relative lifts is straightforward: we typically care about relative changes instead of absolute changes, and they're easier to discuss. It's easier to understand a "5% increase in revenue" than a "$5 increase in revenue per user".
How does the math change for relative lifts? I'm going to borrow from Eppo's documentation on the subject. First, let's define relative lift:

$$\text{lift} = \frac{\bar{x}_T - \bar{x}_C}{\bar{x}_C}$$

From the central limit theorem, we know that the treatment and control means are normally distributed for large sample sizes. This allows us to model the relative lift as a normal distribution whose mean is the observed lift and whose variance, via the standard delta-method approximation, is:

$$\operatorname{Var}(\text{lift}) \approx \frac{1}{\bar{x}_C^2} \cdot \frac{s_T^2}{n_T} + \frac{\bar{x}_T^2}{\bar{x}_C^4} \cdot \frac{s_C^2}{n_C}$$
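Here's a rough sketch of how those parameters turn into a confidence interval (my own illustration of the delta-method approximation above, not Eppo's exact implementation):

from math import sqrt

def relative_lift_ci(mean_c, sd_c, n_c, mean_t, sd_t, n_t, z=1.96):
    # 95% CI for relative lift using the delta-method variance above
    lift = (mean_t - mean_c) / mean_c
    var = (sd_t ** 2 / n_t) / mean_c ** 2 \
        + (mean_t ** 2 / mean_c ** 4) * (sd_c ** 2 / n_c)
    half_width = z * sqrt(var)
    return lift - half_width, lift + half_width

# e.g., control at $10.00 (sd $4) vs. treatment at $10.59 (sd $4), 50k users each
print(relative_lift_ci(10.0, 4.0, 50_000, 10.59, 4.0, 50_000))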
First, let’s start with the confidence interval using a visual representation
from an Eppo experiment dashboard:
[Image: confidence interval for a single metric, from an Eppo experiment dashboard]
So you can see that the “point estimate” is a 5.9% lift, with a confidence
interval of ~2.5% on either side representing where the true relative lift
should be 95% (one minus the typical significance of 5%) of the time. These
are much easier for non-statisticians to interpret than p-values — the visuals
really help illustrate the data and statistics together.
Next up is variance reduction with CUPED (Controlled-experiment Using Pre-Experiment Data), a technique that uses each user's pre-experiment data to strip out variance the experiment didn't cause, tightening your confidence intervals without collecting more data (source: https://docs.geteppo.com/statistics/cuped/). The math is complicated so I won't bore you with the details. Just know that powerful AB testing platforms like Eppo provide CUPED implementations out of the box.
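If you're curious, though, the core of the textbook CUPED adjustment fits in a few lines (my own sketch, not necessarily Eppo's exact implementation):

from statistics import mean

def cuped_adjust(post, pre):
    # post[i], pre[i]: the same user's metric during and before the experiment
    x_bar, y_bar = mean(pre), mean(post)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(pre, post))
    var = sum((x - x_bar) ** 2 for x in pre)
    theta = cov / var
    # subtract the part of the metric explained by pre-experiment behavior
    return [y - theta * (x - x_bar) for x, y in zip(pre, post)]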
Sequential confidence intervals tackle the peeking problem head-on: the intervals are widened just enough that you can check results as often as you like without inflating the false-positive rate. While I didn't fully write out the math for sequential confidence intervals, know that we need to compute the number of users, the mean, and the standard deviation of each group, treatment and control, and we can plug those into the various formulas.
As you can see, with the textbook two-pass formula for the sample variance,

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

we must first compute the mean and then go back and compute the standard deviation. That's computationally expensive because it requires two passes over the data. But there's a reformulation we can employ to do the computation in one pass. Expanding the square and simplifying:

$$s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \right)$$
Ok, that looks pretty complicated. The original formula seems simpler.
However you’ll notice we can compute these sums in one pass. In SQL it’s
something like:
-- one pass over the data: n, sum(x), and sum(x^2) are enough to
-- recover both the mean and the variance
SELECT count(*) as n
, sum(revenue) as revenue
, sum(revenue * revenue) as revenue_2
FROM user_metric_dataframe
Perhaps you've heard of Bayes' theorem before, but you've likely not heard of Bayesian statistics. I certainly hadn't until I arrived at Eppo. I won't go into the details but will try to provide a brief overview.
In Bayesian statistics, you have a belief about your population and then the observed data. Let's simplify this to "belief" and "data" and write Bayes' theorem slightly differently:

$$P(\text{belief} \mid \text{data}) = \frac{P(\text{data} \mid \text{belief}) \cdot P(\text{belief})}{P(\text{data})}$$

Here $P(\text{belief})$ is the prior, $P(\text{data} \mid \text{belief})$ is the likelihood, $P(\text{belief} \mid \text{data})$ is the posterior, and $P(\text{data})$ is the normalization factor.
So basically you use the likelihood to update your prior, giving you the
posterior probability (ignoring for a moment the normalization factor, which
is generally hard to compute).
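To see the update in action, here's a toy example with a conversion-rate metric and a conjugate Beta prior (my own illustration; for this prior/likelihood pair the normalization factor never needs to be computed explicitly):

# Prior belief about a conversion rate: Beta(a, b). Observing k conversions
# out of n users gives the posterior Beta(a + k, b + n - k).
def update_beta(a, b, k, n):
    return a + k, b + n - k

prior = (1.0, 1.0)  # uniform: no strong belief either way
posterior = update_beta(*prior, k=120, n=1000)
print(posterior)  # (121.0, 881.0) -- belief now concentrated near 12%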
Drawing conclusions
Drawing conclusions is the art of AB testing. Sometimes the decision is easy:
diagnostics are all green, your experiment metrics moved in the expected
direction and there were no negative impacts on guardrail metrics. However,
studies show that only around 1/3 of experiments produce positive results. A
lot of experiments might look similar to the report card below for a “New
User Onboarding” test:
[Report card: the primary metric "Total Upgrades to Paid Plan" at 1.76 (control) vs. 1.87 (treatment), a +5.90% lift, shown above a set of core user metrics, including "Site creations" down roughly 10%, with lift confidence intervals plotted from -20% to +20%.]
The primary metric “Total Upgrades to Paid Plan” is up ~6% while there are
some negative impacts such as “Site creations” being down ~10%. So what do
you do? Ultimately, there’s no right answer. It’s up to you and your team to
make the tough calls in situations like this.
There are also a variety of other ways your data could be off: one or more of your metrics may not have data; there could be a sample ratio mismatch (SRM) for a particular dimension value; there may not be any assignments at all; there might be an imbalance of pre-experiment data across variants; and more.
Tools like Eppo make your life easier by providing easy-to-understand dashboards that are refreshed nightly. So you can grab your cup
of coffee, open up your experiment dashboard, check on your experiments,
and potentially start making decisions (or at least monitoring to make sure
you haven’t broken something).
There are a number of AB testing platforms out there, including:

- Eppo
- LaunchDarkly
- Split
- Statsig
- Growthbook
- Optimizely
Analyzing each of these platforms is beyond the scope of this article. Given
all the requirements for an AB testing platform outlined above, however, I
can confidently say that Eppo (even though I may be slightly biased because
I work there) is the best all-in-one platform for companies that have their
data centralized in a modern data warehouse (Snowflake, BigQuery,
Redshift, or Databricks) and are looking to run product tests on web or
mobile, including tests of ML/AI systems. Eppo provides a robust, global
feature flagging system, a fact/metric repository management layer,
advanced statistics, nightly computations of experiment results, detailed
experiment diagnostics, and a user-friendly opinionated UX that is easy to
use even for non-statisticians. If you’re just looking to run simple marketing
copy tests, then a tool like Optimizely is probably better for you though it’ll
be pretty expensive.
Recommended reading
There’s a lot out there to read about AB testing. Here are some of my
recommendations:
- Eppo's blog
- Optimizely's blog