
AB Testing 101
What I wish I knew about AB testing when I started my career

Jonathan Fulton

Published in Jonathan’s Musings · 17 min read · Aug 25

AB testing launches your product and company to the next level

I started my career as a software engineer at Applied Predictive Technologies (APT), which sold multi-million dollar contracts for sophisticated AB testing software to Fortune 500 clients and was acquired by Mastercard in 2015 for $600 million. So I’ve been involved in AB testing since the beginning of my career.

A few years later I was VP of Engineering at Storyblocks, where I helped build our online platform for running AB tests to scale our revenue from $10m to $30m+. Next, at Foundry.ai as VP of Engineering, I helped build a multi-armed bandit for media sites. Then a few years later I helped stand up AB testing as an architect at ID.me, a company valued at over $1.5B with $100m+ in ARR.

So I should have known a lot about AB testing, right? Wrong! I was surprised
just how little I knew after I joined the AB testing platform company Eppo.
Below is a synthesis of what I wish I’d known about AB testing when I started
my career.

What is AB Testing?
First things first, let’s define what AB testing is (also known as split testing).
To quote Harvard Business Review, “A/B testing is a way to compare two
versions of something to figure out which performs better.”

Here’s a visual representation of an AB test of a web page.

Source: https://vwo.com/blog/ab-testing-examples/

The fake “YourDelivery” company is testing to figure out which variant is going to lead to more food delivery orders. Whichever variant wins will be rolled out to the entire population of users.

For the rest of the article, I’ll be assuming we’re working on AB testing a web
or mobile product.

Ok, so let’s define what goes into running an AB test:

Feature flagging and randomization: how you determine who sees what
variant (e.g., which experience or features)

Metrics: what we measure to determine which variant wins and that we don’t inadvertently break anything in the process

Statistics: the fancy math way of determining whether or not a metric for a given variant is better than another variant

Drawing conclusions: how we make decisions from the metrics and statistics

Let’s walk through each of these systematically.

Feature flagging and randomization


Feature flagging is used to enable or disable a feature for a given user. And
we can think of each AB test as a flag that determines which variant a user
sees. That’s where randomization comes in.

Randomization is about “rolling the dice” and figuring out which variant the
user sees. It sounds simple but it’s actually complicated to do well. Let’s start
with the “naive” version of randomization to illustrate the complexity.

Naive version of randomization

A user hits the home page and we have a test we’re running on the main
copy, with variants control and test. The backend has to determine which
one to render to the user. Here’s how the randomization works for a user
visiting the home page:

1. Look up the user’s variant from the database. If it exists, use the recorded variant. If it doesn’t exist, then…

2. Roll the dice. Execute Math.random() (or whatever randomization function exists for your language).

3. Assign the variant. Compare the number returned from Math.random(). If it’s < 0.5, assign the control variant. Otherwise assign the test variant.

4. Save the variant to the database (so it can be looked up in step 1 the next time the user visits the home page).

5. Use the variant to determine what copy to render on the home page.

6. Render the home page.
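To make those six steps concrete, here’s a minimal sketch in TypeScript. The in-memory Map stands in for the transactional database, and the function names and 50/50 split are invented for illustration — this isn’t code from Storyblocks or any particular framework.

// Naive randomization sketch: look up, roll the dice, persist, render.
// An in-memory Map stands in for the assignments table so the example is self-contained.
type Variant = 'control' | 'test';

const assignmentStore = new Map<string, Variant>(); // stand-in for the database table

function assignVariantNaive(userId: string, experimentId: string): Variant {
  const key = `${experimentId}:${userId}`;

  // 1. Look up a previously recorded variant; reuse it if found
  const existing = assignmentStore.get(key);
  if (existing) return existing;

  // 2 & 3. Roll the dice and assign: < 0.5 means control, otherwise test
  const variant: Variant = Math.random() < 0.5 ? 'control' : 'test';

  // 4. Persist so the next page load (step 1) sees the same variant
  assignmentStore.set(key, variant);

  // 5 & 6. The caller uses the returned variant to pick the copy and render the page
  return variant;
}

Note that two concurrent requests for the same user can both miss the lookup and roll different dice, which is exactly the race condition described in downside 3 below.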

Simple enough. This is actually what we implemented at Storyblocks back in 2014 when we first started AB testing. It works but it has some noticeable downsides:

1. Hard dependency on reading from and writing to a transactional database. If you have a test running on your home page, there’s potentially a lot of write volume going to your testing table. And if you have two, three, four or more tests running on that page, then multiply that load. The app is also trying to read from this table with every page load, so there’s a lot of read/write contention. We crashed our app a few times at Storyblocks due to database locks related to this high write volume, particularly when a new test was added to a high-traffic page.

2. Primarily works with traditional architecture where the server renders an HTML page. It doesn’t work very well for single page or mobile apps because every test you’re running would need a blocking network call before it could render its content appropriately.

3. Race condition. If a user opens your home page multiple times at the same time (say from a Chrome restore), then it’s indeterminate which variant the user is going to see and have recorded. This is because the lookups from step 1 may all return nothing, so each page load rolls the dice and assigns a different variant. It’s random which one actually wins the race to be saved to the database, and the user is potentially served different experiences.

Improved randomization via hashing

So how do we do randomization better? The simple answer: hashing. Instead of simply rolling the dice using Math.random(), we hash the combination of the experiment identifier and the user identifier using something like MD5, which effectively creates a consistent random number for the experiment/user combination. We then take the first few bytes and modulo by a relatively large number (say 10,000). Then divide your variants across these 10,000 “shards” to determine which variant to serve. (If you’re interested in actually seeing some code for this, you can check out Eppo’s SDK here.) Here’s what that looks like in a diagram with 10 shards.
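As a rough illustration of the hashing itself (this is not Eppo’s SDK code; the shard count and 50/50 split are just assumptions for the example), here’s what it might look like in TypeScript using Node’s built-in crypto module:

import { createHash } from 'crypto';

type Variant = 'control' | 'test';

const TOTAL_SHARDS = 10_000; // any reasonably large number works

// Hash experiment + user into a stable shard in [0, TOTAL_SHARDS)
function getShard(experimentId: string, userId: string): number {
  const digest = createHash('md5').update(`${experimentId}-${userId}`).digest();
  // Take the first 4 bytes as an unsigned integer, then modulo into shards
  return digest.readUInt32BE(0) % TOTAL_SHARDS;
}

// Divide the shard space across variants (an even 50/50 split here)
function assignVariantHashed(experimentId: string, userId: string): Variant {
  return getShard(experimentId, userId) < TOTAL_SHARDS / 2 ? 'control' : 'test';
}

Because the hash only depends on the experiment/user pair, the same user always lands in the same shard, with no lookup or write on the request path.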

After you’ve computed the variant, you log the result, but instead of writing
to a transactional database, which is blocking, you write the result to a data
firehose (such as AWS Kinesis) in a non-blocking way. Eventually the data
makes its way into a table in your data lake/warehouse for analysis (often
called the “assignments” table).

Ok, so why do I need a feature flagging tool? Can’t I just implement this hashing logic myself? Yes, you could (and we did at Storyblocks back in the day), but there are some downsides:

1. There’s still a dependency on a database/service to store the test configuration. The naive implementation requires fetching the data each time you need to randomize a user into a variant. This is challenging for SPAs and mobile apps.

2. There’s no good way to opt users in to a specific variant. This is often necessary for testing or rollout purposes (e.g., someone should be able to test each variant). The reason for this is that with only hashing, the variant is fully determined by the experiment/user combination.

The answer to randomization: feature flagging

So what do we do? Feature flagging! I won’t go into it in full detail here, but feature flagging solves these issues for us by combining the best of both worlds: the ability to opt specific groups of users into a test and the ability to randomize everyone else. There’s a great Eppo blog post that describes what goes into building a global feature flagging service if you want to learn more.

A feature flagging tool determines which variant a user sees
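As a hedged sketch of that idea (the config shape and function names below are invented for illustration, not any particular vendor’s API), flag evaluation might check explicit overrides first and fall back to hash-based randomization for everyone else:

// Variant and assignVariantHashed come from the hashing sketch earlier in the article
interface FlagConfig {
  experimentId: string;
  overrides: Record<string, Variant>; // e.g., QA or internal users pinned to a variant
}

function evaluateFlag(config: FlagConfig, userId: string): Variant {
  // Opted-in users (QA, internal testers) get their pinned variant
  const override = config.overrides[userId];
  if (override) return override;

  // Everyone else is randomized deterministically via hashing
  return assignVariantHashed(config.experimentId, userId);
}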

Metrics
Metrics are probably the easiest part of AB testing to understand. Each
business or product typically comes with its own set of metrics that define
user engagement, financial performance and anything else you can measure
that will help drive business strategy and decisions. For Storyblocks, a stock
media site, that was 30-day revenue for a new signup (financial), downloads
(engagement), search speed (performance), net promoter score (customer
satisfaction) and many more.

The naive approach here is simply to join your assignments table to other
tables in your database to compute metric values for each of the users in
your experiment. Here are some illustrative queries:

SELECT a.user_id, a.variant, SUM(p.revenue) AS revenue
FROM assignments a
JOIN purchases p
  ON a.user_id = p.user_id
WHERE a.experiment_id = 'some-experiment'
  AND p.purchased_at >= a.assigned_at
GROUP BY a.user_id, a.variant

SELECT a.user_id, a.variant, COUNT(*) AS num_page_views
FROM assignments a
JOIN page_views p
  ON a.user_id = p.user_id
WHERE a.experiment_id = 'some-experiment'
  AND p.viewed_at >= a.assigned_at
GROUP BY a.user_id, a.variant

-- etc.

This becomes cumbersome for a few reasons:

1. As your user base grows, the ad hoc joins to your assignments table
become repetitive and expensive

2. As the number of metrics grows, the SQL to compute them becomes hard
to manage (just imagine having 50 or even 1000 of these)

3. If the underlying data varies with time, it becomes hard to reproduce results

So to scale your AB testing, you need a system with the following:

1. The ability to compute the assignments for a particular experiment once per computation cycle

2. A repository to manage the SQL defining your metrics

3. An immutable event or “fact” layer to define your underlying primitives to compute your metrics

Let me explain the event/fact layer in more detail. A critical aspect to making
metrics easily reproducible and measurable is to base them on events or
“facts” that occur in the product or business. These should be immutable
and have a timestamp associated with them. At Storyblocks those facts
included subscription payments, downloads, page views, searches and the
like. The metric for 30-day revenue for a new signup is simply an operation
(sum) on top of a fact (subscription payments). Number of searches is simply
a count of the number of search events. And so on. A company like Eppo
makes these facts and other definitions a core part of your AB testing
infrastructure and also provides the capabilities for computing assignments
once and building out a fact/metric repository.
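To make the fact/metric split concrete, here’s one hypothetical way you might declare a fact and a metric as configuration. The field names are made up for illustration and are not Eppo’s actual schema.

// Facts: immutable, timestamped events emitted by the product
interface FactDefinition {
  name: string;
  table: string;           // warehouse table holding the events
  timestampColumn: string; // when the event occurred
  valueColumn?: string;    // optional numeric value (e.g., payment amount)
}

// Metrics: an aggregation over a fact
interface MetricDefinition {
  name: string;
  fact: string;               // which fact to aggregate
  operation: 'sum' | 'count'; // e.g., sum of payments, count of searches
  windowDays?: number;        // e.g., 30-day revenue for a new signup
}

const subscriptionPayment: FactDefinition = {
  name: 'subscription_payment',
  table: 'analytics.subscription_payments',
  timestampColumn: 'paid_at',
  valueColumn: 'amount',
};

const thirtyDayRevenue: MetricDefinition = {
  name: '30_day_revenue',
  fact: 'subscription_payment',
  operation: 'sum',
  windowDays: 30,
};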

An important aspect of configuring an experiment is defining primary and guardrail metrics. The primary metric for an experiment is the metric most closely associated with what you’re trying to test. So for the homepage refresh of YourDelivery where you’re testing blue vs red background colors, your primary metric is probably revenue. Guardrail metrics are things that you typically aren’t trying to change, but you’re going to measure them to make sure you don’t negatively impact user experience. Stuff like time on site, page views, etc.

Statistics
Ok, statistics. This is the hardest part for someone new to AB testing to
understand. You’ve probably heard that we want a p-value to be less than
0.05 for a given metric difference to be statistically significant but you might
not know much else. So I’m going to start with the naive approach that you
can find in a statistics 101 textbook. Then I’ll show what’s wrong with the
naive approach. Finally, I’ll explain the approach you should be taking.
There will also be a bonus section at the end.

The naive approach: the Student t-test

Let’s assume we’re running the home page test for YourDelivery shown
above, with two variants control (blue) and test (red) with an even 50/50 split
between them. Let’s also assume we’re only looking at one metric, revenue.
Every user that visits the home page will be assigned to one of the variants
and then we can compute the revenue metric for each user. How do we
determine if there’s a statistically significant difference between test and
control? The naive approach is simply to use a Student t-test to check if
there’s a statistical difference. You compute the mean and standard deviation
for test and control, plug them into the t-statistic formula, compare that
value to a critical value you look up, and voila, you know if your metric, in
this case revenue, is statistically different between the groups.

Let’s dive into the details. The formula for the classic two-sample t-statistic is:

t = \frac{\bar{x}_T - \bar{x}_C}{\sqrt{\dfrac{s_T^2}{n_T} + \dfrac{s_C^2}{n_C}}}

where \bar{x}_T and \bar{x}_C are the sample means of the test and control groups, s_T and s_C are their sample standard deviations, and n_T and n_C are their sample sizes.

To look up the critical value for a given significance level (typically 5%), you need to know the degrees of freedom. However, for the large sample sizes we typically have when AB testing, the t-distribution converges to the normal distribution, so we can just use that to look up the critical value. Under the null hypothesis (i.e., there is no difference between the groups), the difference in means \bar{x}_T - \bar{x}_C is approximately normally distributed with mean 0 and variance \dfrac{s_T^2}{n_T} + \dfrac{s_C^2}{n_C}.
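For concreteness, here’s a small sketch of that naive calculation from per-group summary statistics. The type and function names are made up for illustration, and the normal CDF uses a standard error-function approximation rather than a statistics library.

interface GroupStats {
  n: number;    // sample size
  mean: number; // sample mean of the metric
  sd: number;   // sample standard deviation
}

// Standard normal CDF via the Abramowitz-Stegun erf approximation
function normalCdf(z: number): number {
  const t = 1 / (1 + (0.3275911 * Math.abs(z)) / Math.SQRT2);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-(z * z) / 2);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Naive two-sample test on the difference in means (t ~ normal for large samples)
function naiveTest(control: GroupStats, test: GroupStats) {
  const diff = test.mean - control.mean;
  const se = Math.sqrt(test.sd ** 2 / test.n + control.sd ** 2 / control.n);
  const tStat = diff / se;
  const pValue = 2 * (1 - normalCdf(Math.abs(tStat)));
  return { diff, tStat, pValue };
}

// Example: naiveTest({ n: 10000, mean: 4.9, sd: 12 }, { n: 10000, mean: 5.3, sd: 12.5 })
// gives a p-value of roughly 0.02, which the naive approach would call significant at 5%.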

At Storyblocks this is the approach we used. Since we wanted to track how the test was performing over time, we would plot the lift and p-value over time and use that for making decisions.

Example plot of lift and p-value over time

What’s wrong with the naive approach

The naive approach seems sound, right? After all, it’s following textbook statistics. However, there are a few major downsides:

1. The “Peeking Problem”. The classic t-test only guarantees its statistical significance if you look at the results once (e.g., a fixed sample size). More details below.

2. P-values are notoriously prone to misinterpretation, have arbitrary thresholds (e.g., 5%), and do not indicate effect size.

3. Absolute vs relative differences. The classic t-test looks at absolute differences instead of relative differences.

The Peeking Problem

Using the naive t-test approach, we thought we were getting a 5% significance level. However, the classic t-test only provides the advertised statistical significance guarantees if you look at the results once (in other words, you pre-determine a fixed sample size). Evan Miller writes a great blog post about this problem that I highly recommend reading to understand more. Below is a table from Evan’s blog post illustrating how bad the peeking problem is.

Illustration of how peeking impacts significance

So if you’re running a test for 2+ weeks and checking results daily, then to get a true 5% significance level you need to tighten your significance threshold to ≤ 1%. That’s a pretty big change: the critical value rises from roughly 2.0 to roughly 2.6 standard errors compared to the naive approach.

The approach you should take

Ok, now that we know some pitfalls of the naive approach, let’s outline key
aspects of the way we should approach the statistics for our AB testing (I’ll
include more info about each below the list in separate sections).

1. Relative lifts. Look at relative lifts instead of absolute differences.

2. Sequential confidence intervals. These confidence intervals give you significance guarantees that hold across all of time, so you can peek at results as much as you want, and they’re easier to interpret than p-values.

3. Controlled-Experiment Using Pre-Experiment Data (CUPED). We can actually use sophisticated methods that leverage pre-experiment data to reduce our variance, thus shrinking our confidence intervals and speeding up tests.

(1) Relative lifts

The rationale behind relative lifts is straightforward: we typically care about relative changes instead of absolute changes, and they’re easier to discuss. It’s easier to understand a “5% increase in revenue” compared to a “$5 increase in revenue per user”.

How does the math change for relative lifts? I’m going to quote from Eppo’s documentation on the subject. First, let’s define relative lift:

\Delta_{\%} = \frac{\bar{x}_T - \bar{x}_C}{\bar{x}_C}

where \bar{x}_T and \bar{x}_C are the sample means of the treatment and control groups.

From the central limit theorem, we know that the treatment and control means are normally distributed for large sample sizes. This allows us to model the relative lift as (approximately) normally distributed, with mean \Delta_{\%} and a variance derived via the delta method:

\mathrm{Var}(\Delta_{\%}) \approx \frac{1}{\bar{x}_C^2}\cdot\frac{s_T^2}{n_T} + \frac{\bar{x}_T^2}{\bar{x}_C^4}\cdot\frac{s_C^2}{n_C}

Ok, that’s somewhat complicated. But it’s necessary to compute the sequential confidence intervals.

(2) Sequential confidence intervals

First, let’s start with the confidence interval using a visual representation from an Eppo experiment dashboard:

Confidence interval for a single metric

So you can see that the “point estimate” is a 5.9% lift, with a confidence
interval of ~2.5% on either side representing where the true relative lift
should be 95% (one minus the typical significance of 5%) of the time. These
are much easier for non-statisticians to interpret than p-values — the visuals
really help illustrate the data and statistics together.

So what are sequential confidence intervals? Simply put, they’re confidence intervals that hold to a certain confidence level over all of time. They solve the “peeking problem” so you can look at your results as often as you want knowing that your significance level holds. The math here is super tricky so I’ll simply refer you to Eppo’s documentation on the subject if you’re interested in learning more.

(3) Controlled-Experiment Using Pre-Experiment Data (CUPED)

Sequential confidence intervals are wider than their fixed-sample counterparts, so it’s harder for metrics to reach statistical significance. Enter “Controlled-Experiment Using Pre-Experiment Data” (commonly called CUPED), a method for reducing variance by using pre-experiment data. In short, we can leverage what we know about user behavior before an experiment to help predict the relative lift more accurately. Visually, it looks something like the following:

Source: https://docs.geteppo.com/statistics/cuped/

The math is complicated so I won’t bore you with the details. Just know that
powerful AB testing platforms like Eppo provide CUPED implementations
out of the box.
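If you’re curious what the core idea looks like, here’s a stripped-down sketch of the standard CUPED adjustment: shift each user’s metric by theta times their deviation from the average pre-experiment value, with theta = cov(X, Y) / var(X). Real platforms do considerably more, so treat this as illustrative only.

// CUPED: adjust each user's metric using their pre-experiment value to reduce variance
function cupedAdjust(metric: number[], preMetric: number[]): number[] {
  const n = metric.length;
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const meanY = mean(metric);
  const meanX = mean(preMetric);

  // theta = cov(X, Y) / var(X), estimated across all users in the experiment
  let cov = 0;
  let varX = 0;
  for (let i = 0; i < n; i++) {
    cov += (preMetric[i] - meanX) * (metric[i] - meanY);
    varX += (preMetric[i] - meanX) ** 2;
  }
  const theta = varX === 0 ? 0 : cov / varX;

  // The adjusted metric keeps the same mean but (usually) has lower variance
  return metric.map((y, i) => y - theta * (preMetric[i] - meanX));
}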

Bonus material — simplifying computation

While I didn’t fully write out the math for sequential confidence intervals, know that we need to compute the number of users, the mean, and the standard deviation of each group, treatment and control, and we can plug those into the various formulas.

First, the means are relatively simple to compute:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

The standard deviation is slightly harder to compute but is defined as follows:

s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}

As you can see, we must first compute the mean and then go back and compute the standard deviation. That’s computationally expensive because it requires two passes. But there’s a reformulation we can employ to do the computation in one pass. Let me derive it for you:

\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 = \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}

so

s = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right)}

Ok, that looks pretty complicated. The original formula seems simpler. However, you’ll notice we can compute these sums in one pass. In SQL it’s something like:

SELECT count(*) as n
, sum(revenue) as revenue
, sum(revenue * revenue) as revenue_2
FROM user_metric_dataframe

So that’s great from a computation standpoint.
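And turning those three aggregates back into the mean and standard deviation is just a couple of lines; the function below is an illustration of plugging the one-pass sums into the reformulated formula.

// Convert the one-pass SQL aggregates (n, sum(x), sum(x*x)) into a mean and standard deviation
function summaryFromAggregates(n: number, sum: number, sumOfSquares: number) {
  const mean = sum / n;
  // Sample variance via the one-pass formula: (sum(x^2) - sum(x)^2 / n) / (n - 1)
  const variance = (sumOfSquares - (sum * sum) / n) / (n - 1);
  return { mean, sd: Math.sqrt(Math.max(variance, 0)) };
}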

Bonus material — Bayesian statistics

Perhaps you’ve heard of Bayes’ theorem before, but you’ve likely not heard of Bayesian statistics. I certainly never heard about it until I arrived at Eppo. I won’t go into the details but will try to provide a brief overview.

Let’s start with Bayes’ theorem:

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}

In Bayesian statistics, you have a belief about your population and then the observed data. Let’s simplify this to “belief” and “data” and write Bayes’ theorem slightly differently:

P(\text{belief} \mid \text{data}) = \frac{P(\text{data} \mid \text{belief})\,P(\text{belief})}{P(\text{data})}
\quad\text{i.e.}\quad
\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}

So basically you use the likelihood to update your prior, giving you the posterior probability (ignoring for a moment the normalization factor, which is generally hard to compute).

Why is this methodology potentially preferred if you have a small sample size? Because you can set your prior to be something that’s relatively informed and get tighter confidence intervals than you would with classical frequentist statistics. Referring back to the original example, you could say that you expect the relative difference between test (red) and control (blue) to be normally distributed with a standard deviation of 5% (or something like that; it’s a bit up to you to set your priors).
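As a toy illustration of that prior-plus-data update, here’s a conjugate normal/normal calculation with the likelihood variance treated as known. This is one simple modeling choice for the sketch, not necessarily what any particular platform does, and the function name is invented.

// Combine a normal prior on the relative lift with an observed lift and its standard error.
// With a normal likelihood and known variance, the posterior is also normal (conjugacy).
function normalPosterior(
  priorMean: number, priorSd: number,       // e.g., prior belief: lift ~ N(0, 0.05^2)
  observedLift: number, observedSe: number, // estimate and standard error from the experiment
) {
  const priorPrecision = 1 / (priorSd * priorSd);
  const dataPrecision = 1 / (observedSe * observedSe);
  const posteriorVariance = 1 / (priorPrecision + dataPrecision);
  const posteriorMean =
    posteriorVariance * (priorPrecision * priorMean + dataPrecision * observedLift);
  return { mean: posteriorMean, sd: Math.sqrt(posteriorVariance) };
}

// Example: a skeptical N(0, 0.05^2) prior pulls a noisy +4% observed lift (SE 3%) toward zero
// normalPosterior(0, 0.05, 0.04, 0.03) // ≈ { mean: 0.029, sd: 0.026 }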

I totally understand that’s hard to follow if you have no knowledge of Bayesian statistics. If you want to learn more, I recommend picking up a copy of the book Bayesian Statistics the Fun Way. You could also read through the sections of Eppo’s documentation on Bayesian analysis and confidence intervals.

Drawing conclusions
Drawing conclusions is the art of AB testing. Sometimes the decision is easy:
diagnostics are all green, your experiment metrics moved in the expected
direction and there were no negative impacts on guardrail metrics. However,
studies show that only around 1/3 of experiments produce positive results. A
lot of experiments might look similar to the report card below for a “New
User Onboarding” test:

Metric                                                                 Control    Treatment    Lift
Total Upgrades to Paid Plan (primary)                                  1.76       1.87         +5.90%
[Funnel] App opens to creations conversion (5 days from first event)   12.27%     10.98%       -10.51%
Total revenue                                                          36.74      37.04        +0.82%
Net subscriptions                                                      1.45       1.50         +3.57%
Total Downgrades                                                       0.32       0.35         +9.18%
3+ app opens within 2 weeks                                            0.033      0.031        -5.49%
Site creations                                                         1.10       0.99         -9.72%

Core user metrics:
Total revenue                                                          36.74      37.04        +0.82%
Site creations                                                         1.10       0.99         -9.72%
5D7 Retention App Opens                                                0.020      0.021        +2.36%

New user onboarding experiment report card

The primary metric “Total Upgrades to Paid Plan” is up ~6% while there are
some negative impacts such as “Site creations” being down ~10%. So what do
you do? Ultimately, there’s no right answer. It’s up to you and your team to
make the tough calls in situations like this.

In addition to experiment report cards, it’s important to look at experiment diagnostics to make sure the underlying data is in good shape. A very common problem with AB testing is what’s called “sample ratio mismatch” or SRM, which is just a fancy way of saying that the number of users in test and control don’t match what’s expected. For instance, you might be running a 50/50 test but your data is showing 55/45. Here’s what an SRM looks like in Eppo:

Example traffic chart for an SRM detected by Eppo

There’s also a variety of other ways your data could be off: one or more of
your metrics may not have data; there could be an SRM for a particular
dimension value; there may not be any assignments at all; there might be an
imbalance of pre-experiment data across variants; and more.
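A common way to flag an SRM is a chi-square goodness-of-fit test on the assignment counts. Here’s a rough sketch for the two-variant case; the threshold choice is an assumption (SRM alerts typically use a strict p-value like 0.001 so they only fire on real imbalances).

// Chi-square goodness-of-fit check for sample ratio mismatch with two variants
function checkSrm(controlCount: number, testCount: number, expectedControlShare = 0.5) {
  const total = controlCount + testCount;
  const expectedControl = total * expectedControlShare;
  const expectedTest = total * (1 - expectedControlShare);

  const chiSquare =
    (controlCount - expectedControl) ** 2 / expectedControl +
    (testCount - expectedTest) ** 2 / expectedTest;

  // Critical values for 1 degree of freedom: 3.84 (p = 0.05), 10.83 (p = 0.001)
  return { chiSquare, srmSuspected: chiSquare > 10.83 };
}

// Example: 55,000 vs 45,000 on a 50/50 split is a glaring SRM
// checkSrm(55000, 45000) // { chiSquare: 1000, srmSuspected: true }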

Tools like Eppo help make your life easier by providing you easy-to-understand dashboards that are refreshed nightly. So you can grab your cup of coffee, open up your experiment dashboard, check on your experiments, and potentially start making decisions (or at least monitoring to make sure you haven’t broken something).

Tools and platforms

While you might have initially thought that building an AB testing platform is relatively straightforward, I hope I’ve illustrated that doing it well is extremely challenging. From building a feature flagging tool, to constructing a metrics repository, to getting the stats right, to actually computing the results on a nightly basis, there’s a lot that goes into a robust platform. Thankfully, you don’t need to build one from scratch. There are a variety of tools and platforms that help make AB testing easier. Below is a list of some relevant ones:

Eppo

LaunchDarkly

Split

Statsig

Growthbook

Optimizely

Analyzing each of these platforms is beyond the scope of this article. Given
all the requirements for an AB testing platform outlined above, however, I
can confidently say that Eppo (even though I may be slightly biased because
I work there) is the best all-in-one platform for companies that have their
data centralized in a modern data warehouse (Snowflake, BigQuery,
Redshift, or Databricks) and are looking to run product tests on web or
mobile, including tests of ML/AI systems. Eppo provides a robust, global
feature flagging system, a fact/metric repository management layer,
advanced statistics, nightly computations of experiment results, detailed
experiment diagnostics, and a user-friendly opinionated UX that is easy to
use even for non-statisticians. If you’re just looking to run simple marketing
copy tests, then a tool like Optimizely is probably better for you though it’ll
be pretty expensive.

Recommended reading
There’s a lot out there to read about AB testing. Here are some of my
recommendations:

Trustworthy Online Controlled Experiments

Evan Miller’s blog

Bayesian Statistics the Fun Way

Eppo’s blog

Optimizely’s blog

Udacity’s AB testing course (not technically a read, but still useful)

And that’s a wrap. Thanks for sticking around folks!


Written by Jonathan Fulton

Engineering at Eppo. Formerly SVP Product & Engineering at Storyblocks, McKinsey consultant, software engineer at APT. Catholic, husband, father of three.
