Bayesian AB Testing for Business Decisions
arXiv:2003.02769v1 [stat.AP] 5 Mar 2020

Abstract—Controlled experiments (A/B tests or randomized field experiments) are the de facto standard to make data-driven business decisions. The methodology to analyze such experiments should be easily understandable to stakeholders like product and marketing managers. Bayesian inference has recently gained a lot of popularity and, in terms of A/B testing, one key argument is its easy interpretability. For stakeholders, "probability to be best" (with corresponding credible intervals) provides a natural metric to make business decisions. In this paper, we motivate the quintessential questions a business owner typically has and how to answer them with a Bayesian approach. We present three experiment scenarios that are common in our company, how they are modeled in a Bayesian fashion, and how to use the models to draw business decisions. For each of the scenarios, we present a real-world experiment, the results, and the final business decisions drawn.

Index Terms—randomized experiment, A/B testing, Bayesian inference

I. INTRODUCTION

Controlled experiments (A/B tests or randomized field experiments) are the de facto standard to make data-driven decisions and to drive innovation by making it possible to evaluate new ideas cheaply and quickly. Online experiments are widely used to evaluate the impact of changes made in marketing campaigns, software products, or websites. These days, basically every large technology company continuously conducts experiments: Facebook for user interface changes in their News Feed [1]; Google states that "experimentation is practically a mantra; we evaluate almost every change that potentially affects what our users experience." [2]; and Microsoft for optimizing Bing ads [3]. In all cases, the key reason to conduct controlled experiments is the possibility to establish a causal relationship (with high probability) between the changes made and the observed differences in the measures of interest: external effects such as seasonality or moves by the competition occur in both control and treatment. This means that the difference in the measures can be evaluated with a statistical test in terms of Frequentist inference or, as we motivate here, a model-driven approach in terms of Bayesian inference.

In detail, we present a Bayesian approach to answer the quintessential questions that typically arise in the mind of a business owner:

1) What is the probability that variant B is better than variant A?
2) How much uplift in conversions will variant B bring?
3) What is the risk of switching to variant B from variant A?

Answers of this form are natural to stakeholders and, consequently, it is easier for them to make decisions when implementing changes and observing customer responses. In our company, we are continuously running A/B tests in marketing, website, and product development. In this paper we show the different experimentation scenarios and how we model them in terms of Bayesian inference; the paper is organized as follows. Section II discusses the general structure of experiments and the main business questions to be answered with them. Section III reviews the basics of Bayesian inference and introduces relevant concepts and distributions needed in this paper. Section IV introduces the three business scenarios we tackle and the corresponding models. Section V describes the methodology to draw business decisions based on the introduced models. Section VI presents three real-world cases where we used this methodology to come to a business decision. We provide empirical verification of the correctness by comparing the results to a state-of-the-art commercial experimentation tool. Section VII concludes this paper with a summary and the discussion of future work. The open source implementation of the methodology is freely available to the community from https://github.com/Avira/prayas.

II. EXPERIMENTS

[Fig. 1: visitors are split 50% / 50% between the variants; for each variant (e.g., Variant A) the measure of interest (e.g., a conversion rate of 8%) is observed.]

Fig. 2. (left) Probability mass function of the Binomial distribution Bin(n, p) for different values of n and p (Bin(100, 0.2), Bin(100, 0.5), Bin(70, 0.8)). This distribution is a discrete distribution; the line between the points is just to increase readability. (right) Probability density function of the Beta distribution Beta(a, b) for different values of a and b (Beta(1, 1), Beta(15, 100), Beta(20, 15)).
IV. EXPERIMENT SCENARIOS AND MODELS

In our company, the majority of experiments fall into one of three scenarios: 1) compare variants where the visitor has one option to choose, 2) compare variants where the visitor has multiple options to choose from, and 3) compare variants where the visitor has multiple options to choose from but we only observe the overall success. In the following we describe a model for each scenario.

A. One option model

Fig. 3. Example of one variant in case of a one option model experiment. The website visitor has only the option to purchase one product. The purchase of the presented product is the event of interest for this variant.

The first model we describe is applicable in the basic scenario of experiments described in Section II and illustrated in Fig. 1; Fig. 3 shows a real-world variant of one of our experiments where the visitor has only the option of purchasing one product. One option scenarios are formalized as follows.

The experiment consists of variants i with i = 1, . . . , V, where V is the total number of variants. Each variant displays only one option to the visitor. For a given variant i, Ni is the total number of visitors that saw variant i (trials), and Ci the number of conversions (successes). Success is defined by the occurrence of the event of interest, for example, the visitor clicks a button or purchases a product. A success also has an associated value vi, such as revenue or customer lifetime value, which can be different for every variant. In a frequentist approach, the conversion rate λi is the point estimate Ci / Ni.

The prior distribution of the conversion rate is modeled as

P(λi) = Beta(ai, bi),    (3)

with the hyperparameters ai and bi defined in a way to represent our prior belief. Fig. 2 (right) shows different prior beliefs about the conversion rate: with ai = bi = 1 we have an uninformed prior and all conversion rates are equally likely; with ai = 15 and bi = 100 we define a prior where the conversion rate is around 0.125 with small variance, meaning that we have strong prior knowledge; and with ai = 20 and bi = 15 the conversion rate is around 0.6 with wider variance, indicating that we have only weaker knowledge about the conversion rate.

To compute the posterior, we leverage the fact that the conjugate prior for a distribution belonging to the binomial family is in the beta family, and therefore the posterior distribution is the same as the prior distribution with updated parameter values [7]:

P(λi | Xi) = Beta(ai + Ci, bi + (Ni − Ci))    (4)

The posterior P(λi | Xi) gives the probability distribution of all possible values of λi given the evidence Xi. We can draw n random samples from the posterior to obtain a set Yi = [yi1, . . . , yin] of possible values the conversion rate λi can take. The possible values for the revenue (or any other value vi > 0) per visitor are defined as Zi = Yi · vi. There are cases where businesses incur a loss for each non-conversion. One example is the pay-per-click model followed in the online advertising industry, where businesses must pay the ad platform for each click generated regardless of conversion. The above model can be adjusted to penalize the loss li of each non-conversion as Zi = Yi · (vi − li) + (1 − Yi) · (−li).
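The posterior of Eq. (4) and the value samples Zi can be sketched in a few lines; the function and parameter names below are our own illustration (not the interface of the referenced implementation), assuming NumPy:

```python
import numpy as np

def one_option_posterior(a, b, conversions, visitors, n=100_000,
                         value=1.0, loss=0.0, seed=0):
    """Beta posterior for the one option model: Beta(a + C, b + (N - C)).

    Returns n samples of the conversion rate (Y) and of the value per
    visitor (Z), optionally penalizing each non-conversion by `loss`.
    """
    rng = np.random.default_rng(seed)
    y = rng.beta(a + conversions, b + (visitors - conversions), size=n)
    z = y * (value - loss) + (1 - y) * (-loss)  # reduces to y * value when loss == 0
    return y, z

# Uninformed prior Beta(1, 1), 13 conversions from 100 visitors:
y, z = one_option_posterior(1, 1, conversions=13, visitors=100)
```

The returned arrays are posterior draws, so any downstream metric (mean, credible interval) is a plain summary over them.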
a) Simulation of the Bayesian updating: To illustrate the Bayesian updating from prior to posterior when gathering new data, we simulate a one option experiment and take a closer look at Variant 1. For the data X1 we draw N1 = 100 samples from a binomial distribution with the conversion rate λ1 = 0.5. Starting from the priors shown in Fig. 2 (right), we look at the posteriors Y1 after seeing the first 10, 50, and all 100 samples. Fig. 4 shows the simulation. In the first row, we start with an uninformative prior, and after seeing all data the posterior is near the true conversion rate. In the second row, we start with strong prior knowledge of the conversion rate being around 0.125; here 100 samples are not enough to estimate the true conversion rate. In the third row, we start with weaker prior knowledge of the conversion rate being around 0.6, and in this case, the posterior is near the true conversion rate after seeing all samples.

Fig. 4. Illustration of the Bayesian updating. For simulated data, the posterior is shown for the three priors defined in Fig. 2 (right) (Beta(1, 1), Beta(15, 100), Beta(20, 15)) after seeing 10, 50, and 100 samples; column 0 shows the corresponding prior. The dashed vertical line indicates the true conversion rate 0.5.
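The simulation behind Fig. 4 can be reproduced along these lines; a minimal sketch using the prior parameters and sample sizes stated above (the random seed is our own choice):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.binomial(1, 0.5, size=100)          # N1 = 100 visitors, true rate 0.5
priors = {"Beta(1, 1)": (1, 1), "Beta(15, 100)": (15, 100), "Beta(20, 15)": (20, 15)}

for name, (a, b) in priors.items():
    for seen in (10, 50, 100):                 # posterior after 10, 50, all 100 samples
        c = data[:seen].sum()                  # conversions among the first `seen` visitors
        post_mean = (a + c) / (a + b + seen)   # mean of Beta(a + C, b + (N - C))
        print(f"{name}, after {seen} samples: posterior mean {post_mean:.3f}")
```

With the uninformed prior the posterior mean approaches the true rate as more data is seen, while the strong Beta(15, 100) prior still dominates after 100 samples, matching the rows of Fig. 4.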
B. Multi-options model

Fig. 5. Example of one variant in case of the multi-options model. The website visitor has multiple options to select, in this case to purchase three different products. The purchases of the presented products are the events of interest for this variant.

The second model we describe is the generalization of Scenario 1. Scenario 1 assumes that there can be only one possible event of interest in each variant. However, in the real world, this is often too strict a restriction. For example, many online websites showcase multiple product purchase buttons on a single page. Fig. 5 shows a variant of one of our experiments where visitors can choose from three different product options. Multi-options scenarios are formalized as follows.

Each variant i displays Ki different options, from which the visitor can choose either one of the options or none. The number of conversions is given by Ci = [ci^1, . . . , ci^Ki], the successes for each option per variant. The value for each success is vi = [vi^1, . . . , vi^Ki], since each of the multiple options can have a different value (e.g., due to different prices per product). The conversion rate is defined as λi = [λi^1, . . . , λi^(Ki+1)], with λi^1, . . . , λi^Ki the probabilities of choosing the individual Ki options and λi^(Ki+1) the probability of choosing none of the options. Since each visitor can only choose one of these Ki + 1 options, they are mutually exclusive and therefore for every variant i it holds that Σ_{l=1}^{Ki+1} λi^l = 1.

Executing the experiment results in collected data for each variant, Xi = {xi1, . . . , xiNi} with xij = [xij^1, . . . , xij^Ki] and xij^l ∈ {0, 1} indicating which option the visitor chose. Since we consider multiple events within a variant, the sequence of conversions follows a multinomial distribution and the generative process of the data is [8]:

P(Xi | λi) = Mult(Ni, λi)    (5)

The prior distribution of the conversion rate is modeled as

P(λi) = Dir(a1, . . . , aKi+1),    (6)

where a· are the hyperparameters.

Once again, leveraging the conjugacy relationship between the multinomial and Dirichlet distributions, we compute the posterior distribution with updated parameter values as [7]:

P(λi | Xi) = Dir(a1 + ci^1, . . . , aKi + ci^Ki, aKi+1 + (Ni − (ci^1 + . . . + ci^Ki)))    (7)

From this posterior distribution of conversion rates, we can draw n random samples Yi = [yi1, . . . , yin], where each element yij is the overall conversion rate over all options available in Variant i. Similarly to the one option model, we can penalize the non-conversions, and we can compute a revenue-based metric as the sum of the element-wise multiplication of the values vi > 0 and yij.
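A sketch of the Dirichlet posterior in Eq. (7) and the overall conversion rate per draw; the names below are illustrative, not taken from the reference implementation:

```python
import numpy as np

def multi_options_posterior(prior, conversions, visitors, n=100_000, seed=0):
    """Dirichlet posterior for the multi-options model.

    `prior` has K + 1 entries (one per option plus "no conversion");
    `conversions` has K entries. Returns n samples of the overall
    conversion rate, i.e. 1 minus the "no conversion" probability.
    """
    rng = np.random.default_rng(seed)
    counts = np.append(conversions, visitors - np.sum(conversions))
    samples = rng.dirichlet(np.asarray(prior) + counts, size=n)
    return 1.0 - samples[:, -1]  # overall conversion rate per posterior draw

# Three options, uninformed prior Dir(1, 1, 1, 1); 5 + 3 + 2 conversions
# from 100 visitors:
y = multi_options_posterior([1, 1, 1, 1], [5, 3, 2], visitors=100)
```

A revenue-based metric would instead multiply the per-option columns of `samples` element-wise with the option values before summing, as described above.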
C. Aggregated model

This scenario is a special case of Scenarios 1 and 2 where only the aggregated revenue and the sum of conversions per variant are observed. This can occur because of the data generation process or simply because there are too many options within a variant, making it cumbersome to model under Scenario 2. Some of our experiments have up to 81 different options the visitor can choose from, due to the different products listed together with the options for license runtimes and the number of devices on which the product can be installed. This aggregated scenario is formalized as follows.

For a given variant i, there are Ki unknown options. Ni is the number of visitors and Ci is the number of conversions, defined as in the multi-options model. The individual revenue per option vi = [vi^1, . . . , vi^Ki] is unknown. Executing the experiment results in collected data Ci, Ni and the aggregated revenue si over all Ci successes and implicitly over all unknown Ki options, i.e., si = Σ_{j=1}^{Ci} sij with sij the revenue of each success j.

The posterior for the conversion rate λi can be obtained by following the one option model and the therein defined posterior in Equation 4. To get an estimate of the revenue per visitor, we model the average revenue per visitor v̄i given the observed aggregated revenue si. For that we assume that sij follows an exponential distribution

P(sij | v̄i) = Exp(v̄i),    (8)

where v̄i is the scale parameter of the distribution [9]. The exponential distribution means that lower revenue has more probability of occurrence than higher revenue. This assumption fits our observation of the visitors' money spent curve on our website. The prior distribution of v̄i is modeled as

P(v̄i) = Gamma(αi, βi)    (9)

Taking advantage of the conjugacy relationship between the exponential and gamma distributions, we compute the posterior distribution as

P(v̄i | si) = Gamma(αi + Ci, βi + si)    (10)

The expected value per visitor per variant is λi · v̄i. Since we have the posterior distributions for both λi and v̄i, we draw n random samples from P(λi | Xi) for the conversion rate Yi, and n random samples from P(v̄i | si) for the revenue.
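The aggregated model can be sketched as follows. This is one reading of Eqs. (8)-(10): the conjugate Gamma posterior is over the exponential rate 1/v̄i, so the draws are inverted to obtain the mean revenue per conversion; the function name and example numbers are our own illustration:

```python
import numpy as np

def aggregated_posterior(a, b, alpha, beta, conversions, visitors, revenue,
                         n=100_000, seed=0):
    """Aggregated model: Beta posterior for the conversion rate (Eq. 4) and
    Gamma posterior for the exponential revenue parameter (Eq. 10)."""
    rng = np.random.default_rng(seed)
    y = rng.beta(a + conversions, b + (visitors - conversions), size=n)
    # Gamma(alpha + C, beta + s) over the rate 1 / v̄; numpy's gamma takes
    # shape and scale, so the rate parameter beta + s enters as its inverse.
    rate = rng.gamma(alpha + conversions, 1.0 / (beta + revenue), size=n)
    vbar = 1.0 / rate   # mean revenue per conversion
    return y * vbar     # samples of the expected value per visitor

# 20 conversions from 400 visitors with 500 units of aggregated revenue,
# uninformed Beta(1, 1) prior and a weak Gamma(1, 1) prior:
z = aggregated_posterior(1, 1, 1, 1, conversions=20, visitors=400, revenue=500)
```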
V. DECISION MAKING

Every experiment has one measure of interest defined beforehand, such as conversion rate or revenue per visitor, on which the variants are evaluated. In Section IV, we described a way of modelling the measures of interest and obtaining samples from the posterior distribution. Here, we describe the method of comparing different posterior samples in order to answer the motivating questions from Section II.

B. Expected uplift

The second motivating question we want to answer, once we have the best variant, is to determine the increase in measure we expect after its implementation. Given the sets of posterior samples Yi, the expected uplift of choosing Y1 over Y2 is defined as the mean (or the credible interval) of the percentage increase:

U(Y1, Y2) = [(y1j − y2j) / y2j | j ∈ {1, . . . , n}]    (12)

As we are interested in the expected uplift compared to the control variant, we typically only compute this for all treatment variants against the control variant. However, the equation can be extended to compare every variant Yi against each other.

C. Expected loss

Suppose Variant 1 has the highest probability to be the best, but smaller than 1. Then, there is still a chance that the other variant is the true best performing one. In such a case we want to know the risk of implementing Variant 1. Given the sets of posterior samples Yi, the expected loss when choosing Y1 over Y2 is the mean (or the credible interval) of:

L(Y1, Y2) = [max((y2j − y1j) / y1j, 0) | j ∈ {1, . . . , n}]    (13)

Similar to the expected uplift, the expected loss is typically calculated between the treatment variants and the control.

VI. BUSINESS CASES

In this section we illustrate the application of Bayesian inference and decision making while experimenting with different discounts on our product prices.

A. Single product discount test
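The decision metrics operate directly on posterior samples; a sketch of Eqs. (12) and (13) together with the probability to be best (computed as the fraction of paired posterior draws in which a variant is largest), with helper names and example counts that are our own illustration:

```python
import numpy as np

def prob_best(y1, y2):
    """Probability to be best: fraction of paired draws where variant 1 wins."""
    return float(np.mean(y1 > y2))

def expected_uplift(y1, y2):
    """Eq. (12): mean percentage increase of choosing Y1 over Y2."""
    return float(np.mean((y1 - y2) / y2))

def expected_loss(y1, y2):
    """Eq. (13): mean foregone improvement when Y1 is chosen but Y2 is better."""
    return float(np.mean(np.maximum((y2 - y1) / y1, 0.0)))

# Posterior conversion-rate samples for a control and a treatment variant
# (uninformed priors, 100 vs. 120 conversions from 1000 visitors each):
rng = np.random.default_rng(0)
n = 100_000
y_control = rng.beta(1 + 100, 1 + 900, size=n)
y_treatment = rng.beta(1 + 120, 1 + 880, size=n)

pb = prob_best(y_treatment, y_control)
up = expected_uplift(y_treatment, y_control)
lo = expected_loss(y_treatment, y_control)
print(f"P(best) = {pb:.3f}, uplift = {up:.1%}, expected loss = {lo:.2%}")
```

Credible intervals for uplift and loss follow from quantiles of the same per-draw arrays instead of their means.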