Professional Documents
Culture Documents
APPENDIX
2
Metrics and the Statistics
behind A/B Testing
By Bryan Gumm
A s online marketers and product managers, we can choose to
optimize our users experience along several different metrics.
For example, a product manager of a subscription service might be
interested in optimizing retention rate (percent), and an online
marketer of an e-commerce site might focus on optimizing average
order value ($). While each of these is obviously valid, the statistics
behind A/B testing are slightly different for each. Before delving
into the nuances of each, we ll introduce a few core concepts.
p1 p (E.1)
p 1 96
n
where:
p1 p
p 1 96
n
0 509 1 0 509
0 509 1 96
1 000
0 509 0 031
Appendix 2 183
s2 (E.2)
x 1 96
n
where:
n 2
xi x (E.3)
s2 i 1
n 1
This formula says:
If your data is in Excel, you can compute the sample mean and
variance by using the average( ) and var( ) formulas, respectively.
184 APPENDIX 2
The curious reader might ask why the con dence inter-
val formula used when measuring an average is different from
the formula used when measuring a percentage. In fact,
they re exactly the same! To see this for yourself, try simpli-
fying the variance term (s2 ) when every data point (xi ) is
known to be either 1 or 0, which is the case when measuring
a percentage. (Hint: When xi is either 1 or 0, note that
xi 2 xi .)
For both the average age and the percentage of degree
holders measurements, there are three conditions that result in
a large con dence interval:
sp2 p1 p (E.4)
0.25
Sample Variance
0.20
0.15
0.10
0.05
0.00
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Percentage Point Estimate Value
b 1 (E.5)
2
Appendix 2 187
1 0 95
0 05
0 05
b 1 0 975
2
pB pA
Z
1 1 (E.6)
p1 p
nB nA
Appendix 2 189
Observed Differences
Test Statistic (E.7)
Standard Error
XB XA
p (E.8)
nB nA
nB 49 981 nA 50 332
XB 42 551 X A 42 480
pB 0 851 pA 0 844
We rst compute p.
42 511 42 480
p
49 981 50 332
85 031
p
100 313
p 0 8477
190 APPENDIX 2
0 851 0 844
Z
1 1
0 8477 1 0 8477
50 332 49 981
Z 3 09
Note that if you were following along on your own, you may
have ended up with a Z-value of 3.24. The difference is due to
rounding.
Based on this test statistic, we can compute the probability
that our observed difference ( 0.007) is due to random chance.
This value, called the p-value, is the area underneath the standard
normal (Z) distribution before or after a certain point. If the value
of Z is negative, we compute the value up to that point, and
multiply it by 2. If the value of Z is positive, we compute the value
after that point, and then multiply it by 2. The multiplication by 2
is the application of what is called a two-tailed test, which will be
clari ed later.
In Excel, it is simple to produce the two-tailed p-value using
this formula: NORMSDIST(-ABS(Z)) 2. For our value of 3.09,
this results in a p-value of 0.0021. This means that there is only a
0.2 percent chance that the difference we observed is due to
random chance. If not due to random chance, it must be due to the
differential experience. Put another way, there is a 99.8 percent
chance that version B increases conversion above version A. This
is the number many statistical packages, including Optimizely,
publish as the chance to beat original metric (though Opti-
mizely uses a one-tailed test).
The difference between a one- and two-tailed p-value is that
in the latter, we are curious to know whether the two metrics
Appendix 2 191
pA pB
Z
1 1 (E.9)
p1 p
nB nA
xB xA
t
1 1 (E.10)
sp
nB nA
where:
nB 1 s2B nA 1 s2A
sp (E.11)
nB nA 2
df nB nA 2 (E.12)
Appendix 2 193