Calculating Sample Sizes for A/B Tests to Achieve Desired Power

On planning an A/B test to achieve a certain power
These calculations show you the minimum sample size (N) needed for a test to have a certain amount of power. N is the total number of users, the sum of those in the treatment and the control. The power of a test is its ability to "see" or detect an effect of a certain size. For example, a test may have 80% power to detect an effect of 2%. This means that if the actual treatment mean is 2% larger or smaller than the control mean, the null hypothesis of no difference has an 80% chance of being rejected (assuming alpha=0.05). Of course, if the true means are more than 2% apart then it is even more likely the null hypothesis will be rejected.
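As a rough illustration of what "80% power" means, the sketch below approximates the power of a two-sided test on two equal-sized groups using the normal approximation. The function name approx_power and its parameters are hypothetical, and the inputs (StdDev = 5.7, Delta = 0.023, N = 1,965,369) are taken from the worked example that appears in this document for non-binary metrics. This is an approximation under those assumptions, not the spreadsheet's own formula.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_total, std_dev, delta, alpha=0.05):
    """Approximate power of a two-sided z-test with two equal-sized groups."""
    z = NormalDist()  # standard normal distribution
    n_per_group = n_total / 2
    # Standard error of the difference between the two group means.
    std_err = std_dev * sqrt(2 / n_per_group)
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(delta) / std_err
    # Chance of rejecting in either tail when the true difference is delta.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


power = approx_power(n_total=1_965_369, std_dev=5.7, delta=0.023)
print(f"approximate power: {power:.3f}")
```

With these inputs the result comes out slightly above 0.80, and passing a larger delta with the same N gives a higher value, which matches the statement that larger true differences are even more likely to be detected.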
When you are planning an experiment or parallel flight, there are two possible scenarios. Your objective may be to launch the change if 1) the treatment is significantly better than the control or 2) the treatment is not significantly worse than the control (e.g. for software updates or strategic changes). In the first case you may need to "find" a positive effect as small as 1% in your OEC, or primary metric, before you can launch the feature. In the second case you should set up a "no-go" decision point such that if the treatment mean is less than, say, 0.98*control mean, the feature should not be launched. In the first, you want a high probability of detecting a Delta of 1%; in the second, of 2%.
Power calculations are most useful in the planning stage - prior to starting an experiment. They should be used cautiously after an experiment is complete. Also note that power calculations are necessarily approximations. The better your estimate of standard deviation or (in the case of a binary metric) proportion, the better the approximation.
In order to use the spreadsheets, enter values in the appropriate yellow cells. If you have any questions on the use or interpretation of this spreadsheet, please contact rogerlon@microsoft.com. For future updates to this calculator and other experimentation tools check out
http://exp-platform.com/tools.aspx
On using this calculator for A/B/C or MVT experiments
One note: these calculators are set up for simple A/B tests but can be used for one factor with many treatments or for MVTs. Just be aware that the calculator in those cases is giving you the sample size needed for a "head-to-head" comparison of a treatment to the control. For example, if you have a control and two treatments, with each group receiving 1/3 of the total population, the calculator will give you the sample size needed for the control plus one treatment. For an MVT it is highly recommended that the allocation of experimental units to the groups (treatments) for each factor be equal to get sufficient power for each treatment but, more importantly, for any interactions you are interested in. If you have some factors with more treatments than others, do your power calculations for the factors with the most variants; then the other factors will have sufficient power (provided all factors have an equal allocation to all treatment groups).
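Since the calculator's N covers only the control plus one treatment, one way to read the A/B/C note is that the experiment as a whole needs N/2 users in every equal-sized group. The sketch below applies that reading. The helper name total_users_for_equal_groups is hypothetical, the scaling rule is an interpretation of the note rather than something the spreadsheet computes, and the input N is the 80% power figure from the document's non-binary worked example.

```python
from math import ceil


def total_users_for_equal_groups(head_to_head_n, num_groups):
    """Scale a control-plus-one-treatment total N to num_groups equal groups.

    num_groups counts the control as one of the groups.
    """
    per_group = head_to_head_n / 2  # the calculator's N is split across two groups
    return ceil(per_group * num_groups)


# Control plus two treatments, each receiving 1/3 of the traffic.
print(total_users_for_equal_groups(1_965_369, 3))  # 2948054
```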
Calculations of power for OEC metrics that take more than two values.
The assumption is that the sample size will be large enough for the central limit theorem to hold - which will be the case for all but the smallest sample sizes for online experiments.
Two alternatives to input of information - enter either the percent change (delta) or the actual change you want to be able to detect. In either case, the size of the treatment and control groups may be the same or different, the two cases below.

A. Enter Percent Change (i.e. what percent change from the current average)
Case I: Assume the treatment and control have approximately the same number of observations.
Assuming 2 groups (T/C) with approximately the same number in each group. Also assume a hypothesis test with a 5% Type I error rate.
Total sample size needed for given values of Average, StdDev and Pct change (Note: Delta and StdDev are in the original units of the metric)
Input values in the yellow cells: Average = 2.3, D as Pct = 1%, StdDev = 5.7 => Delta = 0.023
Power    80%          90%          97.5%
N        1,965,369    2,579,546    3,807,902
N is total sample size, so split between treatment and control.
Case II: Assume the smaller group has a percent (q) of the total observations.
Total sample size needed for given Average, StdDev, Pct change and q
Input values in the yellow cells: Average = 2.3, D as Pct = 1%, StdDev = 5.7, q = 10% => Delta = 0.023
Power    80%          90%          97.5%
N        5,459,357    7,165,406    10,577,505
N is total sample size, so split between treatment and control.

B. Enter Actual Change (i.e. what change from the current average in the original metric)
Case I: Assume the treatment and control have approximately the same number of observations.
Assuming 2 groups (T/C), same number in each group, alpha=.05 for t-test for comparison of two means
Total sample size needed for given values of StdDev and Delta (Note: Delta and StdDev are in the original units of the metric)
StdDev = 5.7, Delta = 0.023
Power    80%          90%          97.5%
N        1,965,369    2,579,546    3,807,902
N is total sample size, so split between treatment and control.
Case II: Assume the smaller group has a percent (q) of the total observations.
Total sample size needed for given values of StdDev, Delta and q
StdDev = 5.7, Delta = 0.023, q = 10%
Power    80%          90%          97.5%
N        5,459,357    7,165,406    10,577,505
N is total sample size, so split between treatment and control.
Note: since the mean is not specified under B, the percent change for a specified delta cannot be computed from the information given.
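The tabulated N values for non-binary metrics can be reproduced with a simple rule of thumb, sketched below. The per-group multipliers 16, 21 and 31 are inferred from the table values and are not stated in the source; they are close to 2*(z_alpha/2 + z_beta)^2 for alpha = 0.05. The function and variable names are hypothetical. With q as the share of users in the smaller group, the total is N = c * StdDev^2 / (2 * q * (1 - q) * Delta^2), which reduces to 2 * c * StdDev^2 / Delta^2 for an even split.

```python
# Per-group multipliers inferred from the table (an assumption, not from the source).
MULTIPLIER = {0.80: 16, 0.90: 21, 0.975: 31}


def delta_from_percent(average, pct_change):
    """Convert a percent change (0.01 for 1%) into the metric's original units."""
    return average * pct_change


def total_sample_size(std_dev, delta, power=0.80, q=0.5):
    """Total N across both groups; q is the share of users in the smaller group."""
    if not 0.0 < q <= 0.5:
        raise ValueError("q must be in (0, 0.5]")
    c = MULTIPLIER[power]
    variance_ratio = (std_dev / delta) ** 2
    return round(c * variance_ratio / (2 * q * (1 - q)))


delta = delta_from_percent(average=2.3, pct_change=0.01)  # about 0.023
for power in (0.80, 0.90, 0.975):
    equal_split = total_sample_size(5.7, delta, power)
    ten_percent = total_sample_size(5.7, delta, power, q=0.10)
    print(f"power {power:.1%}: N = {equal_split:,} (even split), {ten_percent:,} (q = 10%)")
```

Entering the percent change (option A) and entering the actual change (option B) lead to the same N here because the percent is converted to Delta before the calculation.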
Calculations of power for OEC metrics that only take two values (e.g. 0 or 1)
The assumption is that the sample size will be large enough for the central limit theorem to hold - which will be the case for all but the smallest sample sizes for online experiments.
Two alternatives to input of information - enter either the percent change (delta) or the actual change you want to be able to detect. In either case, the size of the treatment and control groups may be the same or different, the two cases below. The average value for this type of metric is the proportion of 0s (or 1s) that occur in the control or treatment. This could be a conversion rate, for example.

A. Enter Percent Change (i.e. what percent change from the current proportion of 0s or 1s)
Case I: Assume the treatment and control have approximately the same number of observations.
Assuming 2 groups (T/C) with approximately the same number in each group. Also assume a hypothesis test with a 5% Type I error rate.
Let p = the proportion of ones in the control (e.g. conversion rate). Values of 0.0 and 1.0 are not permitted.
Total sample size needed for given values of p and Delta (Note: Delta is the change from the existing proportion you need to detect)
Input values in the yellow cells: p = 0.40 (standard deviation sqrt(p*(1-p)) = 0.489898), D as Pct = 0.50% => Delta = 0.002
Power    80%          90%          97.5%
N        1,920,000    2,520,000    3,720,000
N is total sample size, so split between treatment and control.
Case II: Assume the smaller group has a percent (q) of the total observations.
Total sample size needed for given values of p, Delta and q
Input values in the yellow cells: p = 0.40 (standard deviation 0.489898), D as Pct = 0.50%, q = 20% => Delta = 0.002
Power    80%          90%          97.5%
N        3,000,000    3,937,500    5,812,500
N is total sample size, so split between treatment and control.

B. Enter Actual Change (i.e. change from the current proportion)
Case I: Assume the treatment and control have approximately the same number of observations.
Assuming 2 groups (T/C), same number in each group, alpha=.05 for t-test for comparison of two means
Let p = the percent of one value to the total in the control (e.g. percent of users that return in a certain time period). Values of 0% and 100% are not permitted.
Total sample size needed for given values of p and Delta (Note: Delta is the change from the existing percent you need to detect)
Input values in the yellow cells: p = 40% (standard deviation 0.489898), Delta = 0.2% => D as Pct = 0.50%
Power    80%          90%          97.5%
N        1,920,000    2,520,000    3,720,000
N is total sample size, so split between treatment and control.
Case II: Assume the smaller group has a percent (q) of the total observations.
Total sample size needed for given values of p, Delta and q
p = 40% (standard deviation 0.489898), Delta = 0.2%, q = 20% => D as Pct = 0.50%
Power    80%          90%          97.5%
N        3,000,000    3,937,500    5,812,500
N is total sample size, so split between treatment and control.
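For a two-valued metric the same rule of thumb appears to apply with the variance replaced by p*(1-p); the sketch below reproduces the tabulated values for p = 0.40 on that assumption. As before, the multipliers 16, 21 and 31 are inferred from the table rather than stated in the source, and the function name is hypothetical. Delta is an absolute change in the proportion (0.002 for a 0.50% relative change at p = 0.40).

```python
from math import sqrt

# Per-group multipliers inferred from the table (an assumption, not from the source).
MULTIPLIER = {0.80: 16, 0.90: 21, 0.975: 31}


def binary_total_sample_size(p, delta, power=0.80, q=0.5):
    """Total N across both groups for a 0/1 metric with control proportion p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < q <= 0.5:
        raise ValueError("q must be in (0, 0.5]")
    variance = p * (1 - p)  # variance of a single 0/1 observation
    return round(MULTIPLIER[power] * variance / (2 * q * (1 - q) * delta ** 2))


p = 0.40
delta = p * 0.005  # 0.50% relative change, i.e. 0.002 in absolute terms
print(f"standard deviation: {sqrt(p * (1 - p)):.6f}")
for power in (0.80, 0.90, 0.975):
    equal_split = binary_total_sample_size(p, delta, power)
    twenty_percent = binary_total_sample_size(p, delta, power, q=0.20)
    print(f"power {power:.1%}: N = {equal_split:,} (even split), {twenty_percent:,} (q = 20%)")
```

The range check on p mirrors the spreadsheet's rule that values of 0.0 and 1.0 are not permitted, since the variance would be zero at either extreme.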
