Vol. 6, 2021-09

Optimal Significance Level and Sample Size in Hypothesis Testing. 4. Tests of Proportions

Hugo Hernandez
ForsChem Research, 050030 Medellin, Colombia
hugo.hernandez@forschem.org

doi: 10.13140/RG.2.2.20659.14883

Abstract

The optimal analysis of statistical tests based on the standardized decision values is presented
and discussed for testing proportions. The statistical analysis of proportions in binomial
distributions is challenging for two main reasons: 1) The variance of the distribution is a
function of the mean value, and 2) The proportion values are limited to the range [0,1]. Due to
the first issue, a universal default test resolution value cannot be used. Of course, a universal
default significance level is not recommended either. Thus, the test resolution must be
determined for each particular problem, depending on the size of the sample and on the value
of the proportion tested. For large samples, the minimum test resolution decreases but the
total test error increases beyond the viability limit of 50%. Instead of large samples, multiple
sub-groups of observations can be used in order to obtain a reliable conclusion. On the other
hand, when extreme probabilities are tested (close to 0 or 1) the normal approximation may
lead to out-of-range values. This issue can be solved by considering a logistic transformation of
the proportion values. Different examples of application are included to illustrate the optimal
analysis of proportions.

Keywords

ANOVA, Binomial Distribution, Decision Value, D-value, Hypothesis Testing, P-value, Resolution,
Sample Size, Significance Level, Statistical Tests, Test of Proportions, T-Test, Z-Test

1. Introduction

This paper is part of a series of reports discussing the optimal significance level and sample size
in statistical tests of hypotheses. The first part of this series [1] introduced the optimization
problems for the sample size (test cost minimization problem) and for the significance level

15/06/2021 ForsChem Research Reports Vol. 6, 2021-09 (1 / 20)


www.forschem.org
(total test error minimization problem) applied to the tests of means: Z-test and T-test. The
concept of the decision value (D-value) was also presented as a more robust alternative to the
classical probability value (P-value). In the second part [2], a similar approach was used for the
tests of variances: $\chi^2$-test and F-test. The third report [3] discussed the effects of sample size
on the statistical tests of hypotheses and introduced some corrections for the evaluation of
large samples. Particularly, the standardized decision value ($D_s$-value) was introduced. In this
report, the optimal analysis of tests of proportions in binomial distributions is considered.

The binomial distribution evaluates the probability of $x$ successes in a sample of $n$ independent
experiments, considering a constant probability of success $p$ (also denoted as the true
proportion of the population). The binomial distribution is discrete because only a non-negative
integer value of successes can be obtained. For the same reason, the proportion of success in
the sample ($\hat{p}$) is also a discrete variable even though it is described by real values in the range
of $[0,1]$. On the other hand, if we assume that $n$ represents the sample size obtained from an
infinite population, the true proportion of the population $p$ is a continuous variable taking any
real value in the range $[0,1]$.

The probability distribution of the possible number of successful events ($x$) is [4]:

$$P(x) = \frac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{n-x} \tag{1.1}$$

where $x \in \{0, 1, \dots, n\}$. In addition, the binomial distribution can be approximated by continuous
probability distribution functions including the normal distribution [5]:

$$f(x) = \frac{1}{\sqrt{2\pi n p (1-p)}} \exp\left(-\frac{(x-np)^2}{2 n p (1-p)}\right) \tag{1.2}$$

where $x \in \mathbb{R}$. Notice that the continuous approximation implies a fictitious continuous and
unbounded variable $x$. However, it provides a very good description of the binomial
distribution.
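As a rough numerical illustration of how closely the normal density follows the binomial probabilities, the following sketch evaluates both for one case. The values of `n` and `p` are illustrative assumptions and are not taken from the text.

```python
import math

def binomial_pmf(x, n, p):
    """Exact binomial probability of x successes in n trials."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def normal_approx_pdf(x, n, p):
    """Normal density with mean n*p and variance n*p*(1-p)."""
    var = n * p * (1 - p)
    return math.exp(-(x - n * p)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

n, p = 100, 0.3  # illustrative values only
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
max_gap = max(abs(pmf[x] - normal_approx_pdf(x, n, p)) for x in range(n + 1))
print(f"sum of pmf = {sum(pmf):.6f}, largest pointwise gap = {max_gap:.4f}")
```

The gap grows when `p` approaches 0 or 1 for a fixed `n`, which is the situation discussed later for extreme probabilities.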

Considering both distribution models, the expected value and variance of the number of
successful events are:

$$E(x) = np \tag{1.3}$$
$$\mathrm{Var}(x) = np(1-p) \tag{1.4}$$

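Now, since

$$\hat{p} = \frac{x}{n} \tag{1.5}$$

the expected value and variance of the sample proportion are:

$$E(\hat{p}) = \frac{E(x)}{n} = p \tag{1.6}$$
$$\mathrm{Var}(\hat{p}) = \frac{\mathrm{Var}(x)}{n^2} = \frac{p(1-p)}{n} \tag{1.7}$$

and the approximate probability density function is:

$$f(\hat{p}) = \sqrt{\frac{n}{2\pi p(1-p)}} \exp\left(-\frac{n(\hat{p}-p)^2}{2p(1-p)}\right) \tag{1.8}$$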

Considering that the result of each experiment ($y_i$) can be represented by a binary variable with
values $0$ (failure) or $1$ (success), the individual results can be related to the number of
successes in the sample ($x$) as follows:

$$x = \sum_{i=1}^{n} y_i \tag{1.9}$$

and therefore:

$$E(y_i) = \frac{E(x)}{n} = p \tag{1.10}$$
$$\mathrm{Var}(y_i) = \frac{\mathrm{Var}(x)}{n} = p(1-p) \tag{1.11}$$

The relationship observed between the individual results ($y_i$) and the sample proportion of
success ($\hat{p}$) indicates that the proportion can be considered as a sample average of $y_i$. Now, if
the sample size is large enough (larger than 20 [6]), then the behavior of $\hat{p}$ (as a sample
average of $y_i$) can be considered normal, and the tests of means can be used as tests of
proportions.

While this approach is mathematically correct, this might lead in practice to erroneous results
because the proportion cannot be less than $0$ nor larger than $1$. Thus, when the probability of
success is close to $0$ or close to $1$, the validity of this approach might be compromised.
Assuming that a normal distribution can be safely truncated within approximately standard
deviations around the mean ( confidence), then the following approximate conditions
guarantee the validity of the normal approximation:

( )
(1.12)
( )
(1.13)
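A minimal sketch of how such validity conditions translate into a minimum sample size is given below. It assumes the conditions take the form $p - k\sqrt{p(1-p)/n} \ge 0$ and $p + k\sqrt{p(1-p)/n} \le 1$, where $k$ is the number of standard deviations kept inside the range. Both the function name and the default `k = 4` are assumptions made only for illustration.

```python
import math

def min_sample_size(p, k=4.0):
    """Smallest n keeping p -/+ k*sqrt(p*(1-p)/n) inside [0, 1].

    k is an assumed truncation width in standard deviations.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    lower = k**2 * (1 - p) / p   # from p - k*sigma >= 0
    upper = k**2 * p / (1 - p)   # from p + k*sigma <= 1
    # A tiny tolerance avoids rounding up because of floating-point noise.
    return math.ceil(max(lower, upper) - 1e-9)

print(min_sample_size(0.5))    # symmetric case: both bounds coincide
print(min_sample_size(0.9))    # probability close to 1
print(min_sample_size(0.001))  # rare events require much larger samples
```

The first condition dominates when $p$ is close to 0 and the second when $p$ is close to 1, so the required sample grows quickly toward either extreme.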

Two different approaches for testing the proportion will be considered: The Z-test approach
and the T-test approach. In principle, the Z-test can be used for testing the proportion
considering that the variance is directly related to the mean value of the proportion, as can be
seen in Eq. (1.7), and thus, it can be assumed to be known. On the other hand, the T-test for an
unknown variance can also be employed. The T-test can also be used when the data is sub-
grouped. In this case, the proportion observed in each sub-group can be considered as an
individual observation of an arbitrary continuous variable, and a conventional T-test is then
used. When sub-groups are considered, a logistic transformation of the proportion value also
allows overcoming out-of-range values, as long as a deterministic result (a sub-group proportion
of $0$ or $1$) is not observed in any of the sub-groups. The different cases are described in the
following Sections.

2. Z-Test of Proportions

The set of hypotheses considered in a right-tail one-sample test of proportions is the following:

$$H_0: p \le p_0 \qquad H_1: p > p_0 \tag{2.1}$$

where $p_0$ is the hypothetical probability of success.

By assuming that the sample proportion behaves as the sample average of binary results ($y_i$),
then Eq. (2.1) can be alternatively expressed as:

$$H_0: E(y_i) \le p_0 \qquad H_1: E(y_i) > p_0 \tag{2.2}$$

basically representing a statistical test of means which has been previously discussed [1].

A Z-test can be used for testing Eq. (2.2) as long as the variance is known. If the null hypothesis
were true, then:

$$\mathrm{Var}(y_i) = p_0(1-p_0) \tag{2.3}$$

and the corresponding statistic becomes:

$$Z = \frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}} \tag{2.4}$$

However, if the alternative hypothesis were true ($p > p_0$), then the statistic will be:

$$Z = \frac{\hat{p}-p_0}{\sqrt{p(1-p)/n}} \tag{2.5}$$

Of course, $p$ is unknown but it can be approximated using $\hat{p}$ as an estimate:

$$Z \approx \frac{\hat{p}-p_0}{\sqrt{\hat{p}(1-\hat{p})/n}} \tag{2.6}$$

This means that the Z-test can be used only as an approximation of the proportion test.
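The two variants of the statistic can be sketched as follows, assuming the usual form in which the difference between the sample proportion and the hypothetical proportion is divided by the standard error. The function name, its flag and the data are hypothetical.

```python
import math

def z_statistic(successes, n, p0, use_sample_estimate=False):
    """Z statistic for a proportion.

    The variance uses p0 (null hypothesis assumed true) or, optionally,
    the sample proportion as an estimate of the true probability.
    """
    p_hat = successes / n
    p_var = p_hat if use_sample_estimate else p0
    return (p_hat - p0) / math.sqrt(p_var * (1 - p_var) / n)

# Hypothetical data: 55 successes in 100 trials tested against p0 = 0.5.
z_null = z_statistic(55, 100, 0.5)
z_est = z_statistic(55, 100, 0.5, use_sample_estimate=True)
print(f"Z with H0 variance: {z_null:.3f}; Z with estimated variance: {z_est:.3f}")
```

The two values differ because the variance depends on the proportion used to evaluate it. With the estimated variance, the statistic is undefined when the sample contains only successes or only failures.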

The success probabilities of the null and alternative hypotheses can be related by the Cohen
distance for the proportions ($\delta$), as follows:

$$\delta = \frac{|p-p_0|}{\sqrt{p(1-p)}} \tag{2.7}$$

Figure 1 shows the behavior of the Cohen distance for the proportions as a function of the true
success probability, for different hypothetical success probabilities.

Since the standard deviation is a function of the true success probability, the true Cohen
distance shows a highly nonlinear behavior. For that reason, a constant Cohen resolution for
the test of proportions is not advisable. Instead, the Cohen resolution for the test ($\delta_r$) can be
obtained from the target difference in proportion ($\Delta p$), as follows:

$$\delta_r = \frac{\Delta p}{\sqrt{p_0(1-p_0)}} \tag{2.8}$$

Figure 1. Cohen distance for the proportion ( ) as a function of the true success probability of
the data ( ) and the hypothetical success probability ( ). Blue: . Purple: .
Green: . Orange: . Red: .

This target difference should be larger than or equal to the measurement resolution of the
proportion, as follows:

$$\Delta p \ge \frac{1}{n} \tag{2.9}$$

If there is no clear target difference for a test, the minimum measurement resolution is the
most reasonable choice.
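A short sketch of this calculation follows, assuming that the Cohen resolution is the target difference in proportion divided by $\sqrt{p_0(1-p_0)}$ and that the measurement resolution of a proportion obtained from $n$ observations is $1/n$. The function names are hypothetical.

```python
import math

def cohen_resolution(delta_p, p0):
    """Target difference in proportion expressed in standard deviations."""
    return delta_p / math.sqrt(p0 * (1 - p0))

def minimum_cohen_resolution(n, p0):
    """Resolution obtained when the target difference is 1/n."""
    return cohen_resolution(1 / n, p0)

# A target difference of 0.01 around p0 = 0.9 gives a resolution of about 1/30.
print(f"{cohen_resolution(0.01, 0.9):.5f}")
# Sixteen observations tested against p0 = 0.9.
print(f"{minimum_cohen_resolution(16, 0.9):.4f}")
```

These two cases correspond to the `dr=1/30` and `dr=(1/16)/sqrt(0.9*0.1)` arguments used in the examples of Section 4.1.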

The optimal sample size and optimal significance level can then be obtained following the
procedure used for the conventional Z-test [1], considering the Cohen test resolution $\delta_r$ (Eq.
2.8). It is also possible to calculate standardized decision values ($D_s$) [3] for drawing a
conclusion about the hypotheses (2.1) or (2.2).

A similar analysis applies for the other cases: left-tail and two-tailed Z-tests.

3. T-Test of Proportions

The previous approach for testing proportions using the Z-test can only be considered as an
approximation because, rigorously speaking, we do not know the true population variance of
the data. Thus, the T-test of proportions is a more realistic approach. Of course, for large
samples ( elements) the T-test is practically equivalent to the Z-test [3].

Let us consider different alternatives for performing this T-test. The first alternative is
considering all binary observations as a single group, just as it was considered in the previous
Z-test approach. The second alternative consists of dividing all the binary observations into a
mutually exclusive and exhaustive set of sub-groups, and then considering the proportion of
each sub-group as an element of the sample to be analyzed. Ideally, each sub-group should
represent a different replication of the experiment. The third alternative is also based on
sub-groups, but using the logistic transformation [7] of the proportion as response variable.

3.1. Single-Group T-Test of Proportions

In this approach, the variance of a sample of binary observations is used as an estimate of the
population variance. The sample variance for the binary observations is:

$$s_y^2 = \frac{\sum_{i=1}^{n}(y_i-\hat{p})^2}{n-1} = \frac{n\,\hat{p}(1-\hat{p})}{n-1} \tag{3.1}$$

The one-sample T-test statistic ($t$) for the proportion then becomes:

$$t = \frac{\hat{p}-p_0}{\sqrt{\hat{p}(1-\hat{p})/(n-1)}} \tag{3.2}$$

with $n-1$ degrees of freedom.
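The single-group statistic can be sketched as below, assuming the sample variance of the binary observations is computed with the usual $n-1$ denominator, which for 0/1 data reduces to $n\hat{p}(1-\hat{p})/(n-1)$. The function name and the data are hypothetical.

```python
import math
import statistics

def single_group_t(successes, n, p0):
    """One-sample t statistic for a proportion from binary observations."""
    p_hat = successes / n
    s2 = n * p_hat * (1 - p_hat) / (n - 1)  # sample variance of the 0/1 data
    t = (p_hat - p0) / math.sqrt(s2 / n)
    return t, n - 1, s2

# Hypothetical data: 170 successes in 200 trials tested against p0 = 0.9.
t, df, s2 = single_group_t(170, 200, 0.9)

# Cross-check of the closed-form variance against the raw binary data.
raw = [1] * 170 + [0] * 30
print(f"t = {t:.4f}, df = {df}, s2 = {s2:.6f}, direct = {statistics.variance(raw):.6f}")
```

The cross-check confirms that no information is lost by working with the number of successes instead of the individual binary results.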

For paired two-sample T-tests, the statistic becomes (assuming a hypothetical zero difference
in proportions):

$$t = \frac{\bar{d}}{s_d/\sqrt{n}} \tag{3.3}$$

where

$$s_d^2 = \frac{\sum_{i=1}^{n}(d_i-\bar{d})^2}{n-1} \tag{3.4}$$

and $d_i = y_{1,i} - y_{2,i}$. In this case, the standard deviation of the difference can be used for estimating
the test resolution:

(3.5)

For unpaired two-sample T-tests (assuming a hypothetical zero difference in proportions), the
statistic can be obtained as follows:

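$$t = \frac{\hat{p}_1-\hat{p}_2}{\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1-1}+\dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2-1}}} \tag{3.6}$$

with

$$\nu = \frac{\left(\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1-1}+\dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2-1}\right)^2}{\dfrac{1}{n_1-1}\left(\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1-1}\right)^2+\dfrac{1}{n_2-1}\left(\dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2-1}\right)^2} \tag{3.7}$$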
and test resolution:

( ) ( )

(3.8)

The optimal sample size and optimal significance level can then be obtained following the
procedure used for the conventional T-test [1], followed by the determination of the
standardized $D_s$-value [3] for drawing a conclusion.
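For the unpaired case, the following sketch assumes an unequal-variance (Welch) statistic with Welch-Satterthwaite degrees of freedom, where each variance term is $\hat{p}(1-\hat{p})/(n-1)$. The function name is hypothetical. The counts are those reported in Table 2 for product 162b2 (8 infections in 21720 participants versus 162 in 21726).

```python
import math

def welch_proportion_test(x1, n1, x2, n2):
    """Unpaired t statistic and approximate degrees of freedom for two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    v1 = p1 * (1 - p1) / (n1 - 1)  # squared standard error of group 1
    v2 = p2 * (1 - p2) / (n2 - 1)  # squared standard error of group 2
    t = (p1 - p2) / math.sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

t, df = welch_proportion_test(8, 21720, 162, 21726)
print(f"t = {t:.2f}, degrees of freedom = {df:.0f}")
```

Under these assumptions the degrees of freedom come out close to the value `v = 23882` listed in the corresponding test output of Section 4.3.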

3.2. Multiple-Group T-Test of Proportions

Sometimes, using a single large sample for the test of proportion is not necessarily the best
approach. Large samples may result in extremely small probability values making the analysis
unreliable [3].

Let us consider that a sample containing $n$ observations is randomly split into $m$ different
sub-groups (exhaustive and mutually exclusive) of the same size $n_g$. That is:

$$n = m\,n_g \tag{3.9}$$

Now, each sub-group ($j$) will have a proportion value ($\hat{p}_j$) given by:

$$\hat{p}_j = \frac{1}{n_g}\sum_{i=1}^{n_g} y_{i,j} \tag{3.10}$$

with a proportion measurement resolution $1/n_g$.

The overall proportion value ($\bar{p}$) will then be given by:

$$\bar{p} = \frac{1}{m}\sum_{j=1}^{m}\hat{p}_j \tag{3.11}$$

with sample variance:

$$s_p^2 = \frac{\sum_{j=1}^{m}(\hat{p}_j-\bar{p})^2}{m-1} \tag{3.12}$$

The one-sample T-test statistic for the proportion then becomes:

$$t = \frac{\bar{p}-p_0}{s_p/\sqrt{m}} \tag{3.13}$$

with $m-1$ degrees of freedom, and the corresponding set of hypotheses to be tested is (using
the right-tail test as reference):

$$H_0: E(\hat{p}_j) \le p_0 \qquad H_1: E(\hat{p}_j) > p_0 \tag{3.14}$$

Of course, the number of sub-groups should be at least in order to guarantee a normal


behavior of the sub-group proportion value.

Sub-groups of different sample size are also possible, but this will lead to different
measurement resolutions. It is therefore advisable to use sub-groups of the same size when
possible.

For two-sample tests the sub-group proportion data can also be treated as conventional
normal random variables (for samples larger than or when the validity of the normal
assumption has been confirmed).

Let us recall that the minimum total error of a test is a function of the sample size and test
resolution [1]. For this approach, the sample size is the number of sub-groups ($m$) and
the minimum test resolution is a function of the sub-group size ($n_g$):

$$\delta_{r,\min} = \frac{1/n_g}{\sqrt{p_0(1-p_0)}} \tag{3.15}$$
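The sub-group procedure can be sketched as follows: binary observations are cut into equal-size groups, each group proportion is treated as one observation, and a conventional one-sample t statistic is computed. The function name and the germination counts are hypothetical. In practice the observations would be shuffled before grouping unless each group is a separate replication.

```python
import math
import statistics

def subgroup_t_test(binary, group_size, p0):
    """One-sample t statistic computed on sub-group proportions."""
    m = len(binary) // group_size  # number of complete sub-groups
    props = [
        sum(binary[j * group_size:(j + 1) * group_size]) / group_size
        for j in range(m)
    ]
    p_bar = statistics.mean(props)
    s_p = statistics.stdev(props)  # uses m - 1 in the denominator
    t = (p_bar - p0) / (s_p / math.sqrt(m))
    min_resolution = (1 / group_size) / math.sqrt(p0 * (1 - p0))
    return p_bar, t, m - 1, min_resolution

# Hypothetical data: 8 sub-groups of 16 seeds with these germination counts.
counts = [14, 15, 13, 14, 16, 14, 13, 15]
binary = [y for k in counts for y in [1] * k + [0] * (16 - k)]
p_bar, t, df, min_resolution = subgroup_t_test(binary, 16, 0.9)
print(f"p_bar = {p_bar:.4f}, t = {t:.3f}, df = {df}, min resolution = {min_resolution:.4f}")
```

Note that the degrees of freedom depend on the number of sub-groups, not on the total number of binary observations.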

Using Eq. (3.15) the minimum total test error can be estimated for different tests and sample
proportion values. As an example, the minimum total test error obtained for ,
considering different sub-group sizes is graphically represented in Figure 2, for both one-tailed
and two-tailed T-tests. If the number of sub-groups is small, the total test error increases, even
making the test inviable. If the number of sub-groups is large, the test error drops to practically
zero, making the test inconclusive or unreliable. Thus, for each particular proportion value and
total number of observations, an optimal range of sub-groups can be found. As a general rule
of thumb, a minimum total error between and can be considered acceptable
during the design of a test.

Figure 2. Minimum total test error as a function of the number of sub-groups ( ) for testing
different proportion values. Left: One-tailed T-test. Right: Two-tailed T-test.

3.3. Logistic approach

Even when sub-groups are used, if the proportion values are extremely low (close to zero) or
extremely high (close to one) then the normal approximation may fail since out-of-range values
are possible. This situation takes place (assuming a safe truncation within approximately
standard deviations around the mean) when:

∑ ( ̅) ( ̅ ̅)

( )
(3.16)
In this case, the following logistic transformation can be used [7]:

$$L_j = \ln\left(\frac{\hat{p}_j}{1-\hat{p}_j}\right) \tag{3.17}$$

The monotonic non-linear transformation (3.17) may take any arbitrary value in the range
$(-\infty, \infty)$, thus avoiding fictitious out-of-range values. For samples larger than this
transformation has no relevant influence on the validity of the T-test. However, for smaller
samples, the normality of the transformed variable must always be verified.
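A minimal sketch of the transformation and its inverse is shown below, assuming the standard log-odds form $\ln(p/(1-p))$. The function names are hypothetical. Proportions of exactly 0 or 1 are rejected because their transformed value is not finite.

```python
import math

def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("the logistic transformation requires 0 < p < 1")
    return math.log(p / (1 - p))

def inverse_logit(value):
    """Map any real value back to a proportion in (0, 1)."""
    return 1 / (1 + math.exp(-value))

for p in (0.02, 0.5, 0.9):
    print(f"p = {p:.2f} -> logit = {logit(p):+.4f} -> back = {inverse_logit(logit(p)):.2f}")
```

Because the inverse always returns a value inside $(0,1)$, any prediction or confidence bound built on the transformed scale stays within the valid range after back-transformation.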

The set of hypotheses to be tested for a right-tail test is now:

$$H_0: E(L_j) \le L_0 \qquad H_1: E(L_j) > L_0 \tag{3.18}$$

where

$$L_0 = \ln\left(\frac{p_0}{1-p_0}\right) \tag{3.19}$$

4. Examples

4.1. Verifying the Seed Germination of Grass

A farmer has been planting grass seed from a new supplier which promises 90%
germination§, but is not convinced of the results obtained. In order to verify this information,
the following set of hypotheses is proposed:

$$H_0: p \ge 0.9 \qquad H_1: p < 0.9 \tag{4.1}$$

This means that if the germination probability is found to be less than $0.9$, then the quality
promise is not truly fulfilled by the supplier.

According to Eq. (1.13), the minimum sample size required for the viability of the normal
assumption for this test is:

⌈ ( )⌉
(4.2)

For designing the test a proportion measurement resolution of $0.01$ is considered (smaller
differences are considered negligible by the farmer). This means:

$$\delta_r = \frac{0.01}{\sqrt{0.9\,(1-0.9)}} = \frac{1}{30} \tag{4.3}$$

Considering also a maximum test error of , the optimal sample size for a Z-test is :

§
Percent germination is one of many possibilities for analyzing seed germination [8].

optsize.mean.test(test="Z",tails=1,dr=1/30)
v test tails dr CR alpha testerror
9277 Z 1 0.03333333 3.644324e+23 0.04743786 0.09487572

If the variance is assumed unknown (T-test), then the sample size obtained under the same
conditions is :

optsize.mean.test(test="T",tails=1,dr=1/30)
v test tails dr CR alpha testerror
11292 T 1 0.03333333 4.432928e+22 0.04371696 0.08743392

In both cases, the optimal size is larger than the minimum sample required for a viable normal
assumption.

The farmer decides to perform both tests independently. The first experiment with seeds
resulted in successful germinations ( ). The second experiment with seeds
resulted in successful germinations ( ). Even though the germination proportions
obtained were less than 90%, a statistical analysis will be performed in order to support the
conclusion.

The binary data obtained were used as a single group for testing the hypotheses in Eq. (4.1)
with a Z-test for the first experiment and a T-test for the second experiment. The results are
the following:

opt.mean.test(Exp1,mu=0.9,test="Z",tails=1,dr=1/30)
decision D Dcr Ds P alpha v test tails dr error
Ho is rejected 0.0021 0.0013 1.6544 1.75e-10 0.0474 9277 Z 1 0.0333 0.0948

opt.mean.test(Exp2,mu=0.9,test="T",tails=1,dr=1/30)
decision D Dcr Ds P alpha v test tails dr error
Inconclusive 0.0010 0.0018 0.5717 4.95e-7 0.04372 11291 T 1 0.0333 0.08744

As it can be seen, the Z-test rejected the null hypothesis (with ) whereas the T-test
was inconclusive ( ) despite (or probably because of) the large sample used.

In order to confirm the results, a multiple sub-group approach is considered by the farmer.
Using a similar amount of seeds ( seeds), the minimum total test error for different
numbers of sub-groups is shown in Figure 3. Only two possibilities are found in the
recommended range of total test error: groups of seeds each ( test error) and
groups of seeds each ( test error). The second option is selected by the farmer
due to the lower test error.

Figure 3. Minimum total test error as a function of the number of sub-groups ( ) for testing a
proportion of using seeds. Green shade: Minimum total test error in the range
between and .

The proportion of each sub-group is calculated and the values obtained are used for
testing the hypotheses (4.1). The average proportion obtained was with a standard
deviation of . The optimal T-test results obtained are the following:

opt.mean.test(Exp3,mu=0.9,test="T",tails=1,dr=(1/16)/sqrt(0.9*0.1))
decision D Dcr Ds P alpha v test tails dr error
Ho is rejected 0.03115 0.01012 3.0765 2.1e-11 0.0057 624 T 1 0.2083 0.0115

Notice that even though the average proportion was similar to the data used in the Z-test, the
standardized decision value is now much larger ( ), just by using a sub-
group approach. Even more, if fewer groups are used, let us say for example groups of
elements each, the test error increases to but the number of observations is reduced to
. An experiment under these conditions yields an average germination of with
standard deviation and the following results:

opt.mean.test(Exp4,mu=0.9,test="T",tails=1,dr=(1/16)/sqrt(0.9*0.1))
decision D Dcr Ds P alpha v test tails dr error
Ho is rejected 0.02837 0.01526 1.8589 2.02e-5 0.0481 274 T 1 0.2083 0.0962

The standardized -value obtained with this approach, (using seeds), is


larger than the value obtained in the Z-test of a single sample, (using seeds),
confirming that the sub-group approach is a cost-effective alternative for testing proportions.
̅
Finally, by checking Eq. (3.16) it is found that and therefore, the normal assumption

does not lead to significant out-of-range predictions. This also means that a logistic
transformation is not required. Then, it can be safely concluded that the experimental results
support the suspicions of the farmer: The grass seed supplier does not meet the promised
percentage of germination.

4.2. Quality Improvement in the Manufacturing of Ceramic Bricks

A ceramic manufacturing company is interested in the incorporation of ceramic waste as a
substitute for clay in bricks, as reported in the scientific literature [9]. They have already
obtained some prototypes with competitive performance but the percentage of rejection (due
to breakage, cracking and color defects) is too high to make it economically feasible. Thus, they
decide to perform a Plackett-Burman experimental design with 12 treatments considering the
following process variables:

A. Amount of clay replaced by ceramic waste ( – )


B. Amount of organic binder ( – )
C. Maximum firing temperature ( – )
D. Firing duration at maximum temperature ( – )
E. Firing/cooling rate ( – )

For each treatment, a batch of bricks is produced. After the quality control check, the total
percentage of rejection is determined. The results obtained are summarized in Table 1. The
minimum level of each factor is denoted as $-1$, whereas the maximum level is denoted as $1$.

The data is analyzed with an optimal ANOVA [2] using a linear model without interactions. The
best model obtained (minimum mean squared error) does not include factor E:

(4.4)
Table 1. Experimental design for reducing brick rejection.
Exp# A B C D E % Rejection
1 1 -1 1 1 1 28%
2 1 1 -1 -1 -1 6%
3 1 1 -1 1 1 8%
4 -1 -1 1 -1 1 14%
5 -1 1 1 -1 1 8%
6 1 1 1 -1 -1 22%
7 -1 -1 -1 1 -1 4%
8 -1 1 -1 1 1 6%
9 -1 -1 -1 -1 -1 6%
10 -1 1 1 1 -1 10%
11 1 -1 -1 -1 1 12%
12 1 -1 1 1 -1 34%

The model residuals are not normal, as they presented a normality value [10]. For
that reason, represents an arbitrary Type I standard random variable [11] instead of the
standard normal random variable. In addition, the average brick rejection described by the
model was with standard deviation . Considering an individual prediction then
̅ and Eq. (3.16) is confirmed. This means that out-of-range values would be predicted by
model (4.4) if the residuals are assumed to behave normally.

Considering the lack of normality in the residuals, a logistic transformation is proposed for the
brick rejection percentage. The best logistic model obtained is:

( )

(4.5)

In this case the residuals now behave normally with . For that reason, the residual
error is modeled using the standard normal random variable ( ). On the other hand, since the
logistic transformation may take values in the range $(-\infty, \infty)$, no model prediction will be out
of range.

Considering the sign of the coefficients in model (4.5), the lowest brick rejection should be
obtained when the amount of ceramic waste is small ($A=-1$), the amount of organic binder is
large ($B=1$), and the maximum firing temperature is low ($C=-1$). The firing duration and rate of
temperature change can be set using other criteria. For example, considering productivity the
firing duration should be low ($D=-1$) whereas the temperature rate should be high ($E=1$).
Since this treatment was not part of the experimental design, it must be verified
experimentally. The optimal ANOVA obtained using model (4.5) is the following:

opt.anova(lm(log(R/(1-R))~(.)-D-E))
Analysis of Variance Table
Response: log(R/(1 - R))
Df Sum Sq Mean Sq F value Pr(>F) alphaopt Ds decision
A 1 2.1474 2.1474 27.1053 0.000816 0.17622 3.7741 1
B 1 0.5232 0.5232 6.6037 0.033139 0.17622 1.1772 1
C 1 3.5634 3.5634 44.9792 0.000152 0.17622 4.9541 1
Residuals 8 0.6338 0.0792

The effects of factors A, B and C were found statistically significant. Of course, the effects of
factors D and E are neglected. This means that they are either not significant or inconclusive
(likely in the limit of the test resolution). The largest effect observed was due to the maximum
firing temperature, closely followed by the amount of ceramic waste.
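The classical part of this ANOVA table (sums of squares and F values) can be cross-checked from Table 1 with a short sketch. It assumes the rejection percentages are transformed as $\ln(R/(1-R))$ and relies on the fact that the coded columns A, B and C of this design are balanced and mutually orthogonal, so each sum of squares equals the squared contrast divided by the number of runs. The optimal significance level and the standardized decision value are not reproduced here.

```python
import math

# Coded levels (A, B, C) and rejection fraction for the 12 runs of Table 1.
runs = [
    (1, -1, 1, 0.28), (1, 1, -1, 0.06), (1, 1, -1, 0.08), (-1, -1, 1, 0.14),
    (-1, 1, 1, 0.08), (1, 1, 1, 0.22), (-1, -1, -1, 0.04), (-1, 1, -1, 0.06),
    (-1, -1, -1, 0.06), (-1, 1, 1, 0.10), (1, -1, -1, 0.12), (1, -1, 1, 0.34),
]

y = [math.log(r / (1 - r)) for *_, r in runs]  # logistic transformation
n = len(runs)
total_ss = sum(v**2 for v in y) - sum(y)**2 / n

ss = {}
for idx, name in enumerate("ABC"):
    contrast = sum(run[idx] * v for run, v in zip(runs, y))
    ss[name] = contrast**2 / n  # valid because the columns are orthogonal

residual_df = n - 1 - len(ss)
residual_ss = total_ss - sum(ss.values())
for name, value in ss.items():
    f_value = value / (residual_ss / residual_df)
    print(f"{name}: Sum Sq = {value:.4f}, F = {f_value:.2f}")
print(f"Residuals: Df = {residual_df}, Sum Sq = {residual_ss:.4f}")
```

The signs of the contrasts also indicate the direction of each effect: positive for A and C, negative for B.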

An experimental verification of batches of bricks using the optimal processing conditions


resulted in an average brick rejection of only .

4.3. Efficacy of Pharmaceutical Products

Two pharmaceutical products for preventing an infectious disease were tested in clinical trials
in order to determine their efficacy. For each product, a placebo group of similar size was used.
The results of the clinical trials are summarized in Table 2 [12].

Table 2. Reported clinical trial data for two pharmaceutical products [12]
Group          Outcome        162b2    1273
Test group     Infected           8      11
Test group     Not infected   21712   15199
Test group     Total          21720   15210
Placebo group  Infected         162     185
Placebo group  Not infected   21564   15025
Placebo group  Total          21726   15210

The calculated efficacy for the two products is presented in Table 3. From this information
different questions arise: Is the risk reduction of each product with respect to the control
statistically significant? Is there any statistically significant difference in the infection risk
between both products?

Table 3. Calculated efficacy for two pharmaceutical products [12]


Product 162b2 1273
Infection risk of tested product 0.0368% 0.0723%
Infection risk of control (placebo) 0.7457% 1.2163%
Absolute risk reduction (ARR) 0.7088% 1.1440%
Relative risk reduction (RRR) 95.060% 94.054%
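The entries of Table 3 follow directly from the counts in Table 2. A minimal sketch of the arithmetic is given below, assuming the absolute risk reduction is the difference between placebo and test risks and the relative risk reduction is that difference divided by the placebo risk. The function name is hypothetical.

```python
def risk_summary(test_infected, test_total, placebo_infected, placebo_total):
    """Infection risks, absolute and relative risk reduction."""
    test_risk = test_infected / test_total
    placebo_risk = placebo_infected / placebo_total
    arr = placebo_risk - test_risk
    rrr = arr / placebo_risk
    return test_risk, placebo_risk, arr, rrr

trials = {"162b2": (8, 21720, 162, 21726), "1273": (11, 15210, 185, 15210)}
for product, counts in trials.items():
    test_risk, placebo_risk, arr, rrr = risk_summary(*counts)
    print(f"{product}: test {test_risk:.4%}, placebo {placebo_risk:.4%}, "
          f"ARR {arr:.4%}, RRR {rrr:.3%}")
```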

The clinical trial was performed considering a single sample for each treatment. Thus, the
single-group approach must be used for analyzing the results. In order to answer the first
question, let us first check if the sample sizes used are adequate. Using Eq. (1.12) we find that
the minimum sample sizes required by the normal approximation are:
, , and
. This means that the sample sizes for the placebo groups were adequate, but the
sample sizes for the test groups were too small. Thus, a reliable statistical analysis based on the
normal approximation requires larger samples for the tested products (or a more convenient
sub-groups approach).

Anyway, let us perform the corresponding calculations using the binary individual observations.
First, a test resolution must be defined. A minimum test resolution can be determined based on
the measurement resolution. Thus, the minimum test resolution for product 162b2 is
approximately $5.35\times10^{-4}$, and the minimum test resolution for product 1273 is
approximately $6.0\times10^{-4}$. Using these minimum test resolutions the following
results are obtained:

opt.mean.test(T1,C1,test="T",tails=1,dr=5.35e-4)
decision D Dcr Ds P alpha v test tails dr error
Ho is rejected 0.0030 0.0011 2.675 1.3e-32 0.3683 23882 T 1 5.35e-4 0.7366

opt.mean.test(T2,C2,test="T",tails=1,dr=6e-4)
decision D Dcr Ds P alpha v test tails dr error
Ho is rejected 0.0047 0.0013 3.508 5.3e-36 0.3686 17032 T 1 6.0e-4 0.7373

In both cases, a significant difference is observed in the risk of infection between the tested
products and the control. However, the minimum total test errors are larger than 50%, making
the tests inviable. This, of course, is a consequence of the small magnitude for the minimum
resolution estimated from large sample sizes. While choosing a different test resolution is
possible, the selection may become subjective, and the conclusion obtained will depend on the
particular value chosen. For these reasons, a sub-group approach with multiple replicates of
the clinical trial is advisable for a more reliable conclusion.
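The order of magnitude of these minimum resolutions can be checked with a short sketch. It assumes that the measurement resolution is one over the size of the test group and that the standard deviation is evaluated at the placebo infection risk. Both choices are assumptions, selected because they give values close to the `dr` arguments used in the calls above.

```python
import math

def minimum_resolution(n, p):
    """Measurement resolution 1/n expressed in standard deviations at risk p."""
    return (1 / n) / math.sqrt(p * (1 - p))

res_162b2 = minimum_resolution(21720, 162 / 21726)
res_1273 = minimum_resolution(15210, 185 / 15210)
print(f"162b2: {res_162b2:.3e}, 1273: {res_1273:.3e}")
```

Resolutions this small are a direct consequence of the very large sample sizes, which is why the resulting minimum total test error exceeds the viability limit.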

On the other hand, if the infection risks of both products need to be compared, the minimum
test resolution becomes approximately $1.71\times10^{-3}$. Unfortunately, for this
resolution, the test is not only inviable but also inconclusive:

opt.mean.test(T1,T2,test="T",tails=1,dr=1.71e-3)
decision D Dcr Ds P alpha v test tails dr error
Inconclusive 5.72e-5 0.0011 0.0522 0.0811 0.3525 25706 T 1 0.00171 0.70492

In conclusion, when proportions with extremely small (or extremely large) probabilities are
tested, collecting information from multiple sub-groups may be more reliable than using a
single larger sample.

4.4. Mortality Rate Comparison

In this final example, the monthly mortality rates observed in Japan for different years [13] are
compared. The years considered are: 2000, 2010, 2015, 2019 and 2020. Reported monthly
death rates for these years are summarized in Table 4. This data is graphically presented in
Figure 4.

An optimal ANOVA is performed with this data in order to analyze the effects of time (month
and year). The results obtained are the following:

Table 4. Monthly death rate in Japan for different years [13]


Month \ Year 2000 2010 2015 2019 2020
January 0.000775 0.000883 0.001056 0.001110 0.001024
February 0.000730 0.000765 0.000865 0.000929 0.000917
March 0.000700 0.000806 0.000896 0.000937 0.000937
April 0.000630 0.000772 0.000825 0.000893 0.000893
May 0.000607 0.000763 0.000808 0.000872 0.000854
June 0.000546 0.000706 0.000744 0.000802 0.000791
July 0.000572 0.000743 0.000782 0.000838 0.000827
August 0.000568 0.000747 0.000795 0.000875 0.000879
September 0.000548 0.000716 0.000773 0.000846 0.000848
October 0.000598 0.000768 0.000849 0.000897 0.000932
November 0.000625 0.000812 0.000839 0.000937 0.000936
December 0.000684 0.000866 0.000919 0.001005 0.001039

Figure 4. Monthly death rate in Japan for different years: (Blue), (Green),
(Red), (Brown) and (Black)

opt.anova(lm(Rate~Month+Year))
Analysis of Variance Table
Response: Rate
Df Sum Sq Mean Sq F value Pr(>F) alphaopt Ds decision
Month 11 2.7670e-07 2.5154e-08 37.707 9.1705e-19 0.030749 13.019 1
Year 4 6.4194e-07 1.6048e-07 240.573 0.0000e+00 0.090054 28.095 1
Residuals 44 2.9350e-08 6.6700e-10

It is then concluded that significant changes in monthly death rate have taken place in Japan,
not only due to seasonal (monthly) effects, but also to year-to-year changes.
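
As a point of comparison, the conventional part of this analysis can be sketched with base R alone. The block below rebuilds the `Rate`, `Month` and `Year` variables from Table 4 and runs a classical two-way ANOVA without interaction. The `opt.anova` function belongs to this report series and is not reproduced here, so the `alphaopt`, `Ds` and `decision` columns are not computed. The object names `years`, `rate_table` and `classic_anova` are illustrative choices, and the `"Y2000"`-style factor labels are assumed from the test calls shown later in this section.

```r
# Death rates transcribed from Table 4 (rows: January to December;
# columns: 2000, 2010, 2015, 2019, 2020).
years <- c("Y2000", "Y2010", "Y2015", "Y2019", "Y2020")
rate_table <- matrix(c(
  0.000775, 0.000883, 0.001056, 0.001110, 0.001024,
  0.000730, 0.000765, 0.000865, 0.000929, 0.000917,
  0.000700, 0.000806, 0.000896, 0.000937, 0.000937,
  0.000630, 0.000772, 0.000825, 0.000893, 0.000893,
  0.000607, 0.000763, 0.000808, 0.000872, 0.000854,
  0.000546, 0.000706, 0.000744, 0.000802, 0.000791,
  0.000572, 0.000743, 0.000782, 0.000838, 0.000827,
  0.000568, 0.000747, 0.000795, 0.000875, 0.000879,
  0.000548, 0.000716, 0.000773, 0.000846, 0.000848,
  0.000598, 0.000768, 0.000849, 0.000897, 0.000932,
  0.000625, 0.000812, 0.000839, 0.000937, 0.000936,
  0.000684, 0.000866, 0.000919, 0.001005, 0.001039
), nrow = 12, byrow = TRUE, dimnames = list(month.name, years))

# Long format: as.vector() unrolls the matrix column by column,
# i.e. the 12 months of 2000 first, then the 12 months of 2010, etc.
Rate  <- as.vector(rate_table)
Month <- factor(rep(month.name, times = length(years)), levels = month.name)
Year  <- factor(rep(years, each = 12), levels = years)

# Conventional two-way ANOVA without interaction (base R only)
classic_anova <- anova(lm(Rate ~ Month + Year))
print(classic_anova)
```

The degrees of freedom and sums of squares printed by this sketch can be checked against the first columns of the table above.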

In order to compare individual years and taking into account the significant seasonal (monthly)
behavior of the death rate, paired tests are recommended. The results obtained for two-tailed
paired T-tests comparing the year 2020 with all other years are the following:

opt.mean.test(Rate[Year=="Y2000"]-Rate[Year=="Y2020"],test="T",tails=2)
decision           D      Dcr   Ds     P       alpha  v  test tails dr    error
Ho is rejected     1.673  0.176 9.504  5.7e-10 0.0560 11 T    2     1.159 0.1120

opt.mean.test(Rate[Year=="Y2010"]-Rate[Year=="Y2020"],test="T",tails=2)
decision           D      Dcr   Ds     P       alpha  v  test tails dr    error
Ho is rejected     1.412  0.176 8.021  1.01e-8 0.0560 11 T    2     1.159 0.1120

opt.mean.test(Rate[Year=="Y2015"]-Rate[Year=="Y2020"],test="T",tails=2)
decision           D      Dcr   Ds     P       alpha  v  test tails dr    error
Ho is rejected     0.522  0.176 2.965  1.80e-4 0.0560 11 T    2     1.159 0.1120

opt.mean.test(Rate[Year=="Y2019"]-Rate[Year=="Y2020"],test="T",tails=2)
decision           D      Dcr   Ds     P       alpha  v  test tails dr    error
Ho is not rejected -0.209 0.176 -1.186 0.5557  0.0560 11 T    2     1.159 0.1120

These results indicate that the mortality rate in Japan in the year 2020 increased significantly
when compared to the years 2000, 2010 and 2015, but no significant differences were
observed compared to the year 2019.
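
For readers without access to `opt.mean.test`, the pairing by month can be illustrated with the conventional one-sample T-test on the monthly differences, using base R's `t.test`. This is only a sketch of the classical test: it does not compute the optimal significance level, the critical decision value or the test resolution reported above. The vector names `y2000`, `y2019` and `y2020` are illustrative; their values are transcribed from Table 4.

```r
# Monthly death rates (January to December) transcribed from Table 4
y2000 <- c(0.000775, 0.000730, 0.000700, 0.000630, 0.000607, 0.000546,
           0.000572, 0.000568, 0.000548, 0.000598, 0.000625, 0.000684)
y2019 <- c(0.001110, 0.000929, 0.000937, 0.000893, 0.000872, 0.000802,
           0.000838, 0.000875, 0.000846, 0.000897, 0.000937, 0.001005)
y2020 <- c(0.001024, 0.000917, 0.000937, 0.000893, 0.000854, 0.000791,
           0.000827, 0.000879, 0.000848, 0.000932, 0.000936, 0.001039)

# Paired comparison: a two-tailed one-sample t-test on the month-by-month
# differences, which removes the seasonal effect shared by both years
t_2000 <- t.test(y2000 - y2020)
t_2019 <- t.test(y2019 - y2020)

cat("2000 vs 2020: t =", t_2000$statistic, " p =", t_2000$p.value, "\n")
cat("2019 vs 2020: t =", t_2019$statistic, " p =", t_2019$p.value, "\n")
```

Both tests have 11 degrees of freedom (12 monthly pairs), in agreement with the `v` column, and the resulting p-values can be compared with the `P` column of the optimal tests.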

5. Conclusion

The statistical analysis of proportions can be done using conventional tests of means (Z-test or
T-test). This strategy is possible by considering the normal approximation of the corresponding
binomial distribution. It is important to remark that the analysis of proportions using tests of
means is therefore an approximation. This approximation may fail, for example, when out-of-
range values are predicted by the normal model. In those cases, a logistic transformation can
be used.
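
The out-of-range problem can be illustrated with a short R sketch. It compares a 95% interval built with the normal approximation on the proportion scale against one built on the logit scale and mapped back with `plogis`. The data (1 success in 50 trials) are hypothetical, and the delta-method standard error on the logit scale is a common textbook choice assumed here for illustration; it is not necessarily the exact transformation procedure used earlier in this report.

```r
# Hypothetical data: 1 success observed in 50 trials
x <- 1
n <- 50
p_hat <- x / n
z <- qnorm(0.975)  # two-sided 95% normal quantile

# Normal approximation applied directly to the proportion
se_p <- sqrt(p_hat * (1 - p_hat) / n)
ci_normal <- p_hat + c(-1, 1) * z * se_p

# Interval built on the logit scale, then mapped back to [0, 1].
# Delta-method standard error of logit(p_hat): 1 / sqrt(n * p * (1 - p))
se_logit <- 1 / sqrt(n * p_hat * (1 - p_hat))
ci_logit <- plogis(qlogis(p_hat) + c(-1, 1) * z * se_logit)

print(ci_normal)  # the lower limit falls below zero
print(ci_logit)   # both limits stay strictly inside (0, 1)
```

Because `plogis` maps any real number into the open interval (0, 1), limits obtained on the logit scale cannot leave the valid range of a proportion.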

The T-test is a more realistic approximation for testing proportions than the Z-test, because the
true variance of the distribution is usually unknown (even though it is related to the mean value
of the distribution). The dependence of the variance on the mean value results in a highly
nonlinear behavior of the true Cohen distance. For this reason, a constant default test
resolution is not recommended. The test resolution must be determined for each particular
problem based either on a difference tolerance or on the proportion measurement resolution.
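
A rough numerical illustration of this dependence is sketched below in R. It assumes that the Cohen distance is the difference divided by the standard deviation of a single binomial (Bernoulli) observation, sqrt(p(1-p)), and uses a hypothetical fixed difference tolerance `delta`; both the formula as written here and the chosen values are assumptions made for illustration.

```r
# Hypothetical fixed difference tolerance on the proportion scale
delta <- 0.01

# Proportions ranging from very small to very large
p <- c(0.001, 0.01, 0.1, 0.5, 0.9, 0.99, 0.999)

# Standard deviation of a single Bernoulli observation depends on p
sigma <- sqrt(p * (1 - p))

# Standardized (Cohen-type) distance for the same absolute difference
cohen_d <- delta / sigma

print(data.frame(p, sigma, cohen_d))
```

Under these assumptions, the same absolute difference of 0.01 corresponds to a standardized distance of 0.02 at p = 0.5 but of about 0.32 at p = 0.001, which is why a single default resolution cannot serve all proportions.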

The binomial observations can be analyzed as a single group or as multiple mutually exclusive
groups of observations (e.g. replications). The main disadvantage of using the single-group
analysis is the large sample size usually involved, resulting in inviable or inconclusive tests.
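
The two ways of organizing the data can be sketched in R as follows. The simulation settings (`n_total`, `k`, `p_true`, `p0`) are hypothetical, and the conventional `t.test` is used only as a stand-in for the optimal test of means described in this series.

```r
set.seed(1)  # reproducible simulated data

# Hypothetical settings: 20000 binary observations with a small proportion
n_total <- 20000
k <- 20          # number of mutually exclusive sub-groups
p_true <- 0.005  # proportion used to simulate the data
p0 <- 0.003      # reference proportion under the null hypothesis

obs <- rbinom(n_total, size = 1, prob = p_true)

# Single-group analysis: one proportion estimated from the whole sample
p_single <- mean(obs)

# Multi-group analysis: k equal-sized sub-groups, one proportion per group
group <- rep(seq_len(k), each = n_total / k)
p_groups <- as.vector(tapply(obs, group, mean))

# Test of means on the k sub-group proportions (k - 1 degrees of freedom)
res <- t.test(p_groups, mu = p0)

cat("single-group estimate:", p_single, "\n")
cat("sub-group mean:", mean(p_groups), " df:", res$parameter, "\n")
```

With equal-sized sub-groups the average of the sub-group proportions equals the single-group estimate, but the test is now based on `k` values instead of `n_total` observations, which keeps the sample size of the test small.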

Different application examples are included in order to illustrate the procedure proposed for
analyzing statistical tests of proportions.


Acknowledgments

This research did not receive any specific grant from funding agencies in the public,
commercial, or not-for-profit sectors.

References

[1] Hernandez, H. (2021). Optimal Significance Level and Sample Size in Hypothesis Testing. 1.
Tests of Means. ForsChem Research Reports, 6, 2021-06. doi: 10.13140/RG.2.2.18643.09762.
[2] Hernandez, H. (2021). Optimal Significance Level and Sample Size in Hypothesis Testing. 2.
Tests of Variances. ForsChem Research Reports, 6, 2021-07. doi: 10.13140/RG.2.2.11266.20161.
[3] Hernandez, H. (2021). Optimal Significance Level and Sample Size in Hypothesis Testing. 3.
Large Samples. ForsChem Research Reports, 6, 2021-08. doi: 10.13140/RG.2.2.31487.33449.
[4] Thomopoulos, N. T. (2017). Statistical Distributions: Applications and Parameter Estimates.
Springer International Publishing. Chapter 14. p. 120.
[5] Hernandez, H. (2020). On the Discreteness of Measured Variables and the Continuous
Approximation. ForsChem Research Reports, 5, 2020-20. doi: 10.13140/RG.2.2.27740.00646.
[6] Hernandez, H. (2019). Sums and Averages of Large Samples Using Standard
Transformations: The Central Limit Theorem and the Law of Large Numbers. ForsChem
Research Reports, 4, 2019-01. doi: 10.13140/RG.2.2.32429.33767.
[7] Hernandez, H. (2021). Quantitative Analysis of Categorical Variables. ForsChem Research
Reports, 6, 2021-04. doi: 10.13140/RG.2.2.19233.33129.
[8] Scott, S. J., Jones, R. A., & Williams, W. (1984). Review of data analysis methods for seed
germination. Crop Science, 24(6), 1192-1199.
[9] Riaz, M. H., Khitab, A., Ahmad, S., Anwar, W., & Arshad, M. T. (2020). Use of ceramic waste
powder for manufacturing durable and eco-friendly bricks. Asian Journal of Civil Engineering,
21(2), 243-252.
[10] Hernandez, H. (2021). Testing for Normality: What is the Best Method? ForsChem Research
Reports, 6, 2021-05. doi: 10.13140/RG.2.2.13926.14406.
[11] Hernandez, H. (2018). Multidimensional Randomness, Standard Random Variables and
Variance Algebra. ForsChem Research Reports, 3, 2018-02. doi: 10.13140/RG.2.2.11902.48966.
[12] Brown, R. B. (2021). Outcome reporting bias in COVID-19 mRNA vaccine clinical trials.
Medicina, 57(3), 199.
[13] e-Stat Statistics of Japan (2021). Time Series Tables (Statistics Dashboard).
https://dashboard.e-stat.go.jp/en/timeSeries?fieldCode=02. Accessed: 10/06/2021.
