
Building and Environment 45 (2010) 1202–1213


Application of statistical power analysis – How to determine the right sample size in human health, comfort and productivity research

Li Lan, Zhiwei Lian*

Institute of Refrigeration & Cryogenics, School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China

* Corresponding author. Tel.: +86 21 34204263; fax: +86 21 34206814. E-mail address: zwlian@sjtu.edu.cn (Z. Lian).

doi:10.1016/j.buildenv.2009.11.002

Article history: Received 24 July 2009; received in revised form 4 November 2009; accepted 5 November 2009.

Keywords: Indoor environment quality; Sample size; Power analysis; Thermal comfort; Productivity

Abstract: The minimum number of subjects required for research on human health, thermal comfort and productivity is a frequently asked question. In this paper the idea of power analysis, which helps to determine the required sample size as well as to interpret research results, is introduced in order to promote good practice of power analysis in the context of research on the relationship between humans and the building environment. How to calculate effect size from published articles or experimental data is presented with plenty of examples. The effect sizes of several physiological and psychological measurements indicating the effect of indoor environment quality on human health, thermal comfort and productivity are presented; these can serve as references when researchers plan their own studies. How to determine the required sample size when planning a study and how to interpret research results with power analysis are also illustrated step by step with examples. Finally, how to make decisions when evaluating study results is summarized. It is expected that these examples and the summary will help researchers better apply power analysis in indoor environment quality (IEQ) studies. Some statistical terms used in this paper, such as power analysis, effect size, and the t-test, are explained in detail in the Appendix.

1. Introduction

People spend around 90% of their lives indoors. The indoor environment must safeguard and enhance occupants' health, comfort and productivity. There is a continuous and dynamic interaction between occupants and their surroundings that produces physiological and psychological effects on the person. Occupants who experience even sub-clinical symptoms such as headache, eye symptoms and fatigue are less likely to be comfortable and also less likely to be productive. Recently much attention has been paid to the consequences for building users of implementing measurable energy savings in houses, public buildings and industrial buildings. Many studies have been carried out on the effect of indoor environment quality (thermal environment, indoor air quality, light, etc.) on human comfort and productivity; statistical tests are routinely applied, but the control of statistical power cannot be taken for granted. Statistical studies are always better when they are carefully planned. One important aspect of good planning is that the study must be of adequate size relative to its goals. The sample should be large enough to be sensitive to the differences that may exist between treatments. However, it should not be so large that the analysis produces results that are statistically significant yet practically trivial. In an undersized experiment, subjects are exposed to potentially harmful treatments without advancing knowledge, while in an oversized experiment an unnecessary number of subjects are exposed to a potentially harmful treatment, or are denied a potentially beneficial one. A frequently asked question these days is how many subjects are really needed for a thermal comfort or productivity study. Nevertheless, statistical power analysis, which helps to calculate the required sample size for a study, has been largely neglected, including by the present authors [1]. One number may illustrate the situation: the authors checked 30 published papers (in such journals as Human Factors, Indoor Air, etc.) that investigated the relationship between indoor environment quality and occupants' productivity or performance, and none of them reported a power analysis, including some by eminent researchers in this area [2–5].

It is increasingly common for researchers to use and report statistical power or sample size estimates in proposals and published research. For example, when investigating the effect of levodopa on odor identification in humans, Rösser et al. (2008) stated ''Sample size was determined by power analysis (NQuery) based on a previous study of our group'' [6]; in research on the effect of creatine on cognitive function in young adults, Rawson et al. (2008) mentioned ''Sample size was estimated based on data from McMorris et al. and conducted assuming a power (1 − β) of 0.80 and α = 0.05'' [7]. However, the lack of power analysis in research on the interaction between humans and the building environment may partly be due to a poor understanding of what it is, what it can tell us, and how it works. An understanding of statistical power supports two main applications [8]: (1) to estimate the prospective power of a study, and (2) to estimate the parameters required to achieve a desired level of statistical power for a proposed study. The aim of this paper is to promote good practice of power analysis in studies of the relationship between humans and the building environment by introducing the concept of power analysis and illustrating how to apply it. The required sample size for some commonly used statistical tests and the expected effect sizes of some productivity measurements are presented.
the same way as post hoc power analysis. This is particularly
2. What is power analysis?

The statistical power of a study depends on two main factors: how big an effect (the effect size) the research hypothesis predicts, and how many subjects are in the study (the sample size) [9]. In any study, the bigger the difference you expect between the two populations – that is, the greater the effect size – the more statistical power the study has. As to sample size, basically, the more people there are in the study, the more statistical power, because the larger the sample size, the smaller the standard deviation of the means. Statistical power is also affected by the significance level chosen, by whether a one-tailed or two-tailed test is used, and by the kind of hypothesis-testing procedure used. Further details on power analysis can be found in Appendix A.

2.1. Types of power analysis

A priori and post hoc power analyses are the two most common types of power analysis. A priori power analysis is usually used to determine the necessary sample size N of a study; it provides an efficient method of controlling statistical power before a study is actually conducted, and can therefore be recommended whenever resources such as the time and money required for data collection are not critical [10]. Post hoc analysis is less ideal than a priori analysis because only α is controlled, not β [11]. A post hoc analysis can be used to assess whether or not a published statistical test in fact had a fair chance of rejecting an incorrect null hypothesis. There is a third variant – compromise power analysis – which can be useful both before and after data collection [10]. It provides a pragmatic solution to the frequently encountered problem that the ideal sample size N calculated by an a priori power analysis exceeds the available resources. In such a situation, a researcher can specify the maximum affordable sample size and use a compromise power analysis to compute the α and 1 − β associated with a chosen error probability ratio q = β/α. Alternatively, if a study has already been conducted but not yet analyzed, a researcher can ask for a reasonable decision criterion that guarantees perfectly balanced error risks given the size of the sample and the critical effect size in which he or she is interested. In this paper we focus on the first two types of power analysis.

Post hoc power analysis should not be confused with so-called retrospective power analysis, in which the effect size is estimated from the sample data of the study and used to calculate the observed power, a sample estimate of the true power [10]. Computer software such as SPSS readily performs these calculations under the guise of ''observed power''.

One motivation for the use of retrospective power calculations is the desire to assess the strength of evidence for a null hypothesis – something that standard hypothesis tests are not designed for [8]. But the observed power can never fulfill this goal [12]. A retrospective power calculation estimates the population effect size by the observed effect size in the sample data, so it rests on the highly questionable assumption that the sample effect size is essentially identical to the effect size in the population from which it was drawn [13]. Obviously, this assumption is likely to be false, and the more so the smaller the sample. In fact, the observed power is determined completely by the significance level of a test (the P-value) and therefore adds nothing to the interpretation of results [12]. One important thing to note here is that post hoc analysis, like a priori analysis, requires researchers to specify the population effect size on a priori grounds.

However, retrospective power is frequently interpreted in much the same way as post hoc power analysis. This is particularly problematic when retrospective power calculations are used to interpret the results of significance tests [8]. Since the observed power is a mere function of the observed effect size, and hence of the observed P-value, statistical significance (a lower P-value) will in general result in high observed power, and non-significance (P > 0.05, etc.) will result in low observed power. If the observed power for non-significant results is used as an indication of the strength of evidence for the null hypothesis, it will erroneously suggest that the lower the P-value, the stronger the evidence in favor of the null hypothesis [12]. For significant results, high observed power will act to (falsely) strengthen the conclusions that the researcher has drawn. In either case, retrospective power is highly undesirable [8].

2.2. The importance of sufficient statistical power

Statistical significance is extremely important, but sophisticated researchers and readers of research understand that there is more to the story of a study's result than P < 0.05 or ns (not significant). Following Cohen's (1962) pioneering work on the power of statistical tests in behavioral research [14], many authors have stressed the necessity of statistical power analysis.

Sample size calculations and power analysis are often critical for researchers to address specific scientific hypotheses and confirm credible treatment effects. Determining statistical power is very important when planning a study: if you do a study in which the statistical power is low, then even if the research hypothesis is true, the study will probably not give statistically significant results, and the time and expense of carrying it out would probably not be worthwhile [15]. Calculating statistical power when planning a study helps to determine how many subjects are needed. If the sample size is too small, the statistical tests may not be able to detect a difference that in reality is there: the smaller the sample, or the smaller the true difference if it exists, the greater the probability of accepting the null hypothesis in error. The importance of this for experiments and surveys should be obvious. If, for example, the indoor environment quality has negative effects on occupants' health and productivity, but the effects are not yet obvious or large at an early stage or during a short investigation period – which is especially true for laboratory studies – the t-test or analysis of variance (ANOVA) may lead the researcher to accept the null hypothesis and conclude that there is no significant effect when in reality there is. The consequences can be serious, as the negative effect of indoor environment quality on occupants' health should be prevented at its early stage. Moreover, even a small productivity loss may result in a large economic loss, since the cost of the people in an office is an order of magnitude higher than the cost of maintaining and operating the building [16]. So, with inadequate statistical power, the work is likely to be not only a waste of time and energy, but also misleading.

Understanding statistical power is also extremely important when looking at the results of a study, particularly for making sense of results that are not statistically significant, or results that are statistically significant but not practically significant. The commonest factors that make a test unable to detect a change are too few samples, too small a difference between the means, and a large variation in the values making up the means. It is therefore necessary to check whether or not the test could have shown a difference where a difference existed in reality. A statistically non-significant result from a study with high statistical power does suggest that either the research hypothesis is false or the effect is smaller than predicted. For example, if a study investigating the effect of ventilation rate on human cognitive function is planned with a statistical power of 90% for an estimated effect size of 0.5 and comes out with a statistically significant difference, then we can be confident that the ventilation rate really affects cognitive function and human productivity. Implicit in this is that if no statistically significant result is found, there is a 90% chance that either the ventilation rate really does not affect human cognitive function or the effect size is in fact less than 0.5. A statistically non-significant result from a study with low statistical power, by contrast, is truly inconclusive; as Cohen (1988) put it, 'the null hypothesis had been mistakenly not rejected and a real effect was ignored because of inconclusive results – indeed, often treated as nonexistent' [9].
3. Calculation of effect size

In hypothesis testing, when a result has a small P-value we say that it is ''statistically significant''. In common usage, the word significant means ''important'', so it is tempting to think that statistically significant results must always be important. This is not the case, as the P-value does not measure practical significance. What it does measure is the degree of confidence we can have that the true value is really different from the value specified by the null hypothesis. When the P-value is small, we can be confident that the true value is really different, but this does not necessarily imply that the difference is large enough to be of practical importance; sometimes statistically significant results have no scientific or practical importance. It is effect size that measures the difference between the true value and the value specified by the null hypothesis, and hence indicates whether the difference is practically important. Cohen proposed effect size conventions based on the effects found in psychology and behavioral research in general [9]. Effect size is discussed in detail in Appendix A.
In practical settings the population values are typically not known and must be estimated from sample statistics. There are several versions of effect size based on means, differing in which statistics are used. Cohen's d, which is defined as the difference between two means divided by a standard deviation for the data [9], is one of the most commonly used measures of effect size. This paper provides some simple methods and examples for calculating Cohen's d for both t-tests and ANOVA from experimental data as well as from published research [17,18]. Cohen's d has two advantages over other effect size measures [18]. First, its burgeoning popularity is making it the standard; its calculation thus enables immediate comparison with an increasingly large number of published studies. Second, Cohen's effect size conventions enable us to compare an experiment's effect size results to known benchmarks.

3.1. Methods for effect size calculation

3.1.1. Calculation of effect size of t-tests

For the two types of t-test, Cohen (1988) defined effect sizes of 0.2, 0.5, and 0.8 as small, medium, and large, respectively [9]. Cohen's d is calculated with formula (1):

d = \frac{\bar{x}_1 - \bar{x}_2}{S}   (1)

S = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2}}   (1a)

where \bar{x} = group mean, S = standard deviation, n = number of subjects, and subscripts 1 and 2 refer to the two groups.

To estimate Cohen's d from a published article, the means and numbers of subjects, as well as the standard deviations, of the two groups must be reported. For example, Kahl (2005) investigated the effect of room temperature on mental task performance with 176 subjects (140 women and 36 men) [19]. The performance (mean ± standard deviation) on the reading task was 7.63 ± 3.17 for the male group and 6.04 ± 2.48 for the female group. With these data, S is first calculated with formula (1a) as 2.618; Cohen's d then follows from formula (1). The result is 0.61, indicating that males performed better than females, with a medium-sized effect.
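Scripting this calculation makes it easy to reuse. Below is a minimal Python sketch (the function name is ours) of formulas (1) and (1a), checked against the Kahl (2005) numbers:

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d for two independent groups, using the standard
    deviation of formula (1a) (note the n1 + n2 denominator)."""
    s = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2))
    return (mean1 - mean2) / s

# Kahl (2005) reading task: 36 men (7.63 +/- 3.17), 140 women (6.04 +/- 2.48)
d = cohens_d(7.63, 3.17, 36, 6.04, 2.48, 140)
print(round(d, 2))  # 0.61 -> a medium effect
```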
When an experiment that uses a t-test does not list standard deviations but does list standard errors (SE), the standard deviations can be recovered from the number of subjects:

S = SE \cdot \sqrt{n}   (2)

The t statistic can be used to calculate Cohen's d when a study that uses a t-test does not list standard deviations:

d = t \sqrt{\left(\frac{n_1 + n_2}{n_1 n_2}\right) \left(\frac{n_1 + n_2}{n_1 + n_2 - 2}\right)}   (3)

In this situation, the t statistic and the number of subjects within each group must be reported, and the published article will usually show the t statistic. Here is another example. Raymore et al. (2001) compared the socioeconomic index (SEI) of students who went away from home to go to college versus those who stayed at home. Here is an excerpt from their results section: ''…females who had left home were from higher SEI homes (N = 115) than college females who had not left home (N = 74) (t = 4.19, df = 187, p < 0.05)'' [20]. Using these data, Cohen's d is 0.63, calculated with formula (3).

If the article does not list the number of subjects in each group but does list the total number of subjects n, Cohen's d can be estimated using formula (3a), assuming that both groups have roughly equal numbers of subjects:

d \approx \frac{2t}{\sqrt{n - 2}}   (3a)
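Formulas (3) and (3a) can be scripted the same way; the sketch below (function names are ours) reproduces the Raymore et al. value and shows how close the approximation stays even though the two groups are not exactly equal in size:

```python
import math

def d_from_t(t, n1, n2):
    """Cohen's d recovered from a reported t statistic, formula (3)."""
    n = n1 + n2
    return t * math.sqrt((n / (n1 * n2)) * (n / (n - 2)))

def d_from_t_approx(t, n_total):
    """Approximation (3a), assuming roughly equal group sizes."""
    return 2 * t / math.sqrt(n_total - 2)

# Raymore et al. (2001): t = 4.19, N1 = 115, N2 = 74
print(round(d_from_t(4.19, 115, 74), 2))     # 0.63
print(round(d_from_t_approx(4.19, 189), 2))  # 0.61, close to the exact value
```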
However, formulas (3) and (3a) cannot be used for repeated-measures designs. The paired-samples t-test is used to test the null hypothesis that the average of the differences between a series of paired observations is zero. Observations are paired when, for example, they are performed on the same samples or subjects. The effect size d is then defined as:

d = \frac{|\bar{x}_z|}{S_z} = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{S_1^2 + S_2^2 - 2 r_{12} S_1 S_2}}   (4)

where r_{12} denotes the correlation between the two random variables, and \bar{x}_z and S_z are the mean and standard deviation of the paired differences z.
Comparing formula (1) with formula (4), it can be seen that the main difference in the calculation of effect size between the independent-samples t-test and the paired-samples t-test is the correlation parameter r12. The paired-samples t-test is used when each subject is measured twice and the two measurements are dependent on each other; the calculation of effect size for the paired-samples t-test should therefore take into account the correlation between the two scores. The correlation parameter r12 is always larger than zero, and prior evidence suggests that for thermal comfort or productivity measurements the value of r12 varies from 0.7 to 1.0 [1,22–24].

Here is one more example. In a repeated-measures experiment, the increment of transcranial magnetic stimulation (TMS)-evoked thumb movements falling in the training target zone (TTZ) was 8.8 ± 2.7% with levodopa and 2.6 ± 1.0% with placebo, expressed as mean and standard error [21]. The standard deviations calculated with formula (2) were 8.1% and 3.0%, respectively. Unfortunately, the correlation r12 between the two data sets was not reported, so r12 = 0.8 was assumed based on our previous studies [22,23]. The effect size calculated with formula (4) is then 1.04.

It can be seen from formulas (1) and (4) that effect size is affected considerably by the variation of the samples: for a particular pair of means, as the standard deviation increases, the effect size becomes smaller. Compared with the t-test for independent samples, the effect size of the paired-samples t-test will always be larger because it takes into account the correlation between the two scores.
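The sketch below combines formulas (2) and (4) to reproduce this example; n = 9 per condition is implied by the reported standard errors and standard deviations, and r12 = 0.8 is the correlation assumed above:

```python
import math

def sd_from_se(se, n):
    """Formula (2): recover a standard deviation from a standard error."""
    return se * math.sqrt(n)

def paired_d(mean1, sd1, mean2, sd2, r12):
    """Effect size for paired observations, formula (4)."""
    s_z = math.sqrt(sd1**2 + sd2**2 - 2 * r12 * sd1 * sd2)
    return abs(mean1 - mean2) / s_z

# TMS example [21]: 8.8 +/- 2.7 (SE) with levodopa, 2.6 +/- 1.0 (SE) with placebo
sd1 = sd_from_se(2.7, 9)  # 8.1
sd2 = sd_from_se(1.0, 9)  # 3.0
print(round(paired_d(8.8, sd1, 2.6, sd2, 0.8), 2))  # 1.04
```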
The performance measurements include neurobehavioral tests and
3.2. Calculate effect size of analysis of variance (ANOVA) simulated office work. The distinguishing characteristic of neuro-
behavioral approach is its emphasis on the identification and
Cohen (1988) defined the effect size of 0.1, 0.25, and 0.4 as small, measurement of behavioral deficits, for the influence of environ-
medium, and large, respectively for the F-test [9]. Cohen’s f is an ment on brain functions manifests behaviorally. Neurobehavioral
appropriate effect size measure in the context of an F-test for approach is neurobiologically justified since the central nervous
ANOVA. The Cohen’s f is defined as: system displays particular sensitivity to environmental distur-
bance. As a result, behavioral changes represent an avenue through
f ¼ Sm =S (5) which to evaluate early and less obvious effects of environmental
factors. The rationale for using physiological methods is based on
In formula (5) Sm is the standard deviation of the group means xi, the reasoning that physiological measure of activation or arousal
and S is the common standard deviation within each groups. are associated with increased activity in the nervous system, which
A different but equivalent way to specify the effect size is in is equated with an increase in the stress on the worker. Common
terms of partial Eta squared h2, which is defined as physiological measurements include: (a) cardiovascular measures
(heart rate, blood pressure); (b) respiratory system (respiration
S2Between $dfBetween rate, oxygen consumption); (c) nervous system (brain activity,
h2 ¼ (6)
SBetween $dfBetween þ S2Within $dfWithin
2
muscle tension, pupil size); (d) biochemistry (catecholamine).
Subjective questionnaires include rating scales for self-rated
where S2 is the population variance, df is the degree of freedom,
performance, workload assessment (e.g. NASA_TLX), or motivation
subscripts Between and Within mean between-group and within-
to do work. The rating scales are useful tools in tapping worker’s
group.
internal feelings.
That is, h2 is the ratio of the between-groups variance and the
The effect size of the three types of measurement are calculated
total variance and can be interpreted as ‘‘proportion of variance
based on our former studies [1,22–24], as shown in Table 1, which
explained by group membership’’. When the between-groups and
can be worked as a reference for figuring out required sample size
within-groups variance estimates are not available, as is often true
when planning similar studies. The partial Eta squared h2 is the
in published research, it is possible to figure out h2 directly from
analysis result of experiment data with the SPSS software. The ratio
F statistic and the degrees of freedom. The formula is:
of correct ratio or memory capacity (accuracy) and the response
F$dfBetween time of each test were used as the performance index of neuro-
h2 ¼ (7) behavioral tests. The effect size conventions for ANOVA (the effect
F$dfBetween þ dfWithin
size of 0.1, 0.25, and 0.4 as small, medium, and large, respectively)
The relationship between h2 and f is: were applied here.
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi As shown in Table 1, the effect size indicates that there was
 ffi
f ¼ h2 = 1  h2 (8) a large variation of heart rate variability (HRV) with perception of
thermal comfort, suggesting that HRV may be related to thermal
Consider once again Kahl’s (2005) study [19]. Here is an excerpt comfort and it may be useful to understand the mechanism of
from the results section: ‘‘There was a main effect of temperature thermal comfort [22,23]. It also can be seen from Table 1 that room
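Both conversions are one-liners in code; this Python sketch (function names are ours) reproduces the worked example:

```python
import math

def eta_squared(F, df_between, df_within):
    """Partial eta squared from an F statistic, formula (7)."""
    return F * df_between / (F * df_between + df_within)

def f_from_eta2(eta2):
    """Cohen's f from partial eta squared, formula (8)."""
    return math.sqrt(eta2 / (1 - eta2))

# Kahl (2005): F = 4.99, df_between = 3, df_within = 176
eta2 = eta_squared(4.99, 3, 176)
print(round(eta2, 3), round(f_from_eta2(eta2), 2))  # 0.078 0.29
```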
3.3. IEQ study

Sections 3.1 and 3.2 provided methods and examples for calculating effect size from published research. When searching the literature for this purpose, however, it was found that most published IEQ studies do not present sufficient or standard statistical results, often indicating only a P-value (such as P < 0.01), so that it is impossible to calculate effect size from their results. In this section, the effect sizes of some productivity research conducted by the authors are presented.

The methods of productivity measurement can be classified into three categories [25]: (1) performance measurement; (2) physiological parameter measurement; and (3) subjective questionnaires. The performance measurements include neurobehavioral tests and simulated office work. The distinguishing characteristic of the neurobehavioral approach is its emphasis on the identification and measurement of behavioral deficits, for the influence of environment on brain functions manifests behaviorally. The neurobehavioral approach is neurobiologically justified, since the central nervous system displays particular sensitivity to environmental disturbance; behavioral changes therefore represent an avenue through which to evaluate early and less obvious effects of environmental factors. The rationale for using physiological methods is that physiological measures of activation or arousal are associated with increased activity in the nervous system, which is equated with an increase in the stress on the worker. Common physiological measurements include: (a) cardiovascular measures (heart rate, blood pressure); (b) the respiratory system (respiration rate, oxygen consumption); (c) the nervous system (brain activity, muscle tension, pupil size); and (d) biochemistry (catecholamines). Subjective questionnaires include rating scales for self-rated performance, workload assessment (e.g. NASA_TLX), or motivation to do work. Rating scales are useful tools for tapping workers' internal feelings.

The effect sizes of the three types of measurement were calculated based on our former studies [1,22–24], as shown in Table 1, and can serve as a reference for figuring out the required sample size when planning similar studies. The partial Eta squared \eta^2 is the analysis result of the experimental data with the SPSS software. The ratio of correct answers or memory capacity (accuracy) and the response time of each test were used as the performance indices of the neurobehavioral tests. The effect size conventions for ANOVA (0.1, 0.25, and 0.4 as small, medium, and large, respectively) were applied here.

Table 1
Effect size of three types of productivity measurement: performance measurement, physiological parameter measurement, and subjective rating questionnaires [1,22–24].

Measurements                 Items                          Partial Eta squared η²   Effect size f (a)
Neurobehavioral tests        Letter search                  0.038                    0.200 (M)
                             Overlapping                    0.065                    0.264 (M)
                             Number calculation             0.204                    0.506 (L)
                             Conditional reasoning          0.107                    0.346 (M)
                             Spatial image                  0.027                    0.166 (S)
                             Picture recognition            0.018                    0.135 (S)
                             Symbol-digit modalities test   0.032                    0.182 (S)
                             Visual choice RT               0.023                    0.153 (S)
Physiological measurement    Heart rate variability (HRV)   0.259                    0.591 (L)
Subjective ratings           Overall workload (NASA_TLX)    0.102                    0.337 (L)
                             Motivation                     0.103                    0.339 (L)
                             Well-being                     0.302                    0.658 (L)
                             Emotion (POMS)                 0.138                    0.400 (L)

(a) Effect size: S – small; M – medium; L – large.

As shown in Table 1, the effect size indicates that there was a large variation of heart rate variability (HRV) with the perception of thermal comfort, suggesting that HRV may be related to thermal comfort and may be useful for understanding its mechanism [22,23]. It can also be seen from Table 1 that room temperature had relatively large effects on subjective ratings, but only small or medium effects on the performance of neurobehavioral tests. The results demonstrate that when individuals were encouraged to do their best, they could maintain their performance under adverse conditions for a short time by exerting more effort. However, it is highly questionable whether this holds true for sustained periods of actual work, as the subjective ratings indicate that subjects experienced more negative emotions and more sick building syndrome (SBS) symptoms, and therefore had lower motivation to do work, under adverse conditions [1,24].

[Fig. 1. Sample size as a function of statistical power of t-test for three different effect sizes: small (0.2), medium (0.5), and large (0.8). Continuous lines are for t-test with paired samples, dashed lines for t-test with independent samples. α for all curves is 0.05.]
4. Some quick guides to sample size calculation

When planning a study, the main reason researchers consider statistical power is to decide how many subjects should be included in the study. Sample size has an important influence on statistical power, so a researcher wants to be sure to have enough people in the study to achieve fairly high statistical power. A priori power analysis provides an efficient method of controlling statistical power before a study is actually conducted and can be recommended whenever resources such as the time and money required for data collection are not critical. The magnitude of the predicted effect size is needed when figuring out the required sample size. One common method is to run a pilot study; a successful pilot study should provide reasonable estimates of the necessary population parameters that contribute to the standardized effect size [10]. A pilot study is not always possible, and some alternative ways to estimate effect size are available: basing it on precise theory, on previous experience with research of a similar kind, or on what would be the smallest difference that would be useful, according to Cohen's conventions. Section 3 presented the methods to calculate effect size from published research, the effect sizes of some productivity studies, and Cohen's conventions; these are expected to help researchers figure out the required sample size of their own studies.

The needed sample sizes and the statistical power for commonly used t-tests and F-tests are illustrated in Figs. 1–4. They can serve as quick guides to the calculation of sample size and statistical power for some generic tests; for a particular experiment or result, researchers should still conduct their own more specific analysis. Several excellent software packages are available today which make the analysis so straightforward that there is now no excuse for not doing so [26], and some of them are available at no cost. Data in the following tables and graphs were calculated using G*Power 3, a noncommercial program for performing various types of power analysis that can be downloaded free of charge at www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3.

The graphs shown in Figs. 1–4 were drawn with data calculated by G*Power 3. Several 'varieties' of each test exist, e.g. one-tailed and two-tailed t-tests, and numerous kinds of analysis of variance. For the t-tests, two-tailed results with dependent and independent samples are shown; as to ANOVA, the one-way fixed effects ANOVA and the one-way ANOVA for repeated measures are shown. In G*Power 3, sample size is the total number of samples over all groups, whereas the sample size shown in Figs. 1–4 refers to the number of subjects in each group. For example, if the required sample size is 10 as shown in Figs. 1–4 and there are 2 groups in the experiment, then each group should have 10 subjects. In this way the difference in statistical power between within-subject and between-subject designs can be illustrated more clearly. Between-subject design and within-subject design are the two basic research designs associated with the experimental research strategy [27]. In a between-subject design (also called an independent-measures design), the results of separate groups of subjects are compared, while a within-subject design is an experiment in which the same group of subjects is repeatedly exposed to more than one treatment. So, for the above example, the total sample size would be 20 if a between-subject design were used, while it would still be 10 with a within-subject design. The curves in each graph assume that the number of samples is identical in each group; if groups contain different sample sizes, a power analysis software package should be consulted, from which power values may be obtained for any configuration of sample sizes. It should be noted that, for a given total sample size and a given effect size, the more unequal the groups, the smaller the statistical power [11].

[Fig. 2. Sample size as a function of effect size of t-tests for different statistical powers (0.6, 0.8, and 0.9). Continuous lines are for t-test with paired samples, dashed lines for t-test with independent samples. α for all curves is 0.05.]

4.1. Power analysis of t-test

Fig. 1 shows how the required sample size changes with statistical power, using different effect sizes (small, medium and large), for the t-test with paired samples and with independent samples. Fig. 2 shows the required sample size as a function of effect size, for statistical power varying from 0.6 to 0.95. It should be noted that the dashed lines indicate the sample size of each group for the independent samples, assuming the number of subjects is identical in each group; for example, if the calculated total required sample size is 120 for the t-test with independent samples, then the sample size of each group is 60. As shown in Figs. 1 and 2, even with the same number of subjects in each group, the between-subject design is less powerful than the within-subject design: a between-subject design needs more than double the samples of a within-subject design to achieve the same statistical power. It can also be seen from Figs. 1 and 2 that more samples are needed as the effect size decreases or the required statistical power increases.
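For readers who prefer scripting to a graphical tool, the free statsmodels package in Python can reproduce the generic t-test cases underlying Figs. 1 and 2; a minimal sketch (values may differ from the curves by a subject or so, since non-integer solutions are rounded up):

```python
import math
from statsmodels.stats.power import TTestPower, TTestIndPower

# Within-subject design: number of pairs for d = 0.5,
# two-tailed alpha = 0.05, power = 0.8
n_pairs = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)

# Between-subject design: subjects *per group* for the same specification
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)

print(math.ceil(n_pairs), math.ceil(n_per_group))  # about 34 pairs vs 64 per group
```

Consistent with the figures, the between-subject design needs far more subjects in total (here 128 versus 34) to detect the same medium effect.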

4.2. Power analysis of ANOVA

The results of one-way ANOVA and one-way ANOVA for repeated measures are shown and compared, with a 4-level independent variable. Fig. 3 shows sample size as a function of statistical power for different effect sizes (small, medium and large). Fig. 4 shows how sample size changes with effect size for statistical power ranging from 0.6 to 0.95. Again, the dashed lines indicate the sample size of each group for the independent samples, assuming the number of subjects is identical in each group. The one-way ANOVA for repeated measures is based on the sphericity assumption. This assumption is correct if (in the population) all variances of the repeated measurements are equal and all correlations between pairs of repeated measurements are equal. If all the distributional assumptions are met, then the one-way ANOVA for repeated measures is the most powerful method [10]. Unfortunately, the assumption of equal correlations is violated quite often, which can lead to very misleading results. In order to compensate for such adverse effects, the noncentrality parameter and the degrees of freedom of the F distribution can be multiplied by a correction factor ε. ε = 1 if the sphericity assumption is met, and it approaches 1/(m − 1) with increasing degrees of violation of sphericity, where m denotes the number of repeated measurements. In Figs. 3 and 4, the curves for repeated measures are based on ε = 0.5. It can also be seen from Figs. 3 and 4 that repeated measures are more powerful than independent measures, and fewer samples are needed to detect a given effect size.

[Fig. 3. Sample size as a function of statistical power of ANOVA for three different effect sizes: small (0.1), medium (0.25), and large (0.4). Continuous lines are for one-way ANOVA for repeated measures, dashed lines for one-way ANOVA. α for all curves is 0.05. For one-way ANOVA for repeated measures, the nonsphericity correction ε is 0.5 and the correlation among repeated measures is 0.5.]

[Fig. 4. Sample size as a function of effect size of ANOVA for different statistical powers (0.6, 0.8, and 0.9). Continuous lines are for one-way ANOVA for repeated measures, dashed lines for one-way ANOVA. α for all curves is 0.05. For one-way ANOVA for repeated measures, the nonsphericity correction ε is 0.5 and the correlation among repeated measures is 0.5.]
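The dashed (between-subject) ANOVA curves can be checked the same way; statsmodels has no built-in routine for the repeated-measures case with nonsphericity correction, so only the independent-measures variant is sketched here:

```python
import math
from statsmodels.stats.power import FTestAnovaPower

# One-way fixed-effects ANOVA, k = 4 groups, medium effect f = 0.25,
# alpha = 0.05, power = 0.8; solve_power returns the TOTAL sample size
n_total = FTestAnovaPower().solve_power(effect_size=0.25, alpha=0.05,
                                        power=0.8, k_groups=4)
print(math.ceil(n_total))  # about 180 in total, i.e. roughly 45 per group
```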

5. Application of power analysis

5.1. An example of a priori power analysis

The technical and non-technical facets of power analysis require that researchers think carefully about their research prior to collecting data. Fig. 5 illustrates the procedure for figuring out the required sample size with an a priori power analysis. There are five steps to calculate the required sample size: (1) Specify a hypothesis test on a parameter that the researcher is interested in (along with the underlying probability model for the data). Which type of hypothesis is it? One-tailed or two-tailed? (2) Determine the type of experimental design and the statistical model to be used. For example, if a within-subject design is used, then the paired-samples t-test or ANOVA for repeated measures should be applied. (3) Determine the statistical power level and the significance level. As mentioned, usually a study with a statistical power of 80% or more is worth doing; as to the significance level, 0.05 is usually used. (4) Determine the expected effect size. As we have mentioned, the magnitude of the predicted effect size can be estimated in many ways, such as by running a pilot study, or from precise theory, previous experience with research of a similar kind, or Cohen's conventions. (5) Figure out the required sample size with software, graphs or tables for power analysis. Next, how to determine the sample size step by step with an a priori power analysis is illustrated with an example.

[Fig. 5. Procedure of sample size calculation with a priori power analysis.]
An experiment will be performed to study the physiological mechanism of the effects of air temperature on human productivity. HRV is selected as the physiological index assessing autonomic nervous system function. Four temperature levels, from cold to warm, will be investigated. The first step is to specify a hypothesis test on a parameter: it is supposed that either cold or warm will increase the HRV value, so it is a two-tailed test. The second step is to determine the type of experimental design. A within-subject design is used in this study, so ANOVA with repeated measures will be used as the statistical model. The nonsphericity correction ε is set to 0.5, and the correlation among repeated measures is set to 0.5. The number of groups and repeated measures can also be determined: in this study, all subjects will be exposed to the four temperature conditions, so there is one group with a repetition number of 4. Then the statistical power level and significance value should be determined; the statistical power level is set to 0.8 and the significance level α to 0.05, according to the widely used rule. The fourth step is to determine the expected effect size. In this example, the expected effect size can be estimated from our previous study and is determined to be 0.5 (as shown in Table 1). The effect size can also be calculated with software from more basic parameters characterizing the expected population scenario, as shown at the right side of Fig. 6. Now we have all of the specifications needed for calculating the sample size with an a priori power analysis and come to the last step: figure out the required sample size with software or graphs for power analysis. Based on these data, the required sample size should be around 10 if we check the curves shown in Fig. 4. The required sample size calculated with the G*Power software is 11, as shown in Fig. 6. Note that the actual power will often be slightly larger than the pre-specified power, because non-integer sample sizes are always rounded up to obtain integer values consistent with a power level not lower than the pre-specified one.

[Fig. 6. Example of calculation of required sample size with power analysis software.]
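G*Power is not the only way to obtain this number. Power is simply the probability that a noncentral F variable exceeds the central-F critical value, so any library exposing the noncentral F distribution will do. In the SciPy sketch below, the noncentrality parameter and the ε-scaled degrees of freedom follow what we take to be G*Power 3's convention for within-subject factors (λ proportional to f²·N·m/(1 − ρ), with λ and both dfs multiplied by ε); this parameterization is an assumption on our part, so treat the exact value as indicative:

```python
from scipy.stats import f, ncf

def f_test_power(df1, df2, lam, alpha=0.05):
    """Power of an F test: P(noncentral F > central-F critical value)."""
    f_crit = f.ppf(1 - alpha, df1, df2)
    return ncf.sf(f_crit, df1, df2, lam)

# Section 5.1 scenario: f = 0.5, N = 11 subjects, m = 4 repeated measures,
# correlation rho = 0.5, nonsphericity correction eps = 0.5
N, m, f_eff, rho, eps = 11, 4, 0.5, 0.5, 0.5
lam = eps * f_eff**2 * N * m / (1 - rho)   # noncentrality, assumed convention
df1 = eps * (m - 1)
df2 = eps * (m - 1) * (N - 1)
print(round(f_test_power(df1, df2, lam), 2))  # roughly 0.8, in line with Fig. 6
```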

5.2. Examples of post hoc power analysis

Post hoc power analysis often makes sense after a study has already been conducted and helps to interpret the study results: do results that are statistically non-significant prove that the treatments have no effect, or are they just inconclusive? The procedure of post hoc power analysis is similar to a priori power analysis and can be divided into 3 steps: (1) gather the needed information: the type of hypothesis test and statistical model, the sample size, etc.; (2) determine the significance level and the expected effect size, the latter estimated by the methods mentioned above; and (3) calculate the statistical power with software or graphs. Tsutsumi et al. (2007) investigated the effect of humidity on human productivity under transient conditions, from a hot and humid environment to a thermally neutral condition, with 12 subjects in repeated measures [28]. No significant effect of humidity on occupants' performance was found. So can we assume that humidity has no effect on human productivity? How powerful was the study? A post hoc power analysis is performed to interpret the result. As shown in Fig. 7a, post hoc power analysis is selected and the two-tailed paired-samples t-test is chosen as the statistical model. The total sample size is 12 and α is set to 0.05. Then a medium effect size of 0.5 is assumed on a priori grounds, referring to Table 1. With these data, the statistical power of the study is calculated to be only 0.35. The relatively low power suggests that the statistically non-significant results of the study are merely inconclusive; one should not conclude that humidity has no effect on human productivity, and validation of the results in further investigations with more subjects is needed.
Let's consider another example of post hoc power analysis. Bakó-Biró et al. (2004) investigated the effect of the presence of personal computers (PCs) on perceived air quality, SBS symptoms and productivity [2]. It was reported that the performance of text typing was significantly reduced when PCs were present. Thirty female subjects were exposed to the two conditions – the presence or absence of PCs – and the performance data were analyzed using the paired t-test. Reported P-values were for a one-tailed test, i.e., in the expected direction that the presence of PCs had negative effects on productivity. With these data, a post hoc power analysis is performed on this study, as shown in Fig. 7b. Again a medium effect size of 0.5 is assumed on a priori grounds, referring to Table 1. The statistical power of the study is 0.85, which is higher than the widely required level (0.8). The post hoc power analysis implies that the results of this study should be statistically reliable and important, considering that the sample size is not very large.

[Fig. 7. Example of post hoc power analysis indicating (a) relatively low statistical power; and (b) acceptable statistical power.]
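Both calculations are easy to repeat in Python with statsmodels; the results are essentially the values quoted above (small differences can arise from implementation details):

```python
from statsmodels.stats.power import TTestPower

# Tsutsumi et al. (2007): paired t-test, 12 subjects, two-tailed,
# assumed medium effect d = 0.5
p1 = TTestPower().power(effect_size=0.5, nobs=12, alpha=0.05,
                        alternative='two-sided')

# Bako-Biro et al. (2004): paired t-test, 30 subjects, one-tailed, d = 0.5
p2 = TTestPower().power(effect_size=0.5, nobs=30, alpha=0.05,
                        alternative='larger')

print(round(p1, 2), round(p2, 2))  # approximately 0.35 and 0.85
```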
5.3. Proper application of power analysis

Up to now we have understood that a priori power analysis is used to determine the necessary sample size N of a test given a desired α level, a desired power level (1 − β), and the size of the effect to be detected with probability 1 − β. This provides an efficient method of controlling statistical power before a study is actually conducted.

One motivation to do a priori power analysis is to make sure that enough subjects are involved in the study. Many people may therefore assume that the more subjects in the study, the more important its results. In a sense, just the reverse is true [29]: a study with a very small effect size may still come out statistically significant. This is likely to happen when a study has high statistical power due to other factors, especially a large sample size.

The message here is that in judging a study's results there are two questions. First, is the result statistically significant? If it is, you can consider there to be a real effect. The next question is whether the effect size is large enough for the result to be useful or interesting, especially if the study has practical implications. If the sample is small, you can assume that a statistically significant result is probably also practically important. But if the sample is very large, you must consider the effect size directly, as it is quite possible that the effect size is too small to be useful.

If the result is not statistically significant, the statistical power of the study is considered. A non-significant result from a study with low statistical power is truly inconclusive. However, a non-significant result from a study with high statistical power does suggest that either the research hypothesis is false or there is less of an effect than was predicted when figuring statistical power [29]. Table 2 summarizes the role of significance, sample size, and statistical power in interpreting research results.

Table 2
Role of significance, sample size, and statistical power in interpreting research study results.

Statistically significant?   Sample size   Statistical power   Conclusion
Yes                          Small         High                Important results
Yes                          Large         Low                 Might or might not have practical importance
No                           Small         Low                 Inconclusive
No                           Large         High                Research hypothesis probably false
6. Discussion

For some studies, it may be surprising to see how large a sample is needed in order to detect the predicted effect with sufficient statistical power. Sample size is but one of several quality characteristics of a statistical study; if the sample size is held fixed, we should focus on other aspects of study quality. For example, better instruments can be found that will bring the study up to a reasonable standard. Improvements in study design may also help to reduce the variance of the effect size estimation, as illustrated by the curves shown in Figs. 1–4. A fundamental advantage of the within-subject design is the reduction in error variance associated with individual differences: in a between-subject design, even though subjects are randomly assigned to groups, the two groups may still differ with regard to important individual difference factors that affect the dependent variable. With a within-subject design, the conditions are always exactly equivalent with respect to individual difference variables, since the subjects are the same in the different conditions, and people typically vary less within themselves (that is, when compared to themselves) than when compared to others. Another advantage of the within-subject design is that it has more statistical power. First, by using a within-subject design the number of ''subjects'' is in effect increased relative to a between-subject design. For example, in the experiment comparing productivity under four temperature conditions [1], since the 24 subjects were repeatedly exposed to the four temperature conditions, the experiment had four times as many ''subjects'' as it would have had with a between-subject design. Second, the repeated-measures design removes the variance due to overall individual differences among subjects: in a repeated-measures design it is not the actual score that is compared but its difference from that subject's mean across conditions. The variation due to overall between-person tendencies is therefore eliminated – everyone's scores vary only around their own mean. Now it can be seen why the t-test for dependent means and the one-way ANOVA for repeated measures have more statistical power and require smaller samples than their independent counterparts. So the within-subject design is recommended for human-related investigations, considering the large individual differences and the difficulty of recruiting a large number of subjects. However, it should be noted that the within-subject design has a fundamental disadvantage, referred to as ''carryover effects'', the two basic types of which are practice and fatigue effects: participation in one condition may affect performance in other conditions, creating a confounding extraneous variable that varies with the independent variable. Carryover effects should be controlled with a balanced design when a within-subject design is used [27].

It should also be kept in mind that effect size should be given emphasis in the discussion of results. The Publication Manual of the American Psychological Association states the accepted standard of how to present psychology research results: ''For the reader to fully understand the importance of your findings, it is almost always necessary to include some index of effect size.'' [29]. Effect size not only plays an important role in power analysis, but is also a crucial ingredient in meta-analysis. Meta-analysis is an important recent development in statistics that has had a profound effect on many fields, especially psychology and behavioral studies. This procedure combines results from different studies, even results obtained using different methods of measurement; when combining results, the crucial things combined are the effect sizes. Using meta-analysis, researchers can combine the results of several studies that evaluated the effect of indoor environment quality on occupants' productivity or other aspects, and thus obtain an overall effect size evaluating the extent to which productivity is affected by the change of working conditions. It would also tell how effect sizes differ for studies done in different countries or on different populations. So including effect size when reporting the results of a study helps future researchers figure out statistical power when planning their own studies and, more importantly, provides useful information for future meta-analysts who will combine the results of many related studies.

7. Conclusions

In this paper the methods to calculate effect size from experimental data or published research are presented in detail, along with many examples. It is recommended to include effect size when reporting the results of a study. How to determine the right sample size when planning a study and how to interpret research results with power analysis are also illustrated step by step with examples. The examples are expected to help researchers involved in IEQ research to better understand power analysis, to plan their own studies, and to evaluate research results. A priori power analysis can be used to figure out the proper sample size, and therefore provides an efficient method of controlling statistical power before a study is actually conducted, while post hoc power analysis is important when looking at the results of a study. A non-significant result with high statistical power does suggest that the research hypothesis is false, but a non-significant result with low statistical power is inconclusive. As to statistically significant results, the critical question is whether the effect size is large enough for the results to have practical implications.

Acknowledgement

The project is financially supported by the National Natural Science Foundation of China (No. 50878125). The authors also would like to thank the anonymous reviewers for their useful and valuable comments on this paper.

Appendix A

Hypothesis testing

Hypothesis testing is the use of statistics to determine the probability that a given hypothesis is true. The usual process of hypothesis testing consists of five steps [29].
L. Lan, Z. Lian / Building and Environment 45 (2010) 1202–1213 1211

sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
(1) Formulate the null hypothesis H0 (commonly, that the obser-
ðn1 1ÞS21 þ ðn2  1ÞS22
vations are the result of pure chance) and the alternative Spooled ¼ (A2)
hypothesis H1 (commonly, that the observations show a real n1 þ n2  2
effect combined with a component of chance variation). This is where x ¼ group mean, S ¼ standard deviation, n ¼ number of
important as mis-stating the hypotheses will muddy the rest of subjects, subscripts 1 and 2 refer to the two groups.
the process. The numerical value of the t-statistic is proportional to the
(2) Consider the assumptions being made in doing the test and probability that the difference between means is statistically
identify a test statistic that can be used to assess the truth of significant. The larger the t-value, the more likely the difference
the null hypothesis; for example, assumptions about the between means is significant. This test is used only when it can be
statistical independence or about the form of the distribu- assumed that the two distributions have the same variance. When
tions of the observations. This is equally important as invalid this assumption is seriously violated, Welch’s t-test should be used.
assumptions will mean that the results of the test are invalid. When there is only one sample that has been tested twice
(3) Compute the P-value, which is the probability, assuming that (repeated measures) or when there are two samples that have been
the null hypothesis is true, of observing a result at least as matched or ‘‘paired’’, the paired t-test is used. In the paired t-test,
extreme as the test statistic. The smaller the P-value indicates the differences between all pairs must be calculated [30,31]. The
the stronger evidence against the null hypothesis. average and standard deviation of those differences are then used
(4) Compare the P-value to an acceptable significance value to calculate the t-statistic.
a (sometimes called an alpha value). Popular levels of signifi-
cance are 5% (0.05) and 1% (0.01).
(5) Decide to either fail to reject the null hypothesis or reject it in Analysis of variance (ANOVA)
favor of the alternative. If P  a, that the observed effect is
statistically significant, the null hypothesis is rejected, and the The statistical procedure for testing differences among the
alternative hypothesis is valid. means of more than two groups is called the analysis of variance
(ANOVA) [15]. There are several types of ANOVA depending on the
When a researcher makes a directional hypothesis, the null number of treatments and the way they are applied to the subjects
hypothesis is also, in a sense, directional. For example, if the in the experiment. Two commonly used are one-way ANOVA and
research hypothesis is m1 > m2, then the null hypothesis is m1  m2. one-way ANOVA for repeated measures. The fixed effects one-way
Thus, to reject the null hypothesis, values of the test statistic had to ANOVA test can be viewed as an extension of the two group t-test
fall into one specified tail of its sampling distribution. For this for a difference of means to more than two groups, and is typically
reason, the test of a directional hypothesis is called one-tailed test. A used to test for differences among at least three groups. One-way
directional hypothesis only considers one tail (the other tail is ANOVA for repeated measures is used when the subjects are sub-
ignored as irrelevant to H1), thus all of a can be placed in that one jected to repeated measures; this means that the same subjects are
tail. When a research predicts an effect but does not predict used for each treatment.
a particular direction for the effect, it is called a non-directional The null hypothesis in an ANOVA is that the several populations
hypothesis. To test the significance of a non-directional hypothesis, being compared all have the same mean. The fundamental tech-
the possibility that the sample could be extreme at either tail of the nique of ANOVA is a partitioning of the total sum of squares (SS)
comparison distribution has to be taken into account. Thus, it is into components related to the effects used in the model [29]. For
called a two-tailed test. A two-tailed test requires us to consider example, for a simplified ANOVA with one type of treatment at
both sides of the H0 distribution, so we split a and place half in each different levels, the total sum of squares can be divided into within-
tail. group SS and between-group SS:
There are two kinds of decision errors in hypothesis testing: Type 1 error and Type 2 error. You make a Type 1 error if you reject the null hypothesis when in fact the null hypothesis is true. The significance value α indicates the chance of making a Type 1 error. A Type 2 error is the error of failing to reject the null hypothesis when it is in fact not true. The probability of making a Type 2 error is called β.
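Both error rates can be checked empirically by simulation. The short Monte Carlo sketch below (Python with NumPy and SciPy; the group size, effect magnitude, and number of simulations are arbitrary illustrative choices) estimates the Type 1 error rate when the null hypothesis is true and the Type 2 error rate β when a real difference exists.

```python
# Monte Carlo estimates of the Type 1 and Type 2 error rates of a t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, n_sim = 0.05, 20, 5000

type1 = type2 = 0
for _ in range(n_sim):
    a = rng.normal(0.0, 1.0, n)
    b_null = rng.normal(0.0, 1.0, n)   # null true: identical populations
    b_real = rng.normal(0.8, 1.0, n)   # null false: true difference of 0.8 SD
    if stats.ttest_ind(a, b_null).pvalue <= alpha:
        type1 += 1                     # a true null was rejected
    if stats.ttest_ind(a, b_real).pvalue > alpha:
        type2 += 1                     # a real effect was missed

print(f"estimated Type 1 error rate: {type1 / n_sim:.3f} (close to alpha)")
print(f"estimated Type 2 error rate (beta): {type2 / n_sim:.3f}")
```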
The t-test

The t-tests, especially the unpaired and paired two-sample t-tests, are probably the most commonly used test statistics for hypothesis testing. In simple terms, the t-test compares the actual difference between two means in relation to the variation in the data. The unpaired t-test is used when two separate independent and identically distributed samples are obtained, one from each of the two populations being compared. The t-statistic to test whether the means are different can be calculated as follows [15]:

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_{pooled}\sqrt{1/n_1 + 1/n_2}}$  (A1)

where $\bar{x}_1$ and $\bar{x}_2$ are the two sample means and $s_{pooled}$ is the pooled standard deviation of the two samples (see equation (A7)). The obtained t value is compared with a critical value of the t distribution to judge whether the difference between means is significant. This test is used only when it can be assumed that the two distributions have the same variance. When this assumption is seriously violated, Welch's t-test should be used. When there is only one sample that has been tested twice (repeated measures) or when there are two samples that have been matched or "paired", the paired t-test is used. In the paired t-test, the differences between all pairs must be calculated [30,31]. The average and standard deviation of those differences are then used to calculate the t-statistic.
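All three variants map directly onto standard statistical routines. The sketch below (Python/SciPy with invented scores, shown only to illustrate the distinctions just described) runs the pooled-variance t-test, Welch's t-test, and the paired t-test on the same numbers.

```python
# Unpaired (pooled-variance), Welch's, and paired t-tests on invented data.
from scipy import stats

cond_1 = [3.1, 2.8, 3.4, 3.0, 2.9, 3.3, 3.2, 2.7]   # scores under condition 1
cond_2 = [2.6, 2.5, 3.0, 2.8, 2.4, 2.9, 2.7, 2.3]   # scores under condition 2

# Unpaired t-test assuming equal variances, as in equation (A1)
print(stats.ttest_ind(cond_1, cond_2, equal_var=True))

# Welch's t-test, for seriously unequal variances
print(stats.ttest_ind(cond_1, cond_2, equal_var=False))

# Paired t-test: treat the lists as the same subjects measured twice,
# so the statistic is based on the within-pair differences
print(stats.ttest_rel(cond_1, cond_2))
```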
Analysis of variance (ANOVA)

The statistical procedure for testing differences among the means of more than two groups is called the analysis of variance (ANOVA) [15]. There are several types of ANOVA, depending on the number of treatments and the way they are applied to the subjects in the experiment. Two commonly used types are one-way ANOVA and one-way ANOVA for repeated measures. The fixed-effects one-way ANOVA test can be viewed as an extension of the two-group t-test for a difference of means to more than two groups, and is typically used to test for differences among at least three groups. One-way ANOVA for repeated measures is used when the subjects are subjected to repeated measures; this means that the same subjects are used for each treatment.

The null hypothesis in an ANOVA is that the several populations being compared all have the same mean. The fundamental technique of ANOVA is a partitioning of the total sum of squares (SS) into components related to the effects used in the model [29]. For example, for a simplified ANOVA with one type of treatment at different levels, the total sum of squares can be divided into within-group SS and between-group SS:

$SS_{Total} = SS_{Within} + SS_{Between}$  (A3)

The number of degrees of freedom (df) can be partitioned in a similar way; it specifies the chi-square distribution which describes the associated sums of squares:

$df_{Total} = df_{Within} + df_{Between}$, with $df_{Between} = r - 1$ and $df_{Within} = n - r$  (A4)

where r is the number of groups, n is the total number of cases, and the subscripts Between and Within refer to between-group and within-group, respectively. The F-test is then used for comparisons of the components of the total deviation:

$F = \dfrac{\text{between-group mean square}}{\text{within-group mean square}} = \dfrac{SS_{Between}/df_{Between}}{SS_{Within}/df_{Within}}$  (A5)
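To connect equations (A3)–(A5) with practice, the sketch below computes the sums of squares, degrees of freedom, and F ratio by hand for three invented groups and checks the result against SciPy's built-in one-way ANOVA (again, Python is used only as an illustrative tool).

```python
# One-way ANOVA: SS partition (A3), df partition (A4), and F ratio (A5),
# verified against SciPy. The three groups are invented for illustration.
import numpy as np
from scipy import stats

groups = [np.array([5.1, 4.8, 5.5, 5.0]),
          np.array([5.9, 6.2, 5.7, 6.0]),
          np.array([4.5, 4.9, 4.4, 4.7])]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()
r, n = len(groups), all_obs.size

ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_between, df_within = r - 1, n - r                     # equation (A4)

F = (ss_between / df_between) / (ss_within / df_within)  # equation (A5)
print(f"F = {F:.3f}")
print(stats.f_oneway(*groups))  # same F statistic, with its P-value
```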
Effect size

Effect size is a measure of the difference between population or sample means [15]. Put another way, effect size measures the difference between the true value and the value specified by the null hypothesis, and hence indicates whether the difference is practically important. Standardized effect size is the name given to a family of indices that measure the magnitude of a treatment effect; it is very important because it allows us to compare the magnitude of experimental treatments from one experiment to another [15]. In essence, a standardized effect size is the difference between two means divided by the standard deviation of the two conditions. It is this division by the standard deviation that enables us to compare effect sizes across experiments. Stated as a formula:

$q = (\mu_1 - \mu_2)/\sigma$  (A6)

In equation (A6), σ refers to the standard deviation of the two populations, assuming that both populations have the same standard deviation. In practice, the pooled standard deviation $s_{pooled}$ is commonly used [10]. The pooled standard deviation is the square root of the weighted average of the squared standard deviations:

$s_{pooled} = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$  (A7)

where s1 and s2 are the standard deviations of the two populations and n1 and n2 are the numbers of subjects in each group, respectively. When the two standard deviations and numbers of subjects are similar, this root mean square will not differ much from the simple average of the two variances.
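Computed from sample data, equations (A6) and (A7) amount to only a few lines. The sketch below (Python/NumPy; the two samples are invented) estimates the standardized effect size using the pooled standard deviation.

```python
# Standardized effect size from two samples, using equations (A6) and (A7).
import numpy as np

x1 = np.array([26.1, 25.4, 26.8, 25.9, 26.3, 25.7])
x2 = np.array([24.9, 24.2, 25.1, 24.6, 25.3, 24.4])
n1, n2 = x1.size, x2.size

s1, s2 = x1.std(ddof=1), x2.std(ddof=1)   # sample standard deviations
s_pooled = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))  # (A7)

effect_size = (x1.mean() - x2.mean()) / s_pooled   # sample estimate of (A6)
print(f"standardized effect size = {effect_size:.2f}")
```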
Statistical power

In hypothesis testing, great attention is given to significance, or the issue of whether a Type 1 error α has been made (that the null hypothesis was mistakenly rejected and some effect was being assumed from the results that in fact did not exist). Much less attention is given to the possibility of a Type 2, or β, error (that the null hypothesis was mistakenly not rejected and a real effect was ignored because of inconclusive results – indeed, often treated as nonexistent). When statistically significant differences are calculated, an α value is set, which establishes the risk of a Type 1 error. The statistical power of a study is defined as 1 − β, that is, the probability that the study will produce a statistically significant result if the research hypothesis is true [2]. For example, if β is 0.1, then there is a 10% chance that the two groups really are different even though the statistical test suggests that the mean values for the properties describing the groups are not different; 1 − β = 0.9, so the test power is 0.9. This means that the statistical test has a 90% probability of concluding that there is a difference between groups when a difference really exists.

To be useful, a test must have reasonably small probabilities of both Type 1 and Type 2 errors. The Type 1 error is kept small by choosing a small α value as the significance level. The Type 2 error can then be controlled through the statistical power (1 − β) of the study. If the statistical power is large, the probability of a Type 2 error is small, and the test is a useful one. Popular levels of the α value are 5% (0.05) and 1% (0.01). As to the acceptable level of statistical power, a widely used rule is that a study should have a statistical power of at least 80% to be worth doing [9]. Obviously, the more statistical power the better; however, the costs of greater statistical power, such as studying more people, often put even 80% power beyond a researcher's reach.

Five different types of power analysis can be distinguished with respect to the available resources, the actual phase of the research process, and the specific research question [10]. A priori power analysis is done before a study takes place. In a priori power analysis, sample size N is computed as a function of the required power level (1 − β), the pre-specified significance level α, and the population effect size to be detected with probability 1 − β. In contrast to a priori power analysis, post hoc power analysis often makes sense after a study has already been conducted. In post hoc analysis, 1 − β is computed as a function of α, the population effect size parameter, and the sample size N used in the study. In compromise power analysis, both α and 1 − β are computed as functions of the effect size, the sample size N, and the error probability ratio q = β/α. To illustrate, setting q to 1 would mean that the researcher prefers balanced Type 1 and Type 2 error risks (α = β), whereas a q of 4 would imply that β = 4α. In sensitivity analysis, the critical population effect size is computed as a function of α, 1 − β, and sample size N. Finally, criterion analysis computes α (and the associated decision criterion) as a function of 1 − β, the effect size, and a given sample size.
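The first two of these analyses can be reproduced with general-purpose statistical software. The sketch below uses Python's statsmodels package as an alternative to the G*Power program [10,11] referred to in this paper; the medium effect size of 0.5 and the other settings are arbitrary example values, not recommendations.

```python
# A priori and post hoc power analysis for an unpaired two-group t-test,
# using statsmodels as an alternative to G*Power [10,11].
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# A priori: sample size per group needed to detect a medium effect
# (standardized effect size 0.5) with alpha = 0.05 and power = 0.80
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"a priori: about {n_per_group:.0f} subjects per group are required")

# Post hoc: power achieved by a study that actually used 20 subjects
# per group, for the same effect size and alpha
achieved = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=20)
print(f"post hoc: achieved power = {achieved:.2f}")
```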
References

[1] Lan L, Lian ZW, Pan L, Ye Q. Neurobehavioral approach for evaluation of office workers' productivity: the effects of room temperature. Building and Environment 2009;44:1578–88.
[2] Bakó-Biró Z, Wargocki P, Weschler CJ, Fanger PO. Effects of pollution from personal computers on perceived air quality, SBS symptoms and productivity in office. Indoor Air 2004;14:178–87.
[3] Tham KW, Willem HC. Room air temperature affects occupants' physiology, perceptions and mental alertness. Building and Environment 2010;45:40–4.
[4] Zhang H, Arens E, Kim DE, Buchberger E, Bauman F, Huizenga C. Comfort, perceived air quality, and work performance in a low-power task-ambient conditioning system. Building and Environment 2010;45:29–39.
[5] Tanabe S, Nishihara N. Productivity and fatigue. Indoor Air 2004;14(Suppl. 7):126–33.
[6] Rösser N, Berger K, Vomhof P, Knecht S, Breitenstein C, Flöel A. Lack of improvement in odor identification by levodopa in humans. Physiology & Behavior 2008;93:1024–9.
[7] Rawson ES, Lieberman HR, Walsh TM, Zuber SM, Harhart JM, Matthews TC. Creatine supplementation does not improve cognitive function in young adults. Physiology & Behavior 2008;95:130–4.
[8] Baguley T. Understanding statistical power in the context of applied research. Applied Ergonomics 2004;35:73–80.
[9] Cohen J. Statistical power analysis for the behavioural sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
[10] Faul F, Erdfelder E, Lang AG, Buchner A. G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods 2007;39:175–91.
[11] Mayr S, Erdfelder E, Buchner A, Faul F. A short tutorial of GPower. Tutorials in Quantitative Methods for Psychology 2007;3(2):51–9.
[12] Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. The American Statistician 2001;55:19–23.
[13] Zumbo BD, Hubley AM. A note on misconceptions concerning prospective and retrospective power. The Statistician 1998;47:385–8.
[14] Cohen J. The statistical power of abnormal-social psychological research: a review. Journal of Abnormal & Social Psychology 1962;65:145–53.
[15] Aron A, Aron EN, Coups E. Statistics for psychology. 4th ed. New York: Prentice Hall; 2006.
[16] Lorsch HG, Ossama AA. The impact of the building indoor environment on occupant productivity – part 1: recent studies, measures, and costs. ASHRAE Transactions 1994;100(2):741–9.
[17] Rosnow RL, Rosenthal R. Computing contrasts, effect sizes, and counternulls on other people's published data: general procedures for research consumers. Psychological Methods 1996;1:331–40.
[18] Thalheimer W, Cook S. How to calculate effect sizes from published research articles: a simplified methodology. Available from: http://work-learning.com/effect_sizes.htm; 2002 [accessed 21.04.09].
[19] Kahl JK. Room temperature and task effects on arousal, comfort and performance. Journal of Undergraduate Research 2005;8:1–5.
[20] Raymore LA, Barber BL, Eccles JS. Leaving home, attending college, partnership and parenthood: the role of life transition events in leisure pattern stability from adolescence to young adulthood. Journal of Youth and Adolescence 2001;30:197–223.
[21] Flöel A, Hummel F, Breitenstein C, Knecht S, Cohen LG. Dopaminergic effects on encoding of a motor memory in chronic stroke. Neurology 2005;65:472–4.
[22] Lan L, Lian ZW, Liu WW, Liu YM. Investigation of gender difference in thermal comfort for Chinese people. European Journal of Applied Physiology 2008;102:471–80.
[23] Yao Y, Lian ZW, Liu WW, Jiang CX, Liu Y, Lu H. Heart rate variation and electroencephalograph – the potential physiological factors for thermal comfort study. Indoor Air 2009;19:93–101.
[24] Lan L, Lian ZW. Use of neurobehavioral tests to evaluate the effects of indoor environment quality on productivity. Building and Environment 2009;44(11):2208–17.
[25] Clements-Croome DJ, Kaluarachchi Y. Assessment and measurement of productivity. In: Clements-Croome DJ, editor. Creating the productive workplace. London: Taylor & Francis; 2001. p. 127–65.
[26] Thomas L, Krebs CJ. A review of statistical power analysis software. Bulletin of the Ecological Society of America 1997;78:126–39.
[27] Box GEP, Hunter JS, Hunter WG. Statistics for experimenters: design, innovation, and discovery. 2nd ed. New York: Wiley-Interscience; 2005.
[28] Tsutsumi H, Tanabe S. Effect of humidity on human comfort and productivity after step changes from warm and humid environment. Building and Environment 2007;42:4034–42.
[29] Wang RX. Mathematical statistics. Xi'an: Xi'an Jiao Tong University Press; 2006.
[30] Zimmerman DW. A note on interpretation of the paired-samples t-test. Journal of Educational and Behavioral Statistics 1997;22(3):349–60.
[31] David HA, Gunnink JL. The paired t-test under artificial pairing. The American Statistician 1997;51(1):9–12.