
Hypothesis Testing and Confidence Intervals

T R Konold and X Fan, University of Virginia, Charlottesville, VA, USA


© 2010 Elsevier Ltd. All rights reserved.

Hypothesis testing involves estimating the probability (p) of observing a sample statistic (e.g., sample mean) that is equal to or more extreme than what is obtained from a sample, assuming that the sample is drawn from a population with a hypothesized population parameter value (e.g., population mean, μ). This article provides an overview of hypothesis testing through both point and interval estimation approaches. Although we use a population mean in some examples, the general principles discussed here extend to situations involving other sample statistics (multiple sample means, variance, proportion, correlation coefficient, etc.). Sampling distribution, sampling error, point estimate, interval estimate, statistical errors, and statistical power are among the topics discussed below.

Research Questions and Hypotheses

Research questions and hypotheses are usually based on a researcher's educated guess or intuition grounded in a good understanding of the phenomenon of interest. For example, a teacher might ponder whether a new instructional approach has an effect on student achievement as measured on a state-wide assessment of learning (the research question). Based on his/her pedagogical content knowledge, the teacher may generate a research hypothesis that this new instructional approach does impact student achievement. Although this research hypothesis does not indicate the direction of influence, such qualifications can easily be incorporated to reflect whether the researcher believes that this instructional approach will increase or decrease measured achievement. In general, a research hypothesis reflects a researcher's predictions, and it directly shapes the statistical hypotheses to be empirically evaluated.

Statistical Hypothesis

Null Hypothesis

In order to test a research hypothesis, it is necessary to translate it into a statistical hypothesis. A statistical hypothesis is a numerical statement regarding a hypothesized population parameter. For example, knowing that the average state-wide assessment score for the general student population is 72 raw score points, a researcher might state the null hypothesis (H0) as

H0: μ = 72

The H0 is often stated to reflect the general state of affairs, or the condition of no difference. In hypothesis testing, assuming that the null hypothesis is true, we evaluate the probability of observing a sample statistic (e.g., sample mean) equal to or more extreme than what was obtained from our sample drawn from the hypothesized population with the population value (population parameter) stated in the H0 (i.e., μ = 72). In most substantive applications, researchers hope to demonstrate that this probability is small, and that the observed sample estimate is unlikely to have arisen from a population as hypothesized in the H0. Although this example concerns a hypothesis about a population mean μ, hypotheses concerning any population parameters (e.g., variance, correlation coefficient, and proportion) are essentially the same. So we may express H0 in a generic form:

H0: θ = K

where θ is the population parameter under investigation, and K is the hypothesized value of that parameter.

Directional and Nondirectional Alternative Hypotheses

Hypothesis testing involves two statistical hypotheses. The first is the null hypothesis (H0) as described above. For each H0, there is an alternative hypothesis (Ha) that will be favored if the null hypothesis is found to be statistically not viable. The Ha can be either nondirectional or directional, as dictated by the research hypothesis. For example, if a researcher only believes the new instructional approach will have an impact on student test scores, but is unsure whether the effect will be positive or negative, the null and alternative hypotheses would be

H0: μ = 72
Ha: μ ≠ 72

Here, Ha reflects the researcher's uncertainty regarding the directionality, and it allows for a statistical test that considers both possibilities: that the new instructional approach could increase test scores or decrease test scores. This is commonly referred to as a nondirectional alternative hypothesis, and is also referred to as a two-tailed test for reasons that are described below.

A directional alternative hypothesis, on the other hand, is useful to accommodate the researcher's prediction that, for example, the new instructional approach will decrease
test scores (Ha: μ < 72) or will increase test scores (Ha: μ > 72). A directional alternative hypothesis is often referred to as a one-tailed test, as described below. It is important to note, however, that for every specified H0 there will be a single Ha that may assume one of the three forms:

Ha: θ ≠ K
Ha: θ < K
Ha: θ > K

Sample, Population, and Inferential Statistics

A sample is a subset of a population of interest that is defined by the researcher. For example, one could conceivably collect math achievement data on all fifth graders in the United States. However, given limited resources, it is more realistic and economical to work with a subset of this group (i.e., a sample), with the goal of generalizing the findings to the total population. It is possible, or even very common, that the population of interest is a statistical population that does not really exist. For example, our educator interested in the effectiveness of a new instructional approach may statistically compare two student samples: one sample receiving this new approach, and the other receiving the business-as-usual approach. In this situation, there are two hypothetical statistical populations: one population of students taught under this new approach, and another population of students taught under the business-as-usual instructional approach. In reality, we do not really have these two populations.

A sample statistic θ̂ is a measure of a sample characteristic (sample mean, proportion, correlation, etc.), whereas a measure of the corresponding population characteristic is a population parameter (θ). Researchers are typically interested in understanding the characteristics of a population, but it is often impossible or impractical to obtain data from the whole population (e.g., for lack of resources to do so, or because the statistical population is hypothetical). Inferential statistics are useful for estimating population characteristics (i.e., parameters), and for generalizing the sample findings to the population from which the sample was drawn.

Hypothesis testing is at the center of inferential statistics. Both point estimation and interval estimation are commonly used first steps in the hypothesis testing process. In point estimation, a single population parameter value is hypothesized and used in hypothesis testing. By contrast, interval estimation serves as a useful supplement to point estimation because it provides a likely range of values for the unknown population parameter. The key to both point and interval estimation is in estimating the sampling variability of a given sample statistic.

Sampling Distribution

In hypothesis testing, the sampling variability (i.e., sampling error) of a sample statistic θ̂ is the foundation for estimating the probability of observing a sample statistic equal to or more extreme than what is obtained under the true null hypothesis. A statistic's sampling distribution provides the estimation for sampling error. For example, assume we are interested in the average performance on a math achievement exam (parameter) of all fifth graders in the United States (population). Our best estimate of the unknown population mean parameter (μ) would be the sample average (X̄) based on a random sample of n observations from this population. If this process of drawing random samples of size n from the same population were repeated many times, the means from all the samples would vary, and these means themselves would form a distribution, which is called a sampling distribution of the mean. This sampling distribution concept also extends to other sample statistics (e.g., sample variance, proportion, and correlation).

This sampling distribution captures the sample-to-sample variability of a sample statistic. In the case of X̄, theoretically, the average of all possible sample means, or the expected value of the statistic, E(X̄), is equal to the population mean (i.e., μ). Statistics with this property, that is, E(θ̂) = θ, are referred to as unbiased estimators.

Standard error

Sample-to-sample variability of a statistic, or sampling error, can be quantified by the standard error. In the case of sample means as described above, the standard error of the mean quantifies the variability of the sampling distribution of the mean. When samples are randomly selected from a population, the resulting distribution will be approximately normally distributed for a reasonable sample size n (see the subsection titled 'Central limit theorem' below). When the population standard deviation (σ) is known, the standard deviation of the sampling distribution of means, usually referred to as the standard error of the mean, is:

σX̄ = σ/√n

As is obvious, σX̄ is a function of the sample size (n). As n increases, σX̄ decreases to its lower limit of 0 (i.e., n → ∞, σX̄ → 0). Likewise, as sample size decreases, σX̄ increases to its upper limit of the population standard deviation σ (i.e., n → 1, σX̄ → σ).
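The relation σX̄ = σ/√n can be illustrated with a short simulation. The sketch below (the population mean of 72 and standard deviation of 8 are taken from this article's running example; the sample sizes and replication count are arbitrary choices) draws repeated random samples and compares the empirical standard deviation of the sample means against σ/√n:

```python
import math
import random

random.seed(0)

def se_of_mean(pop_mean, pop_sd, n, reps=10_000):
    """Empirical standard error: the SD of sample means over repeated sampling."""
    means = []
    for _ in range(reps):
        sample = [random.gauss(pop_mean, pop_sd) for _ in range(n)]
        means.append(sum(sample) / n)
    grand = sum(means) / reps
    return math.sqrt(sum((m - grand) ** 2 for m in means) / (reps - 1))

sigma = 8.0
for n in (10, 30):
    print(f"n={n}: empirical SE = {se_of_mean(72.0, sigma, n):.3f}, "
          f"sigma/sqrt(n) = {sigma / math.sqrt(n):.3f}")
```

For n = 10 both values should be near 2.53, and for n = 30 near 1.46, with simulation error of a few hundredths, illustrating both the formula and the shrinking sampling error as n grows.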
In general, over repeated sampling, sample statistics based on a larger sample size have less variability (i.e., less sampling error) than those based on a smaller sample size.

Central limit theorem

Several properties of sampling distributions that have important implications for hypothesis testing are captured in the central limit theorem (CLT). In a nutshell, the CLT states the following. First, as the size (n) of the samples drawn from a population increases, the standard error of the mean decreases, leading to less variability of sample means and smaller sampling error. Figure 1 presents two sampling distributions of the mean, one based on n = 10, and the other based on n = 30. As shown in Figure 1, the distribution based on a smaller sample size (n = 10) spreads out more than the one based on a larger sample size (n = 30). Second, the shape of the sampling distribution of the mean becomes increasingly normal as the sample size increases. This is true regardless of the shape of the parent population from which the samples are drawn. As a result, the sampling distribution of X̄ can generally be considered as being normally distributed for samples of size n = 30 or greater, even though the parent population may not be normally distributed.

Figure 1 Two sampling distributions of the mean from the same population: n = 10 and n = 30.

Sampling distribution as probability distribution

Hypothesis testing involves estimating the probability of observing a sample statistic (θ̂) equal to or more extreme than what is obtained from a sample, assuming a true H0. For this purpose, the sampling distribution of the statistic developed under H0 serves as the probability distribution. Under this probability distribution, the closer the observed sample statistic (e.g., X̄) is to the population parameter (e.g., μ = K) under H0, the more likely it is that the difference between X̄ and μ may have been the result of sampling variation (i.e., sampling error). Sample statistic values near the tails of the sampling distribution represent a greater difference between the sample statistic and the population parameter under H0. These values occur less frequently in repeated sampling, and consequently, the probability of observing sample statistic values near the tails is smaller. Under H0, if this probability is judged to be smaller than a predetermined threshold probability level (denoted α), we will conclude that H0 is likely not true; if it were, it would be very unlikely to observe the sample statistic. Conventionally, α = 0.05 is often used as such a threshold, leading to the following decision rules with regard to H0:

If p(θ̂ | H0) ≤ α = 0.05 → reject H0
If p(θ̂ | H0) > α = 0.05 → fail to reject H0

This rule implies that when the observed sample statistic value becomes so rare under the H0 sampling distribution (e.g., p < 0.05), we would be willing to conclude that the value specified in the H0 is not a plausible population value for the observed sample statistic. It should be noted that this rule does not suggest that such sample statistic values cannot occur under H0; it only states that values so extreme occur less frequently under H0. Moreover, there is nothing magical about the conventional α = 0.05 decision rule: researchers may choose other values for evaluating H0, based on their considerations about type I and type II errors (described below). It is important to note that, given the observed sample statistic, hypothesis testing is unable to address the question of whether H0 is true. The process only involves the determination of the probability of obtaining sample statistic values equal to or more extreme than the sample result under the assumption that H0 is true (Cohen, 1994).

Test Statistics and Probability Estimates

In hypothesis testing, to assess the probability of observing values more extreme than a given sample statistic under H0, we translate the observed sample statistic value into a test statistic. For example, the test statistic for evaluating a single sample mean (X̄) when the population standard deviation (σ) is known is:

zobs = (X̄ − μ)/σX̄
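As a numerical sketch of this statistic (the sample values below are invented for illustration), zobs and its two-tailed probability can be computed from the standard normal CDF:

```python
import math

def z_obs(xbar, mu0, sigma, n):
    """z test statistic for a single sample mean, population SD known."""
    se = sigma / math.sqrt(n)      # standard error of the mean
    return (xbar - mu0) / se

def two_tailed_p(z):
    """P(|Z| >= |z|) under the standard normal distribution."""
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

# Invented illustration: H0: mu = 72, known sigma = 8, n = 64, observed mean 74
z = z_obs(74, 72, 8, 64)             # (74 - 72) / (8/8) = 2.0
print(z, round(two_tailed_p(z), 4))  # 2.0 0.0455 -> p < 0.05, reject H0
```

The printed probability is exactly the tail area one would otherwise read from a normal distribution table.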
Figure 2 Three t distributions with df = 5, df = 15, and df = ∞.

This zobs test statistic represents the distance between the sample mean and the hypothesized population mean in standard deviation units, where the standard deviation is the standard deviation of the sampling distribution of the mean (i.e., the standard error of the mean: σX̄ = σ/√n). A larger absolute value of zobs is further away from the hypothesized population mean. The exact probability of observing a value of zobs or larger can be obtained from a normal distribution table.

Application of the z test statistic requires that the population standard deviation σ be known. When σ is unknown and must be estimated with the sample standard deviation (s), the estimated standard error of the mean is sX̄ = s/√n, and the test statistic becomes:

tobs = (X̄ − μ)/sX̄

In contrast to the single z distribution, there exists a family of t distributions, as defined by degrees of freedom (df) that are based on sample size (df = n − 1). Figure 2 presents three t distributions with different df (df = 5, 15, or ∞). Because the t distributions have different shapes, probabilities beyond a given t value depend not only on the value of t, but also on the df on which that statistic is based. As sample size n increases, t distributions converge on the z distribution. Beyond df = 120, the difference between the z and t distributions is negligible, and a t-test practically becomes a z-test.

Test statistics (e.g., the F statistic, the χ² statistic) other than z and t may be needed as required by a statistical analysis. Regardless of the specific test statistic used in a research situation, the logic and procedure described here for hypothesis testing remain the same.

Decision Rules Based on Test Statistic

The discussion above focused on testing H0 by comparing the probability of obtaining sample statistic values more extreme than a given sample statistic, assuming the true H0, against a predetermined threshold probability level (e.g., α = 0.05). In practice, this is done through a direct comparison of the observed sample test statistic (e.g., zobs, tobs, Fobs) against the critical value of the statistic (e.g., zcv, tcv, or Fcv) as determined by the predetermined threshold probability level α. For example, the critical value of t (i.e., tcv) associated with a given value of α and a given df is readily available from a table of t distributions. By comparing tobs against tcv, we can establish the following decision rule for testing H0:

If |tobs| ≥ |tcv| → reject H0
If |tobs| < |tcv| → fail to reject H0

This decision rule is a direct translation of our previous probability-based decision rule for testing H0:

If p(θ̂ | H0) ≤ α → reject H0
If p(θ̂ | H0) > α → fail to reject H0

For example, our educator researcher previously hypothesized a mean student achievement test score of 72 for a population (H0: μ = 72), and she would like to test this H0 against the nondirectional alternative hypothesis Ha: μ ≠ 72 at α = 0.05. From a random sample of 100 participants, the researcher obtained X̄ = 68 and s = 8 (i.e., sX̄ = s/√n = 0.8). The test statistic is:

tobs = (X̄ − μ)/sX̄ = (68 − 72)/0.8 = −5

As |tobs| = 5 exceeds the tcv [t(.975, df = 99) ≈ 1.98, α = .05], the H0: μ = 72 is rejected, as it would be very unlikely (p < 0.05) to observe sample mean values equal to or more extreme than X̄ = 68 if H0 were true.
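The arithmetic of this worked example can be verified in a few lines; this sketch hard-codes the tabled critical value rather than computing it from a t distribution:

```python
import math

def t_obs(xbar, mu0, s, n):
    """t test statistic when sigma is estimated by the sample SD s."""
    se = s / math.sqrt(n)          # estimated standard error of the mean
    return (xbar - mu0) / se

t = t_obs(68, 72, 8, 100)          # (68 - 72) / 0.8 = -5.0
t_cv = 1.98                        # approx. t(.975, df = 99) from a t table
print(t, abs(t) >= t_cv)           # -5.0 True -> reject H0: mu = 72
```

With |tobs| = 5 far beyond the critical value, the conclusion matches the text; a statistics library (e.g., scipy.stats) could replace the tabled constant with a computed quantile.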
Either a nondirectional or a directional alternative hypothesis may be evaluated. When a nondirectional alternative hypothesis is evaluated (i.e., Ha: μ ≠ K), the overall α (e.g., α = 0.05) is evenly divided into the two tails of the sampling distribution under H0, to accommodate the possibility that the direction of the difference between X̄ and μ can be either positive or negative, as shown in Figure 3 (two solid lines at the tails).

Figure 3 The α at distribution tails for nondirectional (two-tailed) and directional (upper-tail) tests.

When a directional alternative hypothesis is used (e.g., Ha: μ > K), we only consider the critical value at the one tail that corresponds to the hypothesized direction in Ha, and the overall α is located at this tail, as shown in Figure 3 (dashed line at the right tail for Ha: μ > K). Nondirectional and directional alternative hypotheses are often referred to as two-tailed and one-tailed tests, respectively. For a directional (one-tailed) test, because all of α is located at one tail, the critical value of the test statistic will be smaller in absolute value than that in a nondirectional test. For example, for a z-test with α = 0.05, the critical values are ±1.96 for a two-tailed (nondirectional) test, and the critical value for a one-tailed (directional) test (Ha: μ > K) is only 1.65. This makes it easier for a one-tailed test to reject H0, as long as the test statistic is in the hypothesized direction. Because of this, if the test statistic is in the hypothesized direction, the one-tailed test has more statistical power (see the description of statistical power below) than the two-tailed test. When the test statistic is not in the hypothesized direction, however, the test automatically fails to reject the H0.

Confidence Intervals in Hypothesis Testing

The hypothesis testing approach described above is a point-estimate approach that results in a dichotomous decision (i.e., reject or fail to reject H0). On the other hand, the interval-estimation approach provides a range of values within which a population parameter is likely to reside. The importance of using confidence intervals (CIs) in educational and psychological research has been emphasized by the American Psychological Association (APA) task force (Wilkinson and Task Force on Statistical Inference, 1999), and such emphasis has been reflected in the guidelines in the APA publication manual as best practice in reporting empirical results (APA, 2001).

For a hypothesized population value θ, using the sample estimate θ̂, and for a given α, the following CI can be constructed:

100(1 − α)% CI = θ̂ ± T(1 − α/2) sθ̂

where T(1 − α/2) is the appropriate test statistic (e.g., z, t) value at the 100(1 − α/2) percentile point of its distribution, and sθ̂ is the estimated standard error (e.g., sX̄ = s/√n) of the sampling distribution. (This is the classical definition of a CI under a true H0, where critical values from a central distribution (e.g., a central t distribution) are used to construct a symmetrical CI. However, if a true H0 is not assumed, critical values from a noncentral distribution (e.g., a noncentral t distribution) may be used for constructing the CI for the unknown population parameter θ, and this will result in a nonsymmetrical CI. For more details, readers may consult, e.g., Cumming and Finch (2001) and Steiger and Fouladi (1997).) Adding and subtracting the quantity T(1 − α/2) sθ̂ from θ̂ results in the lower and upper limits of a CI (θ̂ − T(1 − α/2) sθ̂, θ̂ + T(1 − α/2) sθ̂) that has a probability of 1 − α of containing the unknown population parameter value θ. If the hypothesized population parameter value under H0 is outside this CI, it leads to the rejection of H0. Conversely, if the CI contains the parameter value under H0, H0 will not be rejected.

Using our previous example about a new instructional method, the researcher obtained n = 100, X̄ = 68, and s = 8 (i.e., sX̄ = s/√n = 0.8). For α = 0.05 and t(.975, df = 99) ≈ 1.98, the 95% CI would be (66.42, 69.58). (The lower limit is 68 − (1.98 × 0.8) = 66.42; the upper limit is 68 + (1.98 × 0.8) = 69.58.) As this CI does not contain 72 under H0, the H0 would be rejected.
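The same interval can be reproduced directly from the formula; a minimal sketch using the numbers of this example (with the tabled critical value t(.975, 99) ≈ 1.98 hard-coded):

```python
import math

def mean_ci(xbar, s, n, t_crit):
    """100(1 - alpha)% CI for a mean: xbar +/- t_crit * s / sqrt(n)."""
    half = t_crit * s / math.sqrt(n)
    return xbar - half, xbar + half

low, high = mean_ci(68, 8, 100, 1.98)  # t(.975, df = 99) ~ 1.98
print(round(low, 2), round(high, 2))   # 66.42 69.58
print(low <= 72 <= high)               # False -> 72 outside the CI, reject H0
```

The membership check on the last line is exactly the CI-based decision rule: H0 is rejected because the hypothesized value 72 lies outside the interval.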
With a CI, we are not only able to evaluate the H0 (point estimation), but also able to estimate a range of probable population parameter values (interval estimation). The value of a CI is that it provides information not only about the sample-based point estimate, but also about the estimation error associated with the statistic. In the social sciences, the resulting estimation error may be large, and this may explain why such intervals are not often reported (Cohen, 1994).

It is important to note that this 95% CI does not indicate that we can be 95% confident that this particular interval contains the population parameter (see Thompson, 2007: 427). Rather, it is interpreted to mean that 95% of similarly constructed intervals will contain the population parameter, while 5% of them will not. This is illustrated in Figure 4 for a hypothetical situation where 50 random samples were drawn from a population with parameter μ, and a 95% confidence interval was constructed from each sample. Out of the 50 CIs, 47 contained the population parameter, and three did not.

Figure 4 Fifty 95% confidence intervals for population parameter μ.

Statistical Errors (Type I and Type II) and Power

In hypothesis testing, two types of statistical errors may occur: rejecting H0 when it is true (type I error), or failing to reject H0 when it is false (type II error). On the other hand, a correct decision is made when a true H0 is not rejected, or when a false H0 is rejected. These possible outcomes in hypothesis testing are shown in Figure 5.

Figure 5 Hypothesis testing errors and correct decisions.

                          Decision based on sample data
  True state of nature    Reject H0                          Fail to reject H0
  H0 true                 Type I error (α)                   Correct decision (1 − α)
  H0 false                Correct decision: power (1 − β)    Type II error (β)

In a hypothesis test, once a decision is made with regard to H0, either a type I error (when H0 is rejected) or a type II error (when H0 is not rejected) is possible, but not both. We cannot know with certainty whether the decision is actually an error. When H0 is rejected, only a type I error is possible. The probability of a type I error is known, because it is directly controlled by the researcher through the choice of the α level (e.g., α = 0.05), which represents the risk we are willing to take in concluding that the H0 is false when it may be true. On the other hand, when H0 is not rejected, only a type II error is possible. Here, the probability of a type II error is unknown, but can be estimated, and is generally not directly controlled by the researcher. In Figure 5, when H0 is true (upper row), the probability of a type I error is α, and the probability of making a correct decision (i.e., not rejecting the true H0) is 1 − α. When H0 is false (lower row), the probability of a type II error is β, and the probability of making a correct decision (i.e., rejecting the false H0) is the power of the statistical test (1 − β).

Figure 6 Type I error, type II error, and statistical power, shown as regions under two overlapping sampling distributions (H0 and Ha) separated by the rejection line.

Figure 6 graphically illustrates two sampling distributions, under H0 and Ha, respectively, as well as the relationships among the probabilities of type I and type II errors and the statistical power (the probability of rejecting a false H0). The sampling distribution under H0 represents all possible values of a statistic, including those in the specified region of α (the shaded area under H0 beyond the vertical dashed rejection line), where a decision to reject H0 would be made. This probability is often set at 0.05 to indicate that there is a 5% chance of rejecting a true H0 (i.e., making a type I error).
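That α fixes the long-run type I error rate can be demonstrated by simulation. This sketch (reusing the article's μ = 72 and σ = 8; the sample size and number of replications are arbitrary) repeatedly tests a true H0 with a two-tailed z-test and counts false rejections:

```python
import math
import random

random.seed(1)

def falsely_rejects(mu0, sigma, n, z_cv=1.96):
    """Draw one sample under a TRUE H0 and report whether the two-tailed
    z-test (alpha = 0.05) wrongly rejects it."""
    xbar = sum(random.gauss(mu0, sigma) for _ in range(n)) / n
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    return abs(z) >= z_cv

reps = 20_000
rate = sum(falsely_rejects(72, 8, 30) for _ in range(reps)) / reps
print(rate)   # close to 0.05: alpha is the long-run type I error rate
```

Shifting the true mean away from 72 while keeping H0: μ = 72 in the same simulation would instead estimate power (1 − β).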
Now consider the sampling distribution under Ha, which overlaps with that under H0. Sample-based test statistics in the type II error, or β, region (the shaded area under Ha on the nonrejection side of the vertical dashed line) would not lead to the rejection of H0, even though these sample statistics may in fact belong to the sampling distribution of Ha. As shown in Figure 6, statistical power (i.e., the probability of rejecting a false H0), denoted by 1 − β, is the area under Ha on the rejection side of the vertical dashed line defined by α.

Education researchers may desire to have a reasonable level of power (e.g., 1 − β ≥ 0.80) for a statistical test. One way to increase statistical power is to increase α (e.g., from 0.05 to 0.10). Imagine the rejection line (vertical dashed line) in Figure 6 sliding toward the center of the H0 distribution to accommodate a higher α level (i.e., a higher probability of making a type I error); this would result in reducing the type II error probability (β) and in increasing statistical power (1 − β). Conversely, decreasing α (e.g., from 0.05 to 0.01; sliding the rejection line away from the center of the H0 distribution) will result in an increase in the type II error probability (β) and a decrease in statistical power. Statistical power can also be increased by using larger sample sizes. Other things being equal, larger sample sizes lead to smaller standard errors of the distributions. Consequently, there will be less overlap between the distribution under H0 and that under Ha, thus a smaller β and more power (i.e., 1 − β is larger). Finally, an increase in statistical power can also come by way of greater treatment effects, or greater separation between the H0 and Ha distributions. Researchers need to understand such interplay among type I error, type II error, and statistical power, and consider the relative cost and consequence of one type of error versus another in a given substantive research application.

See also: Point Estimation Methods with Applications to Item Response Theory Models; Sampling; Statistical Power Analysis; Statistical Significance Versus Effect Size.

Bibliography

APA (American Psychological Association) (2001). Publication Manual of the American Psychological Association, 5th edn. Washington, DC: Author.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist 49, 997–1003.
Cumming, G. and Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement 61, 532–574.
Steiger, J. H. and Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In Harlow, L. L., Mulaik, S. A., and Steiger, J. H. (eds.) What If There Were No Significance Tests?, pp. 221–257. Mahwah, NJ: Erlbaum.
Thompson, B. (2007). Effect sizes, confidence intervals, and confidence intervals for effect sizes. Psychology in the Schools 44, 423–432.
Wilkinson, L. and the Task Force on Statistical Inference, American Psychological Association, Science Directorate (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist 54, 594–604.
