
Hypothesis Testing

A statistical hypothesis is an assumption about a population parameter. This
assumption may or may not be true.

The best way to determine whether a statistical hypothesis is true would be to
examine the entire population. Since that is often impractical, researchers typically
examine a random sample from the population. If sample data are not consistent with
the statistical hypothesis, the hypothesis is rejected.

There are two types of statistical hypotheses.

• Null hypothesis. The null hypothesis, denoted by H0, is usually the hypothesis
that sample observations result purely from chance.

• Alternative hypothesis. The alternative hypothesis, denoted by H1 or Ha, is the
hypothesis that sample observations are influenced by some non-random cause.

For example, suppose we wanted to determine whether a coin was fair and
balanced. A null hypothesis might be that half the flips would result in Heads and half, in
Tails. The alternative hypothesis might be that the number of Heads and Tails would be
very different. Symbolically, these hypotheses would be expressed as

H0: P = 0.5
Ha: P ≠ 0.5

Suppose we flipped the coin 50 times, resulting in 40 Heads and 10 Tails. Given
this result, we would be inclined to reject the null hypothesis. We would conclude, based
on the evidence, that the coin was probably not fair and balanced.
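
The coin example can be checked numerically. The following is a minimal sketch, not part of the original tutorial: it computes an exact two-sided P-value from the binomial distribution, assuming H0: P = 0.5 and independent flips. The function name `binomial_two_sided_p` is hypothetical.

```python
from math import comb


def binomial_two_sided_p(heads: int, flips: int, p: float = 0.5) -> float:
    """Exact two-sided P-value for a binomial count under H0: P = p."""

    def pmf(k: int) -> float:
        # Probability of exactly k heads in `flips` independent flips.
        return comb(flips, k) * p**k * (1 - p) ** (flips - k)

    observed = pmf(heads)
    # Add up every outcome that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        pmf(k) for k in range(flips + 1) if pmf(k) <= observed * (1 + 1e-12)
    )


p_value = binomial_two_sided_p(heads=40, flips=50)
print(f"P-value for 40 heads in 50 flips: {p_value:.2e}")
```

A P-value this small says that a result as lopsided as 40 Heads would be very rare for a fair coin, which is why we would be inclined to reject the null hypothesis.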

Hypothesis Tests

Statisticians follow a formal process to determine whether to reject a null hypothesis,
based on sample data. This process, called hypothesis testing, consists of four steps.

• State the hypotheses. This involves stating the null and alternative hypotheses.
The hypotheses are stated in such a way that they are mutually exclusive. That
is, if one is true, the other must be false.

• Formulate an analysis plan. The analysis plan describes how to use sample
data to evaluate the null hypothesis. The evaluation often focuses on a
single test statistic.

• Analyze sample data. Find the value of the test statistic (mean score,
proportion, t-score, z-score, etc.) described in the analysis plan.

• Interpret results. Apply the decision rule described in the analysis plan. If the
value of the test statistic is unlikely, based on the null hypothesis, reject the null
hypothesis.

Decision Rules

The analysis plan includes decision rules for rejecting the null hypothesis. In
practice, statisticians describe these decision rules in two ways: with reference to a
P-value or with reference to a region of acceptance.

• P-value. The P-value measures how compatible the sample data are with the
null hypothesis: the smaller the P-value, the stronger the evidence against it.
Suppose the test statistic is equal to S. The P-value is the
probability of observing a test statistic at least as extreme as S, assuming the null
hypothesis is true. If the P-value is less than the significance level, we reject the
null hypothesis.

• Region of acceptance. The region of acceptance is a range of values. If the test
statistic falls within the region of acceptance, the null hypothesis is not rejected.
The region of acceptance is defined so that the chance of making a Type I error
is equal to the significance level.

The set of values outside the region of acceptance is called the region of
rejection. If the test statistic falls within the region of rejection, the null
hypothesis is rejected. In such cases, we say that the hypothesis has been
rejected at the α level of significance.

These approaches are equivalent. Some statistics texts use the P-value
approach; others use the region of acceptance approach. In subsequent lessons, this
tutorial will present examples that illustrate each approach.


One-Tailed and Two-Tailed Tests

A test of a statistical hypothesis, where the region of rejection is on only one side
of the sampling distribution, is called a one-tailed test. For example, suppose the null
hypothesis states that the mean is less than or equal to 10. The alternative hypothesis
would be that the mean is greater than 10. The region of rejection would consist of a
range of numbers located on the right side of the sampling distribution; that is, a set of
numbers greater than 10.

A test of a statistical hypothesis, where the region of rejection is on both sides of
the sampling distribution, is called a two-tailed test. For example, suppose the null
hypothesis states that the mean is equal to 10. The alternative hypothesis would be that
the mean is less than 10 or greater than 10. The region of rejection would consist of a
range of numbers located on both sides of the sampling distribution; that is, the region of
rejection would consist partly of numbers that were less than 10 and partly of numbers
that were greater than 10.
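
To make the two kinds of rejection region concrete, here is a small sketch. It assumes the test statistic has been standardized to a z-score and uses an illustrative significance level of 0.05, neither of which is specified in the text above.

```python
from statistics import NormalDist

alpha = 0.05  # illustrative significance level (an assumption)
std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# One-tailed (right tail): the whole of alpha sits in the upper tail.
one_tailed_critical = std_normal.inv_cdf(1 - alpha)

# Two-tailed: alpha is split evenly between the lower and upper tails.
two_tailed_critical = std_normal.inv_cdf(1 - alpha / 2)

print(f"one-tailed: reject H0 if z > {one_tailed_critical:.3f}")
print(f"two-tailed: reject H0 if |z| > {two_tailed_critical:.3f}")
```

At the same significance level, the two-tailed boundary lies further from the center because each tail holds only half of the rejection probability.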


The significance level of a test is the probability that the test will reject
the null hypothesis when the null hypothesis is true. Significance is a property of the
distribution of a test statistic, not of any particular draw of the statistic. The significance
level is usually denoted by the Greek symbol α (lower case alpha). Popular levels of
significance are 5% (0.05), 1% (0.01) and 0.1% (0.001). If a test of significance gives a
p-value lower than the α-level, the null hypothesis is rejected. Such results are
informally referred to as 'statistically significant'. For example, if someone argues that
"there's only one chance in a thousand this could have happened by coincidence," a
0.001 level of statistical significance is being implied. The lower the significance level,
the stronger the evidence required. Choosing a level of significance is an arbitrary task,
but for many applications, a level of 5% is chosen, for no better reason than that it is
conventional.

In some situations it is convenient to express the statistical significance as 1 − α.
In general, when interpreting a stated significance, one must be careful to note what,
precisely, is being tested statistically.

Different α-levels trade off countervailing effects. Smaller levels of α increase
confidence in the determination of significance, but run an increased risk of failing to
reject a false null hypothesis (a Type II error, or "false negative determination"), and so
have less statistical power. The selection of an α-level thus inevitably involves a
compromise between significance and power, and consequently between the Type I error and
the Type II error. More powerful experiments - usually experiments with more subjects
or replications - can obviate this choice to an arbitrary degree.

In some fields, for example nuclear and particle physics, it is common to express
statistical significance in units of "σ" (sigma), the standard deviation of a Gaussian
distribution. A statistical significance of "nσ" can be converted into a value of α via use
of the error function:

α = 1 − erf(n/√2)

The use of σ implicitly assumes a Gaussian distribution of measurement values.
For example, if a theory predicts a parameter to have a value of, say, 100, and one
measures the parameter to be 109 ± 3, then one might report the measurement as a
"3σ deviation" from the theoretical prediction. In terms of α, this statement is equivalent
to saying that "assuming the theory is true, the likelihood of obtaining the experimental
result by coincidence is 0.27%" (since 1 − erf (3/√2) = 0.0027).
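
The conversion from "nσ" to α can be sketched with the error function in Python's standard library. The function name `sigma_to_alpha` is hypothetical, and the sketch assumes a two-sided Gaussian tail, as in the example above.

```python
from math import erf, sqrt


def sigma_to_alpha(n_sigma: float) -> float:
    """Two-sided Gaussian tail probability beyond n_sigma standard deviations."""
    return 1 - erf(n_sigma / sqrt(2))


for n in (1, 2, 3, 5):
    print(f"{n} sigma -> alpha = {sigma_to_alpha(n):.2e}")
```

For n = 3 this gives the 0.27% figure quoted in the example.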

Fixed significance levels such as those mentioned above may be regarded as
useful in exploratory data analyses. However, modern statistical advice is that, where
the outcome of a test is essentially the final outcome of an experiment or other study,
the p-value should be quoted explicitly. And, importantly, it should be quoted whether
or not the p-value is judged to be significant. This is to allow maximum information to be
transferred from a summary of the study into meta-analyses.


The critical value(s) for a hypothesis test is a threshold to which the value of the
test statistic in a sample is compared to determine whether or not the null hypothesis is
rejected.

The critical value for any hypothesis test depends on the significance level at
which the test is carried out, and whether the test is one-sided or two-sided.

The six-step methodology of the Critical Value Approach to hypothesis testing is as
follows.

(Note: The methodology below works equally well for both one-tail and two-tail
hypothesis testing.)

State the Hypotheses

1. State the null hypothesis, H0, and the alternative hypothesis, H1.

Design the Study

2. Choose the level of significance, α, according to the importance of the risk of
committing Type I errors. Determine the sample size, n, based on the resources
available to collect the data.
3. Determine the test statistic and sampling distribution. When the hypotheses involve
the population mean, μ, the test statistic is z when σ is known and t when σ is not
known. These test statistics follow the normal distribution and the t-distribution,
respectively.
4. Determine the critical values that divide the rejection and non-rejection regions.
Note: For ethical reasons, the level of significance and critical values should be
determined prior to conducting the test. The test should be designed so that the
predetermined values do not influence the test results.

Conduct the Study

5. Collect the data and compute the test statistic.

Draw Conclusions

6. Evaluate the test statistic and determine whether or not to reject the null hypothesis.
Summarize the results and state a managerial conclusion in the context of the problem.

A phone industry manager thinks that customer monthly cell phone bills have increased
and now average over $52 per month. The company asks you to test this claim. The
population standard deviation, σ, is known to be equal to 10 from historical data.

The Hypotheses
1. H0: μ ≤ 52
   H1: μ > 52

Study Design
2. After consulting with the manager and discussing error risk, we choose a level of
significance, α, of 0.10. Our resources allow us to sample 64 cell phone bills.
3. Since our hypothesis involves the population mean and we know the population
standard deviation, our test statistic is z and follows the normal distribution.
4. In determining the critical value, we first recognize this test as a one-tail test since the
null hypothesis involves an inequality, ≤. Therefore the rejection region is entirely on the
side of the distribution greater than the historic mean: the right tail.
We want to determine a z-value for which the area to the right of that value is 0.10, our
α. We can use the cumulative normal distribution table (which gives areas to the left of
the z-value) and find the z-value whose cumulative area is 0.90. Interpolating in the table
gives approximately 1.285 (more precisely, about 1.2816). This is our critical value.

The Study
5. We conduct our study and find that the mean of the 64 sample cell phone bills is
53.1. We compute the test statistic, z = (xbar-μ)/(σ/√n) = (53.1-52)/(10/√64) = 0.88.
6. Since 0.88 is less than the critical value of 1.285, we do not reject the null hypothesis.
We report to the company that, based on our testing, there is not sufficient evidence that the
mean cell phone bill exceeds $52 per month.
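
The six steps of the cell phone example can be reproduced in a few lines. This is a sketch using only the numbers given above. It computes the critical value from the standard normal distribution rather than a printed table, so it comes out slightly below the interpolated 1.285. The P-value is an extra quantity not reported in the example.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 52.0         # hypothesized mean under H0
sigma = 10.0        # known population standard deviation
n = 64              # sample size
sample_mean = 53.1  # observed mean of the 64 bills
alpha = 0.10        # chosen significance level

# Test statistic: how many standard errors the sample mean lies above mu_0.
z = (sample_mean - mu_0) / (sigma / sqrt(n))

# Right-tail critical value: the z-value with area alpha to its right.
critical_value = NormalDist().inv_cdf(1 - alpha)

# P-value for the right-tailed test.
p_value = 1 - NormalDist().cdf(z)

reject_h0 = z > critical_value
print(f"z = {z:.2f}, critical value = {critical_value:.4f}, p = {p_value:.4f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

Both decision rules agree here: z is below the critical value, and the P-value is above α.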


When a statistic is significant, it simply means that you are very sure that the
statistic is reliable. It doesn't mean the finding is important or that it has any decision-
making utility.

For example, suppose we give 1,000 people an IQ test, and we ask if there is a
significant difference between male and female scores. The mean score for males is 98
and the mean score for females is 100. We use an independent-groups t-test and find
that the difference is significant at the .001 level. The big question is, "So what?" The
difference between 98 and 100 on an IQ test is a very small difference; so small, in
fact, that it's not even important.
Then why did the t-statistic come out significant? Because there was a large
sample size. When you have a large sample size, very small differences will be
detected as significant. This means that you are very sure that the difference is real
(i.e., it didn't happen by fluke). It doesn't mean that the difference is large or important. If
we had only given the IQ test to 25 people instead of 1,000, the two-point difference
between males and females would not have been significant.

Significance is a statistical term that tells how sure you are that a difference or
relationship exists. To say that a significant difference or relationship exists only tells
half the story. We might be very sure that a relationship exists, but is it a strong,
moderate, or weak relationship? After finding a significant relationship, it is important to
evaluate its strength. Significant relationships can be strong or weak. Significant
differences can be large or small. It just depends on your sample size.

Many researchers use the word "significant" to describe a finding that may have
decision-making utility to a client. From a statistician's viewpoint, this is an incorrect use
of the word. However, the word "significant" has virtually universal meaning to the
public. Thus, many researchers use the word "significant" to describe a difference or
relationship that may be strategically important to a client (regardless of any statistical
tests). In these situations, the word "significant" is used to advise a client to take note of
a particular difference or relationship because it may be relevant to the company's
strategic plan. The word "significant" is not the exclusive domain of statisticians and
either use is correct in the business world. Thus, for the statistician, it may be wise to
adopt a policy of always referring to "statistical significance" rather than simply
"significance" when communicating with the public.

One-Tailed and Two-Tailed Significance Tests

One important concept in significance testing is whether you use a one-tailed or
two-tailed test of significance. The answer is that it depends on your hypothesis. When
your research hypothesis states the direction of the difference or relationship, then you
use a one-tailed probability. For example, a one-tailed test would be used to test these
null hypotheses: Females will not score significantly higher than males on an IQ test.
Blue collar workers will not buy significantly more product than white collar workers.
Superman is not significantly stronger than the average person. In each case, the null
hypothesis (indirectly) predicts the direction of the difference. A two-tailed test would be
used to test these null hypotheses: There will be no significant difference in IQ scores
between males and females. There will be no significant difference in the amount of
product purchased between blue collar and white collar workers. There is no significant
difference in strength between Superman and the average person. For a symmetric
test statistic such as t or z, the one-tailed probability is exactly half the value of the
two-tailed probability.

There has been a raging controversy (for about the last hundred years) on whether or
not it is ever appropriate to use a one-tailed test. The rationale is that if you already
know the direction of the difference, why bother doing any statistical tests? While it is
generally safest to use two-tailed tests, there are situations where a one-tailed test
seems more appropriate. The bottom line is that it is the choice of the researcher
whether to use one-tailed or two-tailed research questions.

Procedure Used to Test for Significance

Whenever we perform a significance test, it involves comparing a test value that we
have calculated to some critical value for the statistic. It doesn't matter what type of
statistic we are calculating (e.g., a t-statistic, a chi-square statistic, an F-statistic, etc.),
the procedure to test for significance is the same.

1. Decide on the critical alpha level you will use (i.e., the error rate you are willing to
accept).
2. Conduct the research.
3. Calculate the statistic.
4. Compare the statistic to a critical value obtained from a table.

If your statistic is higher than the critical value from the table:

• Your finding is significant.

• You reject the null hypothesis.
• The probability is small that the difference or relationship happened by chance,
and p is less than the critical alpha level (p < alpha ).

If your statistic is lower than the critical value from the table:

• Your finding is not significant.

• You fail to reject the null hypothesis.
• The probability is high that the difference or relationship happened by chance,
and p is greater than the critical alpha level (p > alpha).

Modern computer software can calculate exact probabilities for most test statistics. If
you have an exact probability from computer software, simply compare it to your critical
alpha level. If the exact probability is less than the critical alpha level, your finding is
significant, and if the exact probability is greater than your critical alpha level, your
finding is not significant. Using a table is not necessary when you have the exact
probability for a statistic.


In hypothesis testing, there are two types of errors. The first is type I error and
the second is type II error.

Type I error
In hypothesis testing, a type I error occurs when we reject the null
hypothesis even though it is true. The probability of a type I error is denoted
by alpha. The part of the normal curve that shows the critical region is called
the alpha region. Even though it is unlikely that the test statistic will fall into the critical
region when the null hypothesis is true, it is still possible. When this occurs, we
reject H0 when it is in fact true, and therefore make an error in doing so.

The probability of rejecting H0 when it is true is called α, where
α = P(Type I error) = P(reject H0 | H0 is true).

Type II errors

In hypothesis testing, a type II error occurs when we fail to reject the null hypothesis
even though it is false. The probability of a type II error is denoted by beta. The part of
the normal curve that shows the acceptance region is called the beta region.

The probability of failing to reject H0 when it is false is called β, where
β = P(Type II error) = P(fail to reject H0 | H0 is false).
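
As an illustration of how α and β interact, the sketch below computes β for a right-tailed z-test. It reuses the cell phone figures from earlier (μ0 = 52, σ = 10, n = 64, α = 0.10). The "true" mean of 54 is a purely hypothetical value chosen for the illustration, and the function name `type_ii_error` is also an assumption.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(mu_0, mu_true, sigma, n, alpha):
    """Beta for a right-tailed z-test of H0: mu <= mu_0 when the mean is mu_true."""
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1 - alpha)
    # Distance between the true and hypothesized means, in standard errors.
    shift = (mu_true - mu_0) / (sigma / sqrt(n))
    # Probability that the test statistic still lands below the critical value.
    return std_normal.cdf(z_critical - shift)


beta = type_ii_error(mu_0=52, mu_true=54, sigma=10, n=64, alpha=0.10)
beta_strict = type_ii_error(mu_0=52, mu_true=54, sigma=10, n=64, alpha=0.01)
beta_large_n = type_ii_error(mu_0=52, mu_true=54, sigma=10, n=256, alpha=0.10)

print(f"beta = {beta:.3f}, power = {1 - beta:.3f}")
print(f"beta with alpha = 0.01: {beta_strict:.3f}")
print(f"beta with n = 256:      {beta_large_n:.3f}")
```

Lowering α raises β for the same sample size, while a larger sample lowers β, which is the trade-off between the two error types described earlier.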


The t-test (or student's t-test) gives an indication of the separateness of two sets
of measurements, and is thus used to check whether two sets of measures are
essentially different (and usually that an experimental effect has been demonstrated).
The typical way of doing this is with the null hypothesis that means of the two sets of
measures are equal.

The t-test assumes:

• A normal distribution (parametric data)

• Underlying variances are equal (if not, use Welch's test)

It is used when there is random assignment and only two sets of measurement to
compare.

There are two main types of t-test:

• Independent-measures t-test: when samples are not matched.

• Matched-pair t-test: when samples appear in pairs (e.g., before-and-after).

A single-sample t-test compares a sample against a known figure, for example
where measures of a manufactured item are compared against the required standard.

The value of t may be calculated using packages such as SPSS. The actual calculation
for two groups is:
t = experimental effect / variability
= difference between group means /
standard error of difference between group means

The resultant t-value is then looked up in a t-table to determine whether the
difference between the two sets of measures is statistically significant, and hence what
can be claimed about the efficacy of the experimental treatment.

The t-value can also be converted to a Pearson r-value to measure effect, which
can be calculated as:
r = SQRT( t^2 / (t^2 + DF))
where DF is the degrees of freedom.
In a t-test, DF = N1 + N2 - 2.
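
The conversion is a one-line calculation. The sketch below applies it to the figures used in the sample report in this section (t = 2.3 with 22 degrees of freedom). The function name `t_to_r` is hypothetical.

```python
from math import sqrt


def t_to_r(t: float, df: int) -> float:
    """Convert a t-value to a Pearson r effect size: r = sqrt(t^2 / (t^2 + df))."""
    return sqrt(t**2 / (t**2 + df))


r = t_to_r(t=2.3, df=22)
print(f"r = {r:.2f}")
```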

Reporting a t-test might look something like this:
On average, holidays in the south (M=24.1, SE=1.5) were significantly
preferred to holidays in the north (M=20.1, SE=1.2), t(22)=2.3, p<.05,
r=.44.
In this, 'M' is the mean and 'SE' the standard error of each sample. In 't(X)=Y', X is
the degrees of freedom and Y is the t-value. 'p' is the p-value, which is compared with
the chosen probability of a type I error, and 'r' is the effect size.

The t-test was described in 1908 by William Sealy Gosset for monitoring the
brewing at Guinness in Dublin. Guinness considered the use of statistics a trade secret,
so he published his test under the pen-name 'Student' -- hence the test is now often
called the 'Student's t-test'.
The t-test is a basic test that is limited to two groups. For multiple groups, you
would have to compare each pair of groups; for example, with three groups there would
be three tests (AB, AC, BC), whilst with seven groups there would need to be 21 tests.
The basic principle is to test the null hypothesis that the means of the two groups
are equal.
A significant problem with this is that we typically test each comparison at the 5%
significance level (p=0.05). Across multiple tests the chances of a false positive
accumulate and hence reduce the validity of the results.
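
A short sketch shows how quickly the number of pairwise tests and the chance of at least one false positive grow. The familywise error formula used here, 1 - (1 - α)^m, assumes the m tests are independent. That is a simplifying assumption and not something the text states, and both function names are hypothetical.

```python
from math import comb


def pairwise_tests(groups: int) -> int:
    """Number of distinct pairs of groups, i.e. t-tests needed."""
    return comb(groups, 2)


def familywise_error(tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one Type I error across independent tests."""
    return 1 - (1 - alpha) ** tests


for groups in (3, 7):
    m = pairwise_tests(groups)
    print(f"{groups} groups -> {m} tests, "
          f"familywise error ~ {familywise_error(m):.2f}")
```

This accumulation is one reason multi-group comparisons are usually handled with a single overall test or with a correction to the per-test significance level.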


The Z-test compares sample and population means to determine if there is a
significant difference.
It requires a simple random sample from a population with a Normal distribution
and where the population mean and standard deviation are known.
It is a statistical test of the null hypothesis that a population parameter μ is equal to a
given value μ0. We construct a z-statistic for the null hypothesis, i.e. a statistic which,
under the null hypothesis, has mean zero and approximately a standard normal
distribution. For a one-tailed test with probability p, we do not reject the null hypothesis
if z is less than z_p, where z_p is the pth percentile of the standard normal distribution.

The z measure is calculated as:
z = (x - μ) / SE
where x is the sample mean to be standardized, μ (mu) is the population
mean and SE is the standard error of the mean.
SE = σ / SQRT(n)
where σ is the population standard deviation and n is the sample size.
The z value is then looked up in a z-table. A negative z value means it is below the
population mean (the sign is ignored in the lookup table).

The Z-test is typically used with standardized tests, checking whether the scores from
a particular sample are within or outside the standard test performance.
The z value indicates the number of standard deviation units of the sample from the
population mean.


The correlation is one of the most common and most useful statistics. A
correlation is a single number that describes the degree of relationship between two
variables. Let's work through an example to show you how this statistic is computed.

Correlation Example

Let's assume that we want to look at the relationship between two variables,
height (in inches) and self esteem. Perhaps we have a hypothesis that how tall you are
affects your self esteem (incidentally, I don't think we have to worry about the direction
of causality here -- it's not likely that self esteem causes your height!). Let's say we
collect some information on twenty individuals (all male -- we know that the average
height differs for males and females so, to keep this example simple we'll just use
males). Height is measured in inches. Self esteem is measured based on the average
of 10 1-to-5 rating items (where higher scores mean higher self esteem). Here's the
data for the 20 cases (don't take this too seriously -- I made this data up to illustrate
what a correlation is):

Person Height Self Esteem

1 68 4.1
2 71 4.6
3 62 3.8
4 75 4.4
5 58 3.2
6 60 3.1
7 67 3.8
8 68 4.1
9 71 4.3
10 69 3.7
11 68 3.5
12 67 3.2
13 63 3.7
14 62 3.3
15 60 3.4
16 63 4.0
17 65 4.1
18 67 3.8
19 63 3.4
20 61 3.6

Now, let's take a quick look at the histogram for each variable:

And, here are the descriptive statistics:

Variable      Mean    StDev     Variance  Sum   Minimum  Maximum  Range

Height        65.4    4.40574   19.4105   1308  58       75       17
Self Esteem   3.755   0.426090  0.181553  75.1  3.1      4.6      1.5

Finally, we'll look at the simple bivariate (i.e., two-variable) plot:

You should immediately see in the bivariate plot that the relationship between the
variables is a positive one (if you can't see that, review the section on types of
relationships) because if you were to fit a single straight line through the dots it would
have a positive slope or move up from left to right. Since the correlation is nothing more
than a quantitative estimate of the relationship, we would expect a positive correlation.

What does a "positive relationship" mean in this context? It means that, in
general, higher scores on one variable tend to be paired with higher scores on the other
and that lower scores on one variable tend to be paired with lower scores on the other.
You should confirm visually that this is generally true in the plot above.

Calculating the Correlation

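Now we're ready to compute the correlation value. The formula for the correlation is:

r = [NΣxy − (Σx)(Σy)] / √{[NΣx² − (Σx)²][NΣy² − (Σy)²]}

where N is the number of pairs of scores, x is height and y is self esteem.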

We use the symbol r to stand for the correlation. Through the magic of
mathematics it turns out that r will always be between -1.0 and +1.0. If the correlation is
negative, we have a negative relationship; if it's positive, the relationship is positive. You
don't need to know how we came up with this formula unless you want to be a
statistician. But you probably will need to know how the formula relates to real data --
how you can use the formula to compute the correlation. Let's look at the data we need
for the formula. Here's the original data with the other necessary columns:

Person  Height (x)  Self Esteem (y)  x*y  x*x  y*y
1 68 4.1 278.8 4624 16.81
2 71 4.6 326.6 5041 21.16
3 62 3.8 235.6 3844 14.44
4 75 4.4 330 5625 19.36
5 58 3.2 185.6 3364 10.24
6 60 3.1 186 3600 9.61
7 67 3.8 254.6 4489 14.44
8 68 4.1 278.8 4624 16.81
9 71 4.3 305.3 5041 18.49
10 69 3.7 255.3 4761 13.69
11 68 3.5 238 4624 12.25
12 67 3.2 214.4 4489 10.24
13 63 3.7 233.1 3969 13.69
14 62 3.3 204.6 3844 10.89
15 60 3.4 204 3600 11.56
16 63 4 252 3969 16
17 65 4.1 266.5 4225 16.81
18 67 3.8 254.6 4489 14.44
19 63 3.4 214.2 3969 11.56
20 61 3.6 219.6 3721 12.96
Sum = 1308 75.1 4937.6 85912 285.45

The first three columns are the same as in the table above. The next three
columns are simple computations based on the height and self esteem data. The
bottom row consists of the sum of each column. This is all the information we need to
compute the correlation. Here are the values from the bottom row of the table (where N
is 20 people) as they are related to the symbols in the formula:

N = 20
Σxy = 4937.6
Σx = 1308
Σy = 75.1
Σx² = 85912
Σy² = 285.45

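Now, when we plug these values into the formula given above, we get the
following (I show it here tediously, one step at a time):

r = (20 × 4937.6 − 1308 × 75.1) / √[(20 × 85912 − 1308²)(20 × 285.45 − 75.1²)]
  = (98752 − 98230.8) / √[(1718240 − 1710864)(5709 − 5640.01)]
  = 521.2 / √(7376 × 68.99)
  = 521.2 / √508870.24
  = 521.2 / 713.35
  = .73

So, the correlation for our twenty cases is .73, which is a fairly strong positive
relationship. I guess there is a relationship between height and self esteem, at least in
this made-up data!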
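
The hand calculation can be cross-checked in code. This sketch recomputes the column sums from the twenty cases listed in the table and applies the usual computational formula for the Pearson correlation. The function name `pearson_r` is hypothetical.

```python
from math import sqrt

heights = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
           68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
self_esteem = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
               3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]


def pearson_r(x, y):
    """Pearson correlation from the column sums used in the worked table."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_xx - sum_x**2) * (n * sum_yy - sum_y**2))
    return numerator / denominator


r = pearson_r(heights, self_esteem)
print(f"r = {r:.2f}")
```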

Testing the Significance of a Correlation

Once you've computed a correlation, you can ask how likely it is that a correlation
this large would arise by chance alone. That is, you can conduct a significance test.
Most often you are interested in determining whether the correlation is a real
one and not a chance occurrence. In this case, you are testing the mutually
exclusive hypotheses:

Null Hypothesis: r = 0

Alternative Hypothesis: r ≠ 0

The easiest way to test this hypothesis is to find a statistics book that has a table
of critical values of r. Most introductory statistics texts would have a table like this. As in
all hypothesis testing, you need to first determine the significance level. Here, I'll use
the common significance level of alpha = .05. This means that I am conducting a test
where, if there were truly no correlation, a result this extreme would turn up by chance
no more than 5 times out of 100. Before I look up the critical value in a table I also have
to compute the degrees of freedom or df. The df is simply equal to N-2 or, in this
example, is 20-2 = 18. Finally, I
have to decide whether I am doing a one-tailed or two-tailed test. In this example, since
I have no strong prior theory to suggest whether the relationship between height and
self esteem would be positive or negative, I'll opt for the two-tailed test. With these three
pieces of information -- the significance level (alpha = .05), degrees of freedom (df =
18), and type of test (two-tailed) -- I can now test the significance of the correlation I
found. When I look up this value in the handy little table at the back of my statistics book
I find that the critical value is .4438. This means that if my correlation is greater than
.4438 or less than -.4438 (remember, this is a two-tailed test) I can conclude that a
correlation this large would arise by chance less than 5 times out of 100 if there were
truly no relationship. Since my correlation of
.73 is actually quite a bit higher, I conclude that it is not a chance finding and that the
correlation is "statistically significant" (given the parameters of the test). I can reject the
null hypothesis and accept the alternative.
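
The critical value of r in such a table is tied to the critical value of t with the same degrees of freedom. The sketch below assumes the standard relationship t = r * sqrt(df / (1 - r^2)) and takes the two-tailed critical t for alpha = .05 and df = 18 (2.101) from a t-table rather than computing it. Both function names are hypothetical.

```python
from math import sqrt

T_CRITICAL_DF18 = 2.101  # two-tailed, alpha = .05, df = 18 (t-table value)


def critical_r(t_critical: float, df: int) -> float:
    """Smallest |r| that is significant, given the critical t for the same df."""
    return t_critical / sqrt(t_critical**2 + df)


def r_to_t(r: float, df: int) -> float:
    """t-statistic for testing whether a correlation differs from zero."""
    return r * sqrt(df / (1 - r**2))


r_crit = critical_r(T_CRITICAL_DF18, df=18)
t_observed = r_to_t(0.73, df=18)

print(f"critical r = {r_crit:.4f}")
print(f"t for r = .73: {t_observed:.2f} (critical t = {T_CRITICAL_DF18})")
```

Comparing r with the critical r and comparing t with the critical t are the same test, so they lead to the same conclusion.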

The Correlation Matrix

All I've shown you so far is how to compute a correlation between two variables.
In most studies we have considerably more than two variables. Let's say we have a
study with 10 interval-level variables and we want to estimate the relationships among
all of them (i.e., between all possible pairs of variables). In this instance, we have 45
unique correlations to estimate (more later on how I knew that!). We could do the above
computations 45 times to obtain the correlations. Or we could use just about any
statistics program to automatically compute all 45 with a simple click of the mouse.
I used a simple statistics program to generate random data for 10 variables with 20
cases (i.e., persons) for each variable. Then, I told the program to compute the
correlations among these variables. Here's the result:

        C1      C2      C3      C4      C5      C6      C7      C8      C9     C10
C1    1.000
C2    0.274   1.000
C3   -0.134  -0.269   1.000
C4    0.201  -0.153   0.075   1.000
C5   -0.129  -0.166   0.278  -0.011   1.000
C6   -0.095   0.280  -0.348  -0.378  -0.009   1.000
C7    0.171  -0.122   0.288   0.086   0.193   0.002   1.000
C8    0.219   0.242  -0.380  -0.227  -0.551   0.324  -0.082   1.000
C9    0.518   0.238   0.002   0.082  -0.015   0.304   0.347  -0.013   1.000
C10   0.299   0.568   0.165  -0.122  -0.106  -0.169   0.243   0.014   0.352   1.000

This type of table is called a correlation matrix. It lists the variable names (C1-
C10) down the first column and across the first row. The diagonal of a correlation matrix
(i.e., the numbers that go from the upper left corner to the lower right) always consists of
ones. That's because these are the correlations between each variable and itself (and a
variable is always perfectly correlated with itself). This statistical program only shows
the lower triangle of the correlation matrix. In every correlation matrix there are two
triangles that are the values below and to the left of the diagonal (lower triangle) and
above and to the right of the diagonal (upper triangle). There is no reason to print both
triangles because the two triangles of a correlation matrix are always mirror images of
each other (the correlation of variable x with variable y is always equal to the correlation
of variable y with variable x). When a matrix has this mirror-image quality above and
below the diagonal we refer to it as a symmetric matrix. A correlation matrix is always a
symmetric matrix.

To locate the correlation for any pair of variables, find the value in the table for
the row and column intersection for those two variables. For instance, to find the
correlation between variables C5 and C2, I look for where row C2 and column C5 is (in
this case it's blank because it falls in the upper triangle area) and where row C5 and
column C2 is and, in the second case, I find that the correlation is -.166.

OK, so how did I know that there are 45 unique correlations when we have 10
variables? There's a handy simple little formula that tells how many pairs (e.g.,
correlations) there are for any number of variables:

N * (N - 1) / 2

where N is the number of variables. In the example, I had 10 variables, so I know I have
(10 * 9)/2 = 90/2 = 45 pairs.
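
A statistics program does roughly the following when it builds the lower triangle of a correlation matrix. This is a sketch only: it generates its own random data for 10 variables with 20 cases, using a fixed seed chosen for the illustration, so its values will not match the matrix printed above. The helper name `pearson_r` is hypothetical.

```python
import random
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable
variables = {
    f"C{i}": [rng.gauss(0, 1) for _ in range(20)] for i in range(1, 11)
}

# Key each correlation by (row, column), with the later variable as the row,
# which is how the lower triangle of the matrix is laid out.
lower_triangle = {
    (row, col): pearson_r(variables[col], variables[row])
    for col, row in combinations(variables, 2)
}

print(f"unique correlations: {len(lower_triangle)}")
print(f"C5 with C2: {lower_triangle[('C5', 'C2')]:.3f}")
```

Only one triangle is stored because the correlation of C5 with C2 is the same number as the correlation of C2 with C5.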

Other Correlations

The specific type of correlation I've illustrated here is known as the Pearson
Product Moment Correlation. It is appropriate when both variables are measured at
an interval level. However, there are a wide variety of other types of correlations for
other circumstances. For instance, if you have two ordinal variables, you could use the
Spearman Rank Order Correlation (rho) or the Kendall Rank Order Correlation (tau).
When one measure is a continuous interval-level one and the other is dichotomous (i.e.,
two-category) you can use the Point-Biserial Correlation.

Regression Analysis

Simple regression is used to examine the relationship between one dependent
and one independent variable. After performing an analysis, the regression statistics
can be used to predict the dependent variable when the independent variable is known.
Regression goes beyond correlation by adding prediction capabilities.

People use regression on an intuitive level every day. In business, a well-dressed
man is thought to be financially successful. A mother knows that more sugar in her
children's diet results in higher energy levels. The ease of waking up in the morning
often depends on how late you went to bed the night before. Quantitative regression
adds precision by developing a mathematical formula that can be used for predictive
purposes.

For example, a medical researcher might want to use body weight (independent
variable) to predict the most appropriate dose for a new drug (dependent variable). The
purpose of running the regression is to find a formula that fits the relationship between
the two variables. Then you can use that formula to predict values for the dependent
variable when only the independent variable is known. A doctor could prescribe the
proper dose based on a person's body weight.

The regression line (known as the least squares line) is a plot of the expected
value of the dependent variable for all values of the independent variable. Technically, it
is the line that "minimizes the squared residuals". The regression line is the one that
best fits the data on a scatterplot.

Using the regression equation, the dependent variable may be predicted from the
independent variable. The slope of the regression line (b) is defined as the rise divided
by the run. The y intercept (a) is the point on the y axis where the regression line would
intercept the y axis. The slope and y intercept are incorporated into the regression
equation. The intercept is usually called the constant, and the slope is referred to as the
coefficient. Since the regression model is usually not a perfect predictor, there is also an
error term in the equation.

In the regression equation, y is always the dependent variable and x is always
the independent variable. Here are three equivalent ways to mathematically describe a
linear regression model.

y = intercept + (slope x) + error

y = constant + (coefficient x) + error

y = a + bx + e

The significance of the slope of the regression line is determined from the
t-statistic. The p-value associated with it is the probability of observing a slope (or,
equivalently, a correlation coefficient) at least this far from zero by chance
if the true correlation is zero. Some researchers prefer to report the F-ratio instead of
the t-statistic. In simple regression, the F-ratio is equal to the t-statistic squared.

The t-statistic for the significance of the slope is essentially a test to determine if
the regression model (equation) is usable. If the slope is significantly different than zero,
then we can use the regression model to predict the dependent variable for any value of
the independent variable.

On the other hand, take an example where the slope is zero. It has no prediction
ability because for every value of the independent variable, the prediction for the
dependent variable would be the same. Knowing the value of the independent variable
would not improve our ability to predict the dependent variable. Thus, if the slope is not
significantly different than zero, don't use the model to make predictions.

The coefficient of determination (r-squared) is the square of the correlation
coefficient. Its value may vary from zero to one. It has the advantage over the
correlation coefficient in that it may be interpreted directly as the proportion of variance
in the dependent variable that can be accounted for by the regression equation. For
example, an r-squared value of .49 means that 49% of the variance in the dependent
variable can be explained by the regression equation. The other 51% is unexplained.

The standard error of the estimate for regression measures the amount of
variability in the points around the regression line. It is the standard deviation of the data
points as they are distributed around the regression line. The standard error of the
estimate can be used to develop confidence intervals around a prediction.


A company wants to know if there is a significant relationship between its
advertising expenditures and its sales volume. The independent variable is advertising
budget and the dependent variable is sales volume. A lag time of one month will be
used because sales are expected to lag behind actual advertising expenditures. Data
was collected for a six month period. All figures are in thousands of dollars. Is there a
significant relationship between advertising budget and sales volume?

Indep. Var.    Depen. Var.
4.2            27.1
6.1            30.4
3.9            25.0
5.7            29.7
7.3            40.1
5.9            28.8


Model: y = 10.079 + (3.700 x) + error

Standard error of the estimate = 2.568
t-test for the significance of the slope = 4.095
Degrees of freedom = 4
Two-tailed probability = .0149
r-squared = .807
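
As a rough check on how such a summary is produced, here is a minimal Python sketch of the least-squares calculations applied to the six data pairs listed above. The function and key names are illustrative assumptions, not the output of any particular statistics package. With the table as printed, the fit comes out close to, but not identical with, the reported figures (a slope of about 3.68 and an r-squared of about 0.80), which may reflect rounding or a transcription difference in the table.

```python
import math

# Data pairs as printed in the table above (thousands of dollars).
advertising = [4.2, 6.1, 3.9, 5.7, 7.3, 5.9]   # independent variable (x)
sales = [27.1, 30.4, 25.0, 29.7, 40.1, 28.8]   # dependent variable (y)

def simple_regression(x, y):
    """Least-squares fit of y = a + b*x, with the usual summary statistics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx                      # b: rise divided by run
    intercept = mean_y - slope * mean_x    # a: the constant

    # Sum of squared residuals around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2
    std_error_estimate = math.sqrt(sse / df)
    t_slope = slope / (std_error_estimate / math.sqrt(sxx))
    r_squared = 1 - sse / syy

    return {
        "intercept": intercept,
        "slope": slope,
        "std_error_estimate": std_error_estimate,
        "t": t_slope,
        "df": df,
        "r_squared": r_squared,
    }

result = simple_regression(advertising, sales)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The two-tailed probability would then be read from a t table (or a statistics library) using the computed t and its degrees of freedom.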

You might make a statement in a report like this: A simple linear regression was
performed on six months of data to determine if there was a significant relationship
between advertising expenditures and sales volume. The t-statistic for the slope was
significant at the .05 critical alpha level, t(4)=4.10, p=.015. Thus, we reject the null
hypothesis and conclude that there was a significant positive relationship between
advertising expenditures and sales volume. Furthermore, 80.7% of the variability in
sales volume could be explained by advertising expenditures.

In statistics, analysis of variance (ANOVA) is a collection of statistical models,
and their associated procedures, in which the observed variance in a particular variable
is partitioned into components due to different sources of variation. In its simplest form
ANOVA provides a statistical test of whether or not the means of several groups are all
equal, and therefore generalizes Student's two-sample t-test to more than two groups.
ANOVAs are helpful because they possess an advantage over a two-sample t-test.
Doing multiple two-sample t-tests would result in an increased chance of committing a
type I error. For this reason, ANOVAs are useful in comparing three or more means.
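
The inflation of the type I error rate can be illustrated with a short calculation. The sketch below assumes, as a simplification, that the pairwise tests are independent (in practice they share data, so the figure is only an approximation), and the function name is illustrative.

```python
from math import comb

def familywise_error(n_groups: int, alpha: float = 0.05):
    """Approximate chance of at least one type I error when every pair of
    groups is compared with its own t-test (tests assumed independent)."""
    n_tests = comb(n_groups, 2)              # number of pairwise comparisons
    return n_tests, 1 - (1 - alpha) ** n_tests

for k in (3, 5):
    n_tests, rate = familywise_error(k)
    print(k, "groups:", n_tests, "tests, error rate about", round(rate, 3))
# 3 groups: 3 tests, about 0.143
# 5 groups: 10 tests, about 0.401
```

Even with three groups, the chance of at least one false rejection is nearly three times the nominal .05 level, which is the motivation for a single overall F-test.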

There are three classes of ANOVA models:

1. Fixed-effects models assume that the data came from normal populations which
may differ only in their means. (Model 1)
2. Random effects models assume that the data describe a hierarchy of different
populations whose differences are constrained by the hierarchy. (Model 2)
3. Mixed-effect models describe the situations where both fixed and random effects
are present. (Model 3)

In practice, there are several types of ANOVA, depending on the number of treatments
and the way they are applied to the subjects in the experiment:

• One-way ANOVA is used to test for differences among two or more independent
groups. Typically, however, the one-way ANOVA is used to test for differences
among at least three groups, since the two-group case can be covered by a t-test
(Gosset, 1908). When there are only two means to compare, the t-test and the
ANOVA F-test are equivalent; the relation between ANOVA and t is given by
F = t^2.
• Factorial ANOVA is used when the experimenter wants to study the effects of
two or more treatment variables. The most commonly used type of factorial
ANOVA is the 2×2 (read "two by two") design, where there are two independent
variables and each variable has two levels or distinct values. However, such use
of ANOVA for analysis of 2^k factorial designs and fractional factorial designs is
"confusing and makes little sense"; instead it is suggested to refer the value of
the effect divided by its standard error to a t-table. Factorial ANOVA can also be
multi-level such as 3×3, etc. or higher order such as 2×2×2, etc. Since the
introduction of data analytic software, the utilization of higher order designs and
analyses has become quite common.
• Repeated measures ANOVA is used when the same subjects are used for each
treatment (e.g., in a longitudinal study). Note that such within-subjects designs
can be subject to carry-over effects.
• Mixed-design ANOVA. When one wishes to test two or more independent
groups subjecting the subjects to repeated measures, one may perform a
factorial mixed-design ANOVA, in which one factor is a between-subjects
variable and the other is a within-subjects variable. This is a type of mixed-effect model.
• Multivariate analysis of variance (MANOVA) is used when there is more than
one dependent variable.
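
The equivalence of the one-way ANOVA F-test and the two-sample t-test for two groups can be checked numerically. The sketch below uses made-up scores for two hypothetical groups; the function names are illustrative, and the t-test shown is the pooled-variance (equal variances assumed) version.

```python
from statistics import mean

def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance (equal variances assumed)."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((v - ma) ** 2 for v in a)
    ss_b = sum((v - mb) ** 2 for v in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def one_way_f(*groups):
    """One-way ANOVA F-ratio: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up scores for two hypothetical groups.
group_a = [4.1, 5.0, 5.6, 6.2, 4.8]
group_b = [6.0, 6.9, 7.4, 5.8, 7.1]

t = pooled_t(group_a, group_b)
f = one_way_f(group_a, group_b)
print(t ** 2, f)  # the two values agree up to floating-point rounding
```

The same one_way_f function accepts three or more groups, which is the case where a single F-test replaces a series of pairwise t-tests.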

Chi Square Test

Pearson's chi-square is used to assess two types of comparison: tests of
goodness of fit and tests of independence. A test of goodness of fit establishes whether
or not an observed frequency distribution differs from a theoretical distribution. A test of
independence assesses whether paired observations on two variables, expressed in a
contingency table, are independent of each other – for example, whether people from
different regions differ in the frequency with which they report that they support a
political candidate.

The first step in the chi-square test is to calculate the chi-square statistic. In order
to avoid ambiguity, the value of the test statistic is denoted by Χ² rather than χ² (i.e.,
uppercase chi instead of lowercase); this also serves as a reminder that the distribution
of the test statistic is not exactly that of a chi-square random variable. However, some
authors do use the χ² notation for the test statistic. An exact test that does not rely on
the approximate χ² distribution is Fisher's exact test, which is considerably more
accurate in evaluating the significance level of the test, especially with small numbers of
observations.

The chi-square statistic is calculated by finding the difference between each
observed and theoretical frequency for each possible outcome, squaring them, dividing
each by the theoretical frequency, and taking the sum of the results. A second important
part of determining the test statistic is to define the degrees of freedom of the test: this
is essentially the number of observed frequencies adjusted for the effect of using some
of those observations to define the "theoretical frequencies".
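
The calculation described above can be sketched in a few lines of Python. The example reuses the coin-flip data from the beginning of this document (40 Heads and 10 Tails in 50 flips, against 25 and 25 expected under a fair coin); the function name is illustrative.

```python
def chi_square_statistic(observed, expected):
    """Sum over outcomes of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [40, 10]   # Heads, Tails actually counted
expected = [25, 25]   # Heads, Tails expected if P = 0.5

statistic = chi_square_statistic(observed, expected)

# One degree of freedom: two categories, minus one because the
# expected counts are constrained to add up to the same total.
degrees_of_freedom = len(observed) - 1

print(statistic, degrees_of_freedom)  # 18.0 1
```

A value of 18.0 is far above 3.841, the .05 critical value of the chi-square distribution with one degree of freedom, which agrees with the earlier conclusion that the coin is probably not fair.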

Nonparametric Statistics

General Purpose:

Brief review of the idea of significance testing.

Understanding the idea of nonparametric statistics (the term nonparametric was
first used by Wolfowitz, 1942) first requires a basic understanding of parametric
statistics. Elementary Concepts introduces the concept of statistical significance
testing based on the sampling distribution of a particular statistic (you may want to
review that topic before reading on). In short, if we have a basic knowledge of the
underlying distribution of a variable, then we can make predictions about how, in
repeated samples of equal size, this particular statistic will "behave," that is, how it is
distributed. For example, if we draw 100 random samples of 100 adults each from the
general population, and compute the mean height in each sample, then the distribution
of the standardized means across samples will likely approximate the normal
distribution (to be precise, Student's t distribution with 99 degrees of freedom; see
below). Now imagine that we take an additional sample in a particular city ("Tallburg")
where we suspect that people are taller than the average population. If the mean height
in that sample falls outside the upper 95% tail area of the t distribution then we conclude
that, indeed, the people of Tallburg are taller than the average population.

Are most variables normally distributed?

In the above example we relied on our knowledge that, in repeated samples of
equal size, the standardized means (for height) will be distributed following the t
distribution (with a particular mean and variance). However, this will only be true if in the
population the variable of interest (height in our example) is normally distributed, that is,
if the distribution of people of particular heights follows the normal distribution (the bell-
shape distribution).

For many variables of interest, we simply do not know for sure that this is the
case. For example, is income distributed normally in the population? -- probably not.
The incidence rates of rare diseases are not normally distributed in the population, the
number of car accidents is also not normally distributed, and neither are very many
other variables in which a researcher might be interested.

For more information on the normal distribution, see Elementary Concepts; for
information on tests of normality, see Normality tests.

Sample size

Another factor that often limits the applicability of tests based on the assumption
that the sampling distribution is normal is the size of the sample of data available for the
analysis (sample size; n). We can assume that the sampling distribution is normal even
if we are not sure that the distribution of the variable in the population is normal, as long
as our sample is large enough (e.g., 100 or more observations). However, if our sample
is very small, then those tests can be used only if we are sure that the variable is
normally distributed, and there is no way to test this assumption if the sample is small.

Problems in measurement

Applications of tests that are based on the normality assumptions are further
limited by a lack of precise measurement. For example, let us consider a study where
grade point average (GPA) is measured as the major variable of interest. Is an A
average twice as good as a C average? Is the difference between a B and an A
average comparable to the difference between a D and a C average? Somehow, the
GPA is a crude measure of scholastic accomplishments that only allows us to establish
a rank ordering of students from "good" students to "poor" students. This general
measurement issue is usually discussed in statistics textbooks in terms of types of
measurement or scale of measurement. Without going into too much detail, most
common statistical techniques such as analysis of variance (and t-tests), regression,
etc., assume that the underlying measurements are at least of interval level, meaning that
equally spaced intervals on the scale can be compared in a meaningful manner (e.g., B
minus A is equal to D minus C). However, as in our example, this assumption is very
often not tenable, and the data represent a rank ordering of observations
(ordinal) rather than precise measurements.

Parametric and nonparametric methods

Hopefully, after this somewhat lengthy introduction, the need is evident for
statistical procedures that enable us to process data of "low quality," from small
samples, on variables about which nothing is known (concerning their distribution).
Specifically, nonparametric methods were developed to be used in cases when the
researcher knows nothing about the parameters of the variable of interest in the
population (hence the name nonparametric). In more technical terms, nonparametric
methods do not rely on the estimation of parameters (such as the mean or the standard
deviation) describing the distribution of the variable of interest in the population.
Therefore, these methods are also sometimes (and more appropriately) called
parameter-free methods or distribution-free methods.

Brief Overview of Nonparametric Methods

Basically, there is at least one nonparametric equivalent for each general type of
parametric test. In general, these tests fall into the following categories:

• Tests of differences between groups (independent samples);
• Tests of differences between variables (dependent samples);
• Tests of relationships between variables.

Differences between independent groups

Usually, when we have two samples that we want to compare concerning their
mean value for some variable of interest, we would use the t-test for independent
samples; nonparametric alternatives for this test are the Wald-Wolfowitz runs test, the
Mann-Whitney U test, and the Kolmogorov-Smirnov two-sample test. If we have multiple
groups, we would use analysis of variance (see ANOVA/MANOVA); the nonparametric
equivalents to this method are the Kruskal-Wallis analysis of ranks and the Median test.

Differences between dependent groups

If we want to compare two variables measured in the same sample we would
customarily use the t-test for dependent samples (see Basic Statistics); for example, we
might want to compare students' math skills at the beginning of the semester with their
skills at the end of the semester. Nonparametric alternatives to this test are the Sign
test and Wilcoxon's matched pairs test. If the variables of interest are dichotomous in
nature (i.e., "pass" vs. "no pass") then McNemar's Chi-square test is appropriate. If
there are more than two variables that were measured in the same sample, then we
would customarily use repeated measures ANOVA. Nonparametric alternatives to this
method are Friedman's two-way analysis of variance and Cochran Q test (if the variable
was measured in terms of categories, e.g., "passed" vs. "failed"). Cochran Q is
particularly useful for measuring changes in frequencies (proportions) across time.
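
To make the Sign test concrete, here is a minimal Python sketch of its exact two-sided version. The before and after scores are invented for illustration (they stand in for the beginning-of-semester and end-of-semester math scores mentioned above), and the function name is an assumption. Pairs with no change are dropped, which is the usual convention.

```python
from math import comb

def sign_test(before, after):
    """Exact two-sided sign test for paired observations.

    Only the direction of each change is used, not its size.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]  # drop ties
    n = len(diffs)
    n_positive = sum(d > 0 for d in diffs)
    # Under the null hypothesis, increases and decreases are equally
    # likely, so the number of increases is Binomial(n, 0.5).
    k = min(n_positive, n - n_positive)
    one_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * one_tail)
    return n_positive, n, p_value

# Invented scores for eight hypothetical students.
before = [61, 70, 58, 75, 66, 80, 72, 64]
after = [68, 74, 57, 82, 71, 85, 79, 70]

n_positive, n, p_value = sign_test(before, after)
print(n_positive, n, p_value)  # 7 8 0.0703125
```

Because the test ignores the size of each change, seven improvements out of eight are not quite significant at the .05 level here; a test that also uses the magnitudes, such as Wilcoxon's matched pairs test, would typically be more sensitive.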

Relationships between variables

To express a relationship between two variables one usually computes the
correlation coefficient. Nonparametric equivalents to the standard correlation coefficient
are Spearman R, Kendall Tau, and coefficient Gamma (see Nonparametric
correlations). If the two variables of interest are categorical in nature (e.g., "passed" vs.
"failed" by "male" vs. "female") appropriate nonparametric statistics for testing the
relationship between the two variables are the Chi-square test, the Phi coefficient, and
the Fisher exact test. In addition, a simultaneous test for relationships between multiple
cases is available: Kendall coefficient of concordance. This test is often used for
expressing inter-rater agreement among independent judges who are rating (ranking)
the same stimuli.

Descriptive statistics

When one's data are not normally distributed, and the measurements at best
contain rank order information, then computing the standard descriptive statistics (e.g.,
mean, standard deviation) is sometimes not the most informative way to summarize the
data. For example, in the area of psychometrics it is well known that the rated intensity
of a stimulus (e.g., perceived brightness of a light) is often a logarithmic function of the
actual intensity of the stimulus (brightness as measured in objective units of Lux). In this
example, the simple mean rating (sum of ratings divided by the number of stimuli) is not
an adequate summary of the average actual intensity of the stimuli. (In this example,
one would probably rather compute the geometric mean.) Nonparametrics and
Distributions will compute a wide variety of measures of location (mean, median,
mode, etc.) and dispersion (variance, average deviation, quartile range, etc.) to provide
the "complete picture" of one's data.

When to Use Which Method

It is not easy to give simple advice concerning the use of nonparametric
procedures. Each nonparametric procedure has its peculiar sensitivities and blind spots.
For example, the Kolmogorov-Smirnov two-sample test is not only sensitive to
differences in the location of distributions (for example, differences in means) but is also
greatly affected by differences in their shapes. The Wilcoxon matched pairs test
assumes that one can rank order the magnitude of differences in matched observations
in a meaningful manner. If this is not the case, one should rather use the Sign test. In
general, if the result of a study is important (e.g., does a very expensive and painful
drug therapy help people get better?), then it is always advisable to run different
nonparametric tests; should discrepancies in the results occur contingent upon which
test is used, one should try to understand why some tests give different results. On the
other hand, nonparametric statistics are less statistically powerful (sensitive) than their
parametric counterparts, and if it is important to detect even small effects (e.g., is this
food additive harmful to people?) one should be very careful in the choice of a test.

Large data sets and nonparametric methods

Nonparametric methods are most appropriate when the sample sizes are small.
When the data set is large (e.g., n > 100) it often makes little sense to use
nonparametric statistics at all. Elementary Concepts briefly discusses the idea of the
central limit theorem. In a nutshell, when the samples become very large, then the
sample means will follow the normal distribution even if the respective variable is not
normally distributed in the population, or is not measured very well. Thus, parametric
methods, which are usually much more sensitive (i.e., have more statistical power) are
in most cases appropriate for large samples. However, the tests of significance of many
of the nonparametric statistics described here are based on asymptotic (large sample)
theory; therefore, meaningful tests can often not be performed if the sample sizes
become too small. Please refer to the descriptions of the specific tests to learn more
about their power and efficiency.

Nonparametric Correlations

The following are three types of commonly used nonparametric correlation
coefficients (Spearman R, Kendall Tau, and Gamma coefficients). Note that the
chi-square statistic computed for two-way frequency tables also provides a careful
measure of a relation between the two (tabulated) variables, and unlike the correlation
measures listed below, it can be used for variables that are measured on a simple
nominal scale.

Spearman R

Spearman R (Siegel & Castellan, 1988) assumes that the variables under
consideration were measured on at least an ordinal (rank order) scale, that is, that the
individual observations can be ranked into two ordered series. Spearman R can be
thought of as the regular Pearson product moment correlation coefficient, that is, in
terms of proportion of variability accounted for, except that Spearman R is computed
from ranks.

Kendall tau

Kendall tau is equivalent to Spearman R with regard to the underlying
assumptions. It is also comparable in terms of its statistical power. However, Spearman
R and Kendall tau are usually not identical in magnitude because their underlying logic
as well as their computational formulas are very different. Siegel and Castellan (1988)
express the relationship of the two measures in terms of the inequality:

-1 ≤ 3 * Kendall tau - 2 * Spearman R ≤ 1

More importantly, Kendall tau and Spearman R imply different interpretations: Spearman R
can be thought of as the regular Pearson product moment correlation coefficient, that is,
in terms of proportion of variability accounted for, except that Spearman R is computed
from ranks. Kendall tau, on the other hand, represents a probability, that is, it is the
difference between the probability that in the observed data the two variables are in the
same order versus the probability that the two variables are in different orders.


The Gamma statistic (Siegel & Castellan, 1988) is preferable to Spearman R or
Kendall tau when the data contain many tied observations. In terms of the underlying
assumptions, Gamma is equivalent to Spearman R or Kendall tau; in terms of its
interpretation and computation it is more similar to Kendall tau than Spearman R. In
short, Gamma is also a probability; specifically, it is computed as the difference between
the probability that the rank orderings of the two variables agree and the probability
that they disagree, divided by 1 minus the probability of ties. Thus, Gamma is basically
equivalent to Kendall tau, except that ties are explicitly taken into account.
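
The three coefficients can be compared on a small example. The following Python sketch uses two made-up rankings of five items and assumes there are no tied values (so the simple rank-difference formula for Spearman R applies and Kendall tau is the plain "tau-a" version); the function names are illustrative.

```python
from itertools import combinations

def ranks(values):
    """Rank from 1 (smallest) upward. Assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_r(x, y):
    """Spearman R from rank differences (valid when there are no ties)."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

def concordance_counts(x, y):
    """Count pairs ordered the same way (concordant) or oppositely (discordant)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return concordant, discordant

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    c, d = concordance_counts(x, y)
    n = len(x)
    return (c - d) / (n * (n - 1) / 2)

def gamma(x, y):
    """Gamma: like tau, but tied pairs are left out of the denominator."""
    c, d = concordance_counts(x, y)
    return (c - d) / (c + d)

# Two made-up rankings of five items.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

rho, tau, g = spearman_r(x, y), kendall_tau(x, y), gamma(x, y)
print(rho, tau, g)  # Spearman R is about 0.8; tau and Gamma are both 0.6 here
```

With no ties, Gamma and Kendall tau coincide, and the quantity 3 * tau - 2 * rho falls between -1 and 1 as the inequality requires. With many ties the two denominators differ, which is where Gamma becomes the preferred measure.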

Parametric tests

Restrictions of parametric tests

Conventional statistical procedures are also called parametric tests. In a parametric
test a sample statistic is obtained to estimate the population parameter. Because this
estimation process involves a sample, a sampling distribution, and a population, certain
parametric assumptions are required to ensure all components are compatible with
each other. For example, in Analysis of Variance (ANOVA) there are three assumptions:
• Observations are independent.
• The sample data have a normal distribution.
• Scores in different groups have homogeneous variances.

In a repeated measure design, it is assumed that the data structure conforms to the
compound symmetry. A regression model assumes the absence of collinearity, the
absence of autocorrelation, random residuals, linearity, etc. In structural equation
modeling, the data should be multivariate normal.

Why are they important? Take ANOVA as an example. ANOVA is a procedure of
comparing means in terms of variance with reference to a normal distribution. The
inventor of ANOVA, Sir R. A. Fisher (1935) clearly explained the relationship among the
mean, the variance, and the normal distribution: "The normal distribution has only two
characteristics, its mean and its variance. The mean determines the bias of our
estimate, and the variance determines its precision." (p.42) It is generally known that the
estimation is more precise as the variance becomes smaller and smaller.

Parametric statistics is a branch of statistics that assumes data has come from a type
of probability distribution and makes inferences about the parameters of the distribution.
Most well-known elementary statistical methods are parametric.[2]

Generally speaking, parametric methods make more assumptions than
non-parametric methods.[3] If those extra assumptions are correct, parametric methods can
produce more accurate and precise estimates. They are said to have more statistical
power. However, if those assumptions are incorrect, parametric methods can be very
misleading. For that reason they are often not considered robust. On the other hand,
parametric formulae are often simpler to write down and faster to compute. In some, but
definitely not all cases, their simplicity makes up for their non-robustness, especially if
care is taken to examine diagnostic statistics.[4]

Because parametric statistics require a probability distribution, they are not distribution-free.


Suppose we have a sample of 99 test scores with a mean of 100 and a standard
deviation of 10. If we assume all 99 test scores are random samples from a normal
distribution we predict there is a 1% chance that the 100th test score will be higher than
123.65 (that is the mean plus 2.365 standard deviations) assuming that the 100th test
score comes from the same distribution as the others. The normal family of distributions
all have the same shape and are parameterized by mean and standard deviation. That
means if you know the mean and standard deviation, and that the distribution is normal,
you know the probability of any future observation. Parametric statistical methods are
used to compute the 2.365 value above, given 99 independent observations from the
same normal distribution.

A non-parametric estimate of the same thing is the maximum of the first 99
scores. We don't need to assume anything about the distribution of test scores to
reason that before we gave the test it was equally likely that the highest score would be
any of the first 100. Thus there is a 1% chance that the 100th is higher than any of the
99 that preceded it.
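
The "equally likely" argument can be checked by enumeration on a smaller case. The Python sketch below uses 5 scores instead of 100 (an assumption made only to keep the enumeration small) and assumes no two scores are tied; it lists every possible ordering and counts how often the last score is the largest.

```python
from fractions import Fraction
from itertools import permutations

def prob_last_is_largest(n: int) -> Fraction:
    """Fraction of all orderings of n distinct scores in which the
    last score is the highest, treating every ordering as equally likely."""
    orderings = list(permutations(range(n)))
    hits = sum(1 for order in orderings if order[-1] == max(order))
    return Fraction(hits, len(orderings))

print(prob_last_is_largest(5))  # 1/5
```

The same reasoning gives 1/100, or 1%, for the 100th score, without any assumption about the shape of the distribution.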

Parametric vs. non-parametric tests

There are two types of test data and consequently different types of analysis. As
the table below shows, parametric data has an underlying normal distribution which
allows for more conclusions to be drawn as the shape can be mathematically
described. Anything else is non-parametric.

                                  Parametric                  Non-parametric
Assumed distribution              Normal                      Any
Assumed variance                  Homogeneous                 Any
Typical data                      Ratio or Interval           Ordinal or Nominal
Data set relationships            Independent                 Any
Usual central measure             Mean                        Median
Benefits                          Can draw more conclusions   Simplicity; less affected by outliers

Choosing                          Choosing parametric test              Choosing a non-parametric test
Correlation test                  Pearson                               Spearman
Independent measures, 2 groups    Independent-measures t-test           Mann-Whitney test
Independent measures, >2 groups   One-way, independent-measures ANOVA   Kruskal-Wallis test
Repeated measures, 2 conditions   Matched-pair t-test                   Wilcoxon test
Repeated measures, >2 conditions  One-way, repeated measures ANOVA      Friedman's test