
BIO 106

Learning Module
in

STATISTICAL BIOLOGY

Consolidated by:
Ms. Evangeline Joyce D. Jungay
Introduction

This module is intended for BS Biology first-year undergraduate students of
Batangas State University enrolled in this fundamental course during the
second semester, AY 2020-2021. The lecture component of Statistical
Biology provides students the opportunity to study the basic foundation on
statistical approaches to analyzing biological data. It focuses on the
pragmatic aspect rather than the detailed mathematical basis of a statistical
test. Using real biological data, students will be able to develop the
necessary skills in identifying statistical tests valuable for a particular
experiment, in analyzing, and interpreting the results.

Students may take ownership of their learning by studying this material at
their own convenient time outside of the class schedule, provided they
complete all the learning tasks assigned during the term.

This module is composed of 3 main topics, divided into 14 areas relevant to
the course. Each main topic is divided into 4 sections:

 Objectives
 Lesson Proper
 Learning Tasks
 References
Intended Learning Outcomes

Upon completion of the course, students should be able to:

1. describe some basic principles in statistics relevant to biological
data analyses, including data analysis steps, kinds of biological
variables, hypothesis testing and confounding variables;

2. identify some statistical tests for nominal variables;

3. explain the practical use of descriptive statistics in biological data
analyses;

4. compare different tests for one measurement and multiple
measurement variables; and

5. analyze some real biological data sets using statistics in
spreadsheets.
Topic I – Statistics Basics

Learning Objectives:

1. Describe some of the steps involved in data analyses.


2. Identify the different kinds of biological variables.
3. Explain the use of probability in analyzing biological data sets.

What is statistics?

Football scores, unemployment rates and lengths of hospital waiting
lists are statistics, but not what we commonly think of as being included in
the subject of statistics. More or less everything in this book is concerned
with trying to draw conclusions about very large groups of individuals
(animate or inanimate) when we can only study small samples of them. The
fact that we have to draw conclusions about large groups by studying only
small samples is the main reason that we use statistics in environmental
and biological science.

Suppose we select a small sample of individuals on which to carry
out a study. The questions we are trying to answer usually boil down to
these two:

• If I assume that the sample of individuals I have studied is representative
of the group they come from, what can I tell about the group as a whole?

• How confident can I be that the sample of individuals I have studied was
like the group as a whole?

These questions are central to the kind of statistical methods
described in this book and to most of those commonly used in practical
environmental or biological science. We are usually interested in a very
large group of individuals (e.g. bacteria in soil, ozone concentrations in the
air at some location which change moment by moment, or the yield of
wheat plants given a particular fertilizer treatment) but limited to studying a
small number of them because of time or resources.
Fortunately, if we select a sample of individuals in an appropriate
way and study them, we can usually get a very good idea about the rest of
the group. In fact, using small, representative samples is an excellent way
to study large groups and is the basis of most scientific research. Once we
have collected our data, our best estimate always has to be that the group
as a whole was just like the sample we studied; what other option do we
have? But in any scientific study, we cannot just assume this has to be
correct, we also need to use our data to say how confident we can be that
this is true. This is where statistics usually comes in.

We will look in more detail at how to interpret the various forms of
shorthand as we go through the different statistical techniques, but notice
that when the result is stated in full we have (i) a result for the whole group
of interest assuming that the samples studied were representative, and (ii)
a measure of confidence that the samples studied actually were
representative of the rest of the groups. This point is easy to lose sight of
when we start to look at different techniques.

Textbooks tend to emphasize differences between statistical
techniques so that you can see when to use each. However, these same
ideas lie behind nearly all of them. Statistical methods, in a wide variety of
disguises, aim to quantify both the effects we are studying (i.e. what the
samples showed), and the confidence we can have that what we observed
in our samples would also hold for the rest of the groups they were taken
from. If you can keep this fact in mind, you already understand the most
important point you need to know about statistics.

Steps in Data Analyses

1. Specify the biological question you are asking. (Ex. "Do the amino
acid polymorphisms at the Pgm locus have an effect on glycogen
content?" The biological question is usually something about
biological processes, often in the form "Does changing X cause a
change in Y?" You might want to know whether a drug changes blood
pressure; whether soil pH affects the growth of blueberry bushes).
2. Put the question in the form of a biological null hypothesis and
alternate hypothesis. (The biological null hypothesis is "Different
amino acid sequences do not affect the biochemical properties of
PGM, so glycogen content is not affected by PGM sequence." The
biological alternative hypothesis is "Different amino acid sequences
do affect the biochemical properties of PGM, so glycogen content is
affected by PGM sequence." By thinking about the biological null and
alternative hypotheses, you are making sure that your experiment will
give different results for different answers to your biological question).
3. Put the question in the form of a statistical null hypothesis and
alternate hypothesis. (The statistical null hypothesis is "Flies with
different sequences of the PGM enzyme have the same average
glycogen content." The alternate hypothesis is "Flies with different
sequences of PGM have different average glycogen contents." While
the biological null and alternative hypotheses are about biological
processes, the statistical null and alternative hypotheses are all about
the numbers; in this case, the glycogen contents are either the same
or different).
4. Determine which variables are relevant to the question.
5. Determine what kind of variable each one is.
6. Design an experiment that controls or randomizes the confounding
variables.
7. Based on the number of variables, the kinds of variables, the
expected fit to the parametric assumptions, and the hypothesis to be
tested, choose the best statistical test to use.
8. Do the experiment.
9. Examine the data to see if it meets the assumptions of the statistical
test you chose. If it doesn't, choose a more appropriate test.

10. Apply the statistical test you chose and interpret the results.

11. Communicate your results effectively, usually with a graph or table.

One important point for you to remember: "do the experiment" is step 8,
not step 1. You should do a lot of thinking, planning, and decision-
making before you do an experiment. If you do this, you'll have an
experiment that is easy to understand, easy to analyze and interpret,
answers the questions you're trying to answer, and is neither too big nor
too small. If you just slap together an experiment without thinking about
how you're going to do the statistics, you may end up needing more
complicated and obscure statistical tests, getting results that are difficult to
interpret and explain to others, and maybe using too many subjects (thus
wasting your resources) or too few subjects (thus wasting the whole
experiment).

Different Kinds of Variables

Variables refer to characteristics of persons or objects which can take
on different values or labels for different persons or objects under
consideration. For example, undergraduate major is a variable that can
take on values such as epidemiology, biostatistics, mathematics, science
and the like. Another example of variables include smoking habit, attitude
toward the boss, height, faculty ranks, and so on.

There are two types of variables, namely: the response variable and the
explanatory variable. A variable which is affected by the value of some
other variable is called a response variable. This may be continuous, ordinal
or nominal. In a regression setting, such variables are called dependent
variables or Y variables. An explanatory variable is a variable that is
thought to affect the values of the response variable. It is sometimes called
the independent variable or X variable in a regression setting. In this case,
the explanatory variable, like the response variable, may be continuous,
ordinal or nominal.

Variables are also classified into qualitative and quantitative
variables. A qualitative variable is one whose categories are simply used as
labels to distinguish one group from another, rather than as basis for saying
that one group is greater or less, higher or lower, or better or worse than
another. This variable has values that are intrinsically non-numeric. Clearly,
qualitative variables generally have either nominal or ordinal scales. For
examples: cause of death, nationality, race, gender, severity of pain and so
on.

Qualitative variables can be reassigned numeric values, but they
are still intrinsically qualitative. For instance, sex (male = 1, female = 0).
Some variables are always qualitative in nature. For example: occupation,
sex, disease status, cause of death, race, and the like.
A quantitative variable is one whose categories can be measured and
ordered according to quantity. These variables have values that are
intrinsically numeric. Number of children in a family and age are good
examples of quantitative variables. Both interval and ratio scales belong to
this classification.

Quantitative variables can be further divided into discrete and
continuous variables. A discrete variable is one whose set of possible values
is either finite or countably infinite, with values that can appear only
as whole numbers. Examples are number of missing teeth, number of
household members, and number of patients at hospital X, among others.
In a discrete variable, there are gaps between its possible values.
A continuous variable is one whose set of possible values includes
all values in an interval of the real line, so values can be expressed with
fractions or digits after a decimal point. Body mass index, blood pressure,
cholesterol levels, and height are just some examples of continuous
variables. There is no gap between possible values in a continuous
variable.

Some variables can be measured as either qualitative or quantitative
variables, depending on the objectives of the data collection. For instance, if
height is recorded using the categories such as short, medium and tall then
it is a qualitative variable. However, if the actual height measurement is
recorded then it is a quantitative variable.

The use of probability in analyzing biological data sets

Null hypothesis and P-values

If you carry out a statistical test by computer it is usually fairly simple
to obtain the answer, but you should also be clear about what question this
is the answer to. Statistical tests start with a statement called a null
hypothesis, which is always along the lines 'there is no difference between
the populations', or 'there is no relationship between the measurements'.
The null hypothesis is often given the symbol H0. The question the test is
really addressing is: how likely is it that the null hypothesis is true? And the
answer given by the test is usually in the form of a P-value (or probability).
The P-value given by the test tells us the probability of getting such a
result if the null hypothesis were true. A high P-value therefore tells us that
the null hypothesis could easily be true, so we should not conclude there is
a difference. A low P-value tells us that the null hypothesis is unlikely to be
true and we should therefore conclude that there probably is a difference. It
is easier to remember a simple rule:
• Low P-values indicate we can be confident there is a difference.
• High P-values indicate we cannot be confident there is a difference.
All statistical tests work this way round.
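
To make the rule concrete, here is a minimal sketch in Python (not part of the original text), assuming the SciPy library is available; the two groups of sample values are made up purely for illustration:

    # Minimal sketch: compare two hypothetical samples and interpret the P-value.
    from scipy import stats

    group_a = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3]   # hypothetical yields under treatment A
    group_b = [4.2, 4.5, 4.1, 4.7, 4.3, 4.4]   # hypothetical yields under treatment B

    # Two-sample t-test; H0 says the two populations have the same mean.
    t_stat, p_value = stats.ttest_ind(group_a, group_b)

    if p_value < 0.05:
        print(f"P = {p_value:.3f}: low, so we can be confident there is a difference.")
    else:
        print(f"P = {p_value:.3f}: high, so we cannot be confident there is a difference.")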

As stated above, the null hypothesis - a statement of what the test is
actually testing - is always something like 'there is no difference between
the groups we are comparing'. The exact format of the null hypothesis will
depend on the type of test we are using. We can also make a statement
about what else might be the case, which usually takes a form like 'there is
a difference'. This is called the alternative hypothesis and is often given the
symbol H1 or HA. Between them, the null hypothesis and alternative
hypothesis should cover all eventualities.

A typical experiment might be trying to discover whether adding
fertilizer will improve the yield a farmer gets from his crops. We might be
quite confident that it will, but we must also allow for the possibility that it
could reduce yield, otherwise we will bias the result of the test in favour of
finding what we expect or hope to find. We therefore need to use a test that
allows for both possibilities, a so-called two-sided or two-tailed test (i.e. a
test with a two-sided alternative hypothesis).

Our alternative hypothesis would therefore be 'fertilizer produces a
difference in yield', which allows for the possibilities that fertilizer either
improves or reduces yield. Together with the null hypothesis 'fertilizer does
not affect yield', all possibilities are therefore covered. If the test tells us
that we can conclude there is a difference, and the sample means show
that yield was greater in the fertilized treatment, then we can conclude that
fertilizer improved yield.
Learning Task:

1. What could result when one does not follow the steps in analyzing
data?
2. What are the different kinds of biological variables?
3. What is the significance of using probability in analyzing data?

Topic II – Designing an Experiment or Survey

Learning Objectives:

1. Define population, frequency distribution, mean, standard deviation or
variance.
2. Discuss hypothesis testing
3. Discuss exploratory data analysis and presentation.
4. Explain some of the common assumptions in statistical tests.

Very few environmental or biological studies are done solely for the
interest of the researcher; they are carried out to inform others. Therefore,
it is not just important that our results convince us, it is vital that they
convince others too, otherwise we have wasted our time. Following
accepted statistical procedures for design and analysis of experiments will
help us to achieve this.

Variability

Think of a group you might want to study, e.g. the lengths of fish in a large
lake. If all of these fish were the same length, you would only need to
measure one. You can probably accept that they are not all the same
length, just as people are not all the same height, not all volcanic lava flows
are the same temperature, and not all carrots have the same sugar
content. In fact, most characteristics we might want to study vary between
individuals.

If we measured the lengths of 100 fish, we could plot them on a graph.
When larger numbers of measurements are involved, it becomes
inconvenient to represent each individual, so a column graph or a line can
be used to show the shape of the distribution. These graphs are called
frequency distributions.

For some things we might measure, we would find different distributions
such as a lot of low values, some high values, and a few extremely high
values. These result in different shapes of graph. However, it turns out that
if we measure a set of naturally occurring lengths, concentrations, times,
temperatures, or whatever, and plot their distribution, very often we do get
a diagram with similar shape. This shape is called a Normal distribution.

Statisticians have derived a mathematical formula which, when plotted on a
graph, has the same shape. Being able to describe the distribution of
individual measurements using a mathematical formula turns out to be very
useful because, from only a few actual measurements, we can estimate
what other members of the population are likely to be like. This idea is the
basis of many statistical methods.

Samples and Population

Practical considerations almost always dictate that we study any group we
are interested in by making measurements or observations on a relatively
small sample of individuals. We call the group we are actually interested in
the population. A population in the statistical sense is fairly close to the
common meaning of the word, but can refer to things other than people,
and usually to some particular characteristic of them. Here are some
examples of statistical populations:

• The lengths of blue whales in the Arctic Ocean

• All momentary light intensities at some point in a forest

• Root lengths of rice plants of a particular variety grown under a specific
set of conditions

In the first example, the population is real but we are unlikely to be able to
study all of the whales in practice. Populations in the statistical sense,
however, need not be finite, or even exist in real life. In the second
example, the light intensity could be measured at any moment, but the
number of moments is infinite, so we could never obtain measurements at
every moment. In the third example, the population is just conceptual. We
really want to know about how rice plants of this variety in general would
grow under these conditions but we would have to infer this by growing a
limited number of rice plants under the specified conditions. Although the
few plants in our sample may be the only rice plants ever to be grown in
these conditions, we still consider them to be a sample representing rice
plants of this variety in general growing in these conditions.

It is often useful to be able to characterize a population in terms of a few
well-chosen statistics. These allow us to summarize possibly large
numbers of measurements in order to present results and also to compare
populations with one another.

Mean, variance, and standard deviation

If we want to describe a population it may sometimes be useful to present a
frequency distribution, but this is usually more information than is needed.

Two items are often sufficient:

• A measure which tells us what a 'typical' member of the population is like


• A measure which tells us about how spread out the other members of the
population are around this 'typical' member.

To represent a 'typical' member of the population, we usually use the mean
(all of our values added together then divided by the number of values). In
common usage people often refer to this as the 'average' but the term
'mean' is preferred in technical writing.

To express how spread out the individual values in a population are, we
usually use the standard deviation or variance; variance is simply the
standard deviation squared.

Standard deviation and variance

One way to characterize how spread out the values in a sample are would
be to calculate the difference between each measurement and the sample
mean, and then to calculate the mean of these differences. Here's an
example:

Sample value:            8  12  10   7   7  11   8     mean = 9.0
Difference from mean:    1   3   1   2   2   2   1     mean of differences = 12/7 ≈ 1.7

Had statistics been developed after computers became readily available,
this might be the measure of spread we commonly use; indeed some
recently developed statistical methods do use this. However, when
calculations were done by hand, it was found to be more convenient to use
the mean of the squares of the differences as a measure of spread; there
are mathematical shortcuts to getting the result in this case which are
useful when you have a large sample. Since a great deal of statistical
theory and tests built up around this, we still use it today.

For the above sample:

Sample value:                    8  12  10   7   7  11   8     mean = 9.0
Difference from mean:            1   3   1   2   2   2   1     mean of differences = 12/7 ≈ 1.7
Square of difference from mean:  1   9   1   4   4   4   1     mean of squared differences = 24/7 ≈ 3.4

When we take a random sample it may or may not include the largest and
smallest values in the population, yet these would both contribute the
largest squares of differences from the mean. Since they are not present in
all samples, on average the mean of the squares of differences is less
when it is calculated from a sample than if it was calculated for the
population as a whole.

However, what we really want is to estimate the spread of values in the
population, not in the sample itself, so we need to correct for this. This is
done by modifying the above calculation so that we divide not by the
number of values in our sample, but by one less than the number of values
in our sample. You can see from the example below that this corrects
things in the right direction. Some complex statistical theory shows that this
simple modification corrects the value by the right amount.

For the above sample again:

Sample value:                    8  12  10   7   7  11   8     mean = 9.0
Difference from mean:            1   3   1   2   2   2   1     mean of differences = 12/7 ≈ 1.7
Square of difference from mean:  1   9   1   4   4   4   1     corrected mean of squared differences = 24/6 = 4.0

This figure - the corrected mean of the squared differences - is called the
variance and is an unbiased estimate of the spread of values in the
population, calculated from a sample. Variance has units, e.g. if the
measurements had been in grams (g), the variance would be in units of
square grams (g²). The square root of the variance is called the standard
deviation. The standard deviation in the above example is √4.0 = 2.0, i.e. the
standard deviation is an alternative measure of the spread of values.
Standard deviation has the same units as the actual measurements, e.g. if
the measurements had been in grams, the standard deviation would also
be in grams.
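
The same arithmetic can be checked directly. Here is a minimal sketch in Python, using only the sample values from the worked example above (8, 12, 10, 7, 7, 11, 8):

    values = [8, 12, 10, 7, 7, 11, 8]
    n = len(values)

    mean = sum(values) / n                        # 9.0
    sq_diffs = [(x - mean) ** 2 for x in values]  # 1, 9, 1, 4, 4, 4, 1
    variance = sum(sq_diffs) / (n - 1)            # 24 / 6 = 4.0 (note the n - 1)
    std_dev = variance ** 0.5                     # 2.0, in the same units as the data

    print(mean, variance, std_dev)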

In describing a population, we might therefore say, 'The mean length of fish
in the lake was 32 cm with a standard deviation of 10 cm.' This tells us that
most (approximately 68%) of the fish had lengths in the range 22-42 cm.
Popular texts and the media often just give the mean with no measure of
spread, but as scientists we should recognize that both measures are
important. In a different lake the fish might have the same mean length but
a very different spread of values. This might have important scientific
implications. For example, if big fish eat little fish, the ecology of a lake with
a wide range of sizes may be very different to that in a lake where the fish
are all about the same size.

Hypothesis Testing

Hypothesis testing is a kind of statistical inference that involves asking a
question, collecting data, and then examining what the data tell us about
how to proceed. In a formal hypothesis test, hypotheses are always
statements about the population.

Developing Null and Alternative Hypotheses

In statistical hypothesis testing, there are always two hypotheses. The
hypothesis to be tested is called the null hypothesis and given the symbol
H0. The null hypothesis states that there is no difference between a
hypothesized population mean and a sample mean. It is the status quo
hypothesis.

For example, if we were to test the hypothesis that college freshmen study
20 hours per week, we would express our null hypothesis as:

H0 : µ = 20

We test the null hypothesis against an alternative hypothesis, which is
given the symbol Ha. The alternative hypothesis is often the hypothesis that
you believe yourself. It includes the outcomes not covered by the null
hypothesis. In this example, our alternative hypothesis would express that
freshmen do not study 20 hours per week:

Ha : µ ≠ 20

Example A

We have a medicine that is being manufactured and each pill is supposed
to have 14 milligrams of the active ingredient. What are our null and
alternative hypotheses?

Solution

H0 : µ = 14
Ha : µ ≠ 14

Our null hypothesis states that the population has a mean equal to 14
milligrams. Our alternative hypothesis states that the population has a
mean that is different from 14 milligrams.
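
As an illustration only, here is a minimal Python sketch of this test, assuming SciPy is available; the pill measurements below are hypothetical, not data from the module:

    from scipy import stats

    pill_mg = [13.8, 14.2, 13.9, 14.1, 13.7, 14.0, 13.6, 14.3]  # hypothetical measurements

    # One-sample t-test of H0: mu = 14 against Ha: mu != 14 (two-tailed by default).
    t_stat, p_value = stats.ttest_1samp(pill_mg, popmean=14)
    print(f"t = {t_stat:.2f}, P = {p_value:.3f}")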

Deciding Whether to Reject the Null Hypothesis: One and Two-Tailed
Hypothesis Tests

The alternative hypothesis can be supported only by rejecting the null
hypothesis. To reject the null hypothesis means to find a large enough
difference between your sample mean and the hypothesized (null) mean
that it raises real doubt that the true population mean is 20. If the difference
between the hypothesized mean and the sample mean is very large, we
reject the null hypothesis. If the difference is very small, we do not. In each
hypothesis test, we have to decide in advance what the magnitude of that
difference must be to allow us to reject the null hypothesis. Below is an
overview of this process. Notice that if we fail to find a large enough
difference to reject, we fail to reject the null hypothesis. Those are your only
two alternatives.

When a hypothesis is tested, a statistician must decide on how much of a
difference between means is necessary in order to reject the null
hypothesis.

Statisticians first choose a level of significance or alpha (α) level for their
hypothesis test. This alpha level tells us how improbable a sample mean
must be for it to be deemed "significantly different" from the hypothesized
mean. The most frequently used levels of significance are 0.05 and 0.01.
An alpha level of 0.05 means that we will consider our sample mean to be
significantly different from the hypothesized mean if the chances of
observing that sample mean are less than 5%. Similarly, an alpha level of
0.01 means that we will consider our sample mean to be significantly
different from the hypothesized mean if the chances of observing that
sample mean are less than 1%.

Two-tailed Hypothesis Tests

A hypothesis test can be one-tailed or two-tailed. The examples above are
all two-tailed hypothesis tests. We indicate that the average study time is
either 20 hours per week, or it is not. We do not specify whether we believe
the true mean to be higher or lower than the hypothesized mean. We just
believe it must be different.

In a two-tailed test, you will reject the null hypothesis if your sample mean
falls in either tail of the distribution. For this reason, the alpha level (let’s
assume .05) is split across the two tails. The curve below shows the critical
regions for a two-tailed test. These are the regions under the normal curve
that, together, sum to a probability of 0.05. Each tail has a probability of
0.025. The z-scores that designate the start of the critical region are called
the critical values.

If the sample mean taken from the population falls within these critical
regions, or "rejection regions," we would conclude that there was too much
of a difference and we would reject the null hypothesis. However, if the
mean from the sample falls in the middle of the distribution (in between the
critical regions) we would fail to reject the null hypothesis.
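
The critical values for a two-tailed test at α = 0.05 can be computed directly; here is a minimal sketch, assuming SciPy is available. A sample mean whose z-score falls below the lower value or above the upper value lands in a rejection region.

    from scipy.stats import norm

    alpha = 0.05
    lower_critical = norm.ppf(alpha / 2)      # about -1.96
    upper_critical = norm.ppf(1 - alpha / 2)  # about +1.96
    print(lower_critical, upper_critical)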
One-Tailed Hypothesis Test

We would use a single-tail hypothesis test when the direction of the results
is anticipated or we are only interested in one direction of the results. For
example, a single-tail hypothesis test may be used when evaluating
whether or not to adopt a new textbook. We would only decide to adopt the
textbook if it improved student achievement relative to the old textbook.

When performing a single-tail hypothesis test, our alternative hypothesis
looks a bit different. We use the symbols of greater than or less than. For
example, let’s say we were claiming that the average SAT score of
graduating seniors was GREATER than 1,100. Remember, our own
personal hypothesis is the alternative hypothesis. Then our null and
alternative hypothesis could look something like:

H0 : µ ≤ 1100

Ha : µ > 1100

In this scenario, our null hypothesis states that the mean SAT scores would
be less than or equal to 1,100 while the alternate hypothesis states that the
SAT scores would be greater than 1,100. A single-tail hypothesis test also
means that we have only one critical region because we put the entire
critical region into just one side of the distribution. When the alternative
hypothesis is that the sample mean is greater, the critical region is on the
right side of the distribution (see below). When the alternative hypothesis is
that the sample is smaller, the critical region is on the left side of the
distribution.
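
A minimal sketch of this one-tailed test in Python, assuming SciPy is available; the sample mean, standard deviation and sample size below are hypothetical:

    from math import sqrt
    from scipy.stats import norm

    mu0, sample_mean, sigma, n = 1100, 1128, 100, 50   # hypothetical sample figures
    z = (sample_mean - mu0) / (sigma / sqrt(n))        # test statistic

    critical_value = norm.ppf(0.95)   # about 1.645: the whole 5% sits in the right tail
    p_value = 1 - norm.cdf(z)         # one-sided P-value

    # Reject H0 only if z exceeds the one-sided critical value.
    print(f"z = {z:.2f}, critical value = {critical_value:.3f}, P = {p_value:.4f}")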

Type I and Type II Errors

Remember that there will be some sample means that are extremes – that
is going to happen about 5% of the time, since 95% of all sample means
fall within about two standard deviations of the mean. What happens if we
run a hypothesis test and we get an extreme sample mean? It won’t look
like our hypothesized mean, even if it comes from that distribution. We
would be likely to reject the null hypothesis. But we would be wrong.

When we decide to reject or not reject the null hypothesis, we have four
possible scenarios:

1. A true null hypothesis is rejected.

2. A true null hypothesis is not rejected.

3. A false null hypothesis is not rejected.

4. A false null hypothesis is rejected.

If the null hypothesis is true and we do not reject it (Option 2), or if a false
null hypothesis is rejected (Option 4), we have made the correct decision. But if
we reject a true null hypothesis (Option 1) or fail to reject a false null hypothesis
(Option 3), we have made an error. Overall, one type of error is not
necessarily more serious than the other. Which type is more serious
depends on the specific research situation, but ideally both types of errors
should be minimized during the analysis.

The Four Possible Outcomes in Hypothesis Testing

Decision Made                    Null Hypothesis is True    Null Hypothesis is False
Reject Null Hypothesis           Type I Error               Correct Decision
Do Not Reject Null Hypothesis    Correct Decision           Type II Error

The general approach to hypothesis testing focuses on the Type I error:
rejecting the null hypothesis when it may be true. Guess what? The level of
significance, also known as the alpha level, IS the probability of making a
Type I error. At the 0.05 level, the decision to reject the hypothesis may be
incorrect 5% of the time. Calculating the probability of making a Type II
error is not as straightforward as calculating a Type I error, and we won’t
discuss that here.

You should be able to recognize what each type of error looks like in a
particular hypothesis test. For example, suppose you are testing whether
listening to rock music helps you improve your memory of 30 random
objects. Assume further that it doesn’t. A Type I error would be concluding
that listening to rock music did help memory (but you are wrong). A Type I
error will only occur when your null hypothesis is true. Now let’s assume that
listening to rock music does improve memory. In this scenario, if you
concluded that it didn’t, you would be wrong again. But this time you would
be making a Type II error — failing to find a significant difference when one
in fact exists.

It is also important that you realize that the chance of making a Type I error
is under our direct control. Often we establish the alpha level based on the
severity of the consequences of making a Type I error. If the consequences
are not that serious, we could set an alpha level at 0.10 or 0.20. In other
words, we are comfortable making a decision where we could falsely reject
the null hypothesis 10 to 20% of the time. However, in a field like medical
research, we would set the alpha level very low (at 0.001 for example) if
there was potential bodily harm to patients.
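
The claim that the alpha level is the Type I error rate can be checked by simulation. Here is a minimal sketch, assuming NumPy and SciPy are available; the population mean, spread and sample size are arbitrary choices made for illustration:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    alpha, n_experiments, rejections = 0.05, 10_000, 0

    for _ in range(n_experiments):
        sample = rng.normal(loc=20, scale=5, size=30)   # H0 (mu = 20) is true here
        _, p = stats.ttest_1samp(sample, popmean=20)
        if p < alpha:
            rejections += 1

    print(rejections / n_experiments)   # close to 0.05, the Type I error rate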

Confounding Variables

Vogt (1993) defines confounding variables as variables that obscure the
effects of another variable. For the most part, confounding variables are
confounding because they serve to confuse and obfuscate both the
findings from the data, as well as the conclusions drawn from the study. In
other words, it becomes unclear whether the actual treatment caused the
effect, or the presence of the confounding variable influenced the outcome.
For a variable to be confounding it must (a) be associated with the
independent variable of interest, and (b) be directly associated with the
outcome or dependent variable.

A confounding variable is an outside influence that changes the effect of the
dependent and independent variables. This extraneous influence can
alter the outcome of an experimental design. Simply put, a confounding
variable is an extra variable entered into the equation that was not
accounted for. Confounding variables can ruin an experiment and produce
useless results. They suggest that there are correlations when there really
are not. In an experiment, the independent variable generally has an effect
on the dependent variable.

For example, consider one instructor involving his students in a program
lasting two weeks while another instructor involves her students in a
program that lasts three days. Utilizing an outcome-based evaluation
scheme comparing how much each group learned might be misleading
because the two independent variables (length of program, and instructor
effectiveness) would be confounded. One of these variables could obscure
the effect of the other, and yet both may be directly related to the outcome
variable (in this case, learning).
As another example, if you are researching whether a lack of exercise has an
effect on weight gain, the lack of exercise is the independent variable and
weight gain is the dependent variable. A confounding variable would be any
other influence that has an effect on weight gain. Amount of food
consumption is a confounding variable, a placebo is a confounding
variable, or weather could be a confounding variable. Each may change the
effect of the experiment design.

Another example is the relationship between the force applied to a ball and
the distance the ball travels. The natural prediction would be that the ball
given the most force would travel furthest. However, if the confounding
variable is a downward slanted piece of wood to help propel the ball, the
results would be dramatically different. The slanted wood is the
confounding variable that changes the outcome of the experiment.

Numerous variables can confound or confuse the findings from a research
study. Validity issues such as whether the treatment or experience actually
created the findings often become more obscure when faced with the
presence of one or more of these confounding variables. Understanding
what these variables are, and how they influence the findings, can be
helpful in dealing with this issue and finding possible solutions.

In order to reduce confounding variables, make sure all the confounding
variables are identified in the study. Make a list of every variable you can
think of and consider, one by one, whether each listed item might influence the
outcome of the study. Understanding the confounding variables will result
in more accurate results.

Exploratory Data Analysis

Exploratory data analysis or “EDA” is a critical first step in analyzing the
data from an experiment. Here are the main reasons we use EDA:
• detection of mistakes
• checking of assumptions
• preliminary selection of appropriate models
• determining relationships among the explanatory variables, and
• assessing the direction and rough size of relationships between
explanatory and outcome variables.

Typical data format and the types of EDA

The data from an experiment are generally collected into a rectangular
array (e.g., spreadsheet or database),
most commonly with one row per experimental subject and one column for
each subject identifier, outcome variable, and explanatory variable. Each
column contains the numeric values for a particular quantitative variable or
the levels for a categorical variable. (Some more complicated experiments
require a more complex data layout.)
People are not very good at looking at a column of numbers or a
whole spreadsheet and then determining important characteristics of the
data. They find looking at numbers to be tedious, boring, and/or
overwhelming. Exploratory data analysis techniques have been devised as
an aid in this situation. Most of these techniques work in part by hiding
certain aspects of the data while making other aspects more clear.
Exploratory data analysis is generally cross-classified in two ways.
First, each method is either non-graphical or graphical. And second, each
method is either univariate or multivariate (usually just bivariate).
Non-graphical methods generally involve calculation of summary
statistics, while graphical methods obviously summarize the data in a
diagrammatic or pictorial way. Univariate methods look at one variable
(data column) at a time, while multivariate methods look at two or more
variables at a time to explore relationships. Usually our multivariate EDA
will be bivariate (looking at exactly two variables), but occasionally it will
involve three or more variables. It is almost always a good idea to perform
univariate EDA on each of the components of a multivariate EDA before
performing the multivariate EDA.
Beyond the four categories created by the above cross-classification,
each of the categories of EDA have further divisions based on the role
(outcome or explanatory) and type (categorical or quantitative) of the
variable(s) being examined.
The four types of EDA are univariate non-graphical, multivariate
nongraphical, univariate graphical, and multivariate graphical.

Univariate non-graphical EDA


The data that come from making a particular measurement on all of
the subjects in a sample represent our observations for a single
characteristic such as age, gender, speed at a task, or response to a
stimulus. We should think of these measurements as representing a
“sample distribution” of the variable, which in turn more or less represents
the “population distribution” of the variable. The usual goal of univariate
non-graphical EDA is to better appreciate the “sample distribution” and also
to make some tentative conclusions about what population distribution(s)
is/are compatible with the sample distribution.

Categorical data
The characteristics of interest for a categorical variable are simply the
range of values and the frequency (or relative frequency) of occurrence for
each value. (For ordinal variables it is sometimes appropriate to treat them
as quantitative variables using the techniques in the second part of this
section.) Therefore the only useful univariate non-graphical technique for
categorical variables is some form of tabulation of the frequencies, usually
along with calculation of the fraction (or percent) of data that falls in each
category.

Categorical data represent the distribution of samples into several
mutually exclusive categories, which usually involves counting how many
objects are in each qualitative category. The category percentages
typically sum to 100%. For example, the percentage of female and male
students in a class could be 51% and 49%. Categorical data is commonly
found in the study of genetics (Hartl and Jones, 2005; Klug and Cummings,
2006). In fact, the categorical data Mendel generated from his work on
peas helped to define genetic inheritance. A simple tabulation of the
frequency of each category is the best univariate non-graphical EDA for
categorical data.

Characteristics of quantitative data


Univariate EDA for a quantitative variable is a way to make
preliminary assessments about the population distribution of the variable
using the data of the observed sample.
If the quantitative variable does not have too many distinct values, a
tabulation, as we used for categorical data, will be a worthwhile univariate,
non-graphical technique. But mostly, for quantitative variables we are
concerned here with the quantitative numeric (non-graphical) measures
which are the various sample statistics. In fact, sample statistics are
generally thought of as estimates of the corresponding population
parameters.

Central tendency
The central tendency or “location” of a distribution has to do with
typical or middle values. The common, useful measures of central tendency
are the statistics called (arithmetic) mean, median, and sometimes mode.
The arithmetic mean is simply the sum of all of the data values
divided by the number of values. It can be thought of as how much each
subject gets in a “fair” re-division of whatever the data are measuring. For
instance, the mean amount of money that a group of people have is the
amount each would get if all of the money were put in one “pot”, and then
the money was redistributed to all people evenly. For any symmetrically
shaped distribution (i.e., one with a symmetric histogram or pdf or pmf) the
mean is the point around which the symmetry holds. For non-symmetric
distributions, the mean is the “balance point”.
The median is another measure of central tendency. The sample
median is the middle value after all of the values are put in an ordered list.
If there are an even number of values, take the average of the two middle
values. (If there are ties at the middle, some special adjustments are made
by the statistical software we will use. In unusual situations for discrete
random variables, there may not be a unique median.)
For symmetric distributions, the mean and the median coincide. For
unimodal skewed (asymmetric) distributions, the mean is farther in the
direction of the “pulled out tail” of the distribution than the median is.
Therefore, for many cases of skewed distributions, the median is preferred
as a measure of central tendency. For example, according to the US
Census Bureau 2004 Economic Survey, the median income of US families,
which represents the income above and below which half of families fall,
was $43,318. This seems a better measure of central tendency than the
mean of $60,828, which indicates how much each family would have if we
all shared equally. And the difference between these two numbers is quite
substantial. Nevertheless, both numbers are “correct”, as long as you
understand their meanings.
The median has a very special property called robustness. A sample
statistic is “robust” if moving some data tends not to change the value of
the statistic. The median is highly robust, because you can move nearly all
of the upper half and/or lower half of the data values any distance away
from the median without changing the median. More practically, a few very
high values or very low values usually have no effect on the median.
A rarely used measure of central tendency is the mode, which is the
most likely or frequently occurring value. More commonly we simply use
the term “mode” when describing whether a distribution has a single peak
(unimodal) or two or more peaks (bimodal or multi-modal). In symmetric,
unimodal distributions, the mode equals both the mean and the median. In
unimodal, skewed distributions the mode is on the other side of the median
from the mean. In multi-modal distributions there is either no unique highest
mode, or the highest mode may well be unrepresentative of the central
tendency.
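
A minimal sketch using Python's standard statistics module; the data values are hypothetical and deliberately skewed by one extreme value:

    import statistics

    data = [2, 3, 3, 4, 5, 6, 7, 7, 7, 30]   # hypothetical, right-skewed values

    print(statistics.mean(data))     # pulled toward the extreme value
    print(statistics.median(data))   # robust: barely affected by the extreme value
    print(statistics.mode(data))     # most frequently occurring value (7)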

Spread
Several statistics are commonly used as a measure of the spread of
a distribution, including variance, standard deviation, and interquartile
range. Spread is an indicator of how far away from the center we are still
likely to find data values.
The variance is a standard measure of spread. It is calculated for a
list of numbers, e.g., the n observations of a particular measurement
labeled x1 through xn, based on the n sample deviations (or just
“deviations”). The variance of a population is defined as the mean squared
deviation. The sample formula for the variance of observed data
conventionally has n−1 in the denominator instead of n to achieve the
property of “unbiasedness”, which roughly means that when calculated for
many different random samples from the same population, the average
should match the corresponding population quantity. The most commonly
used symbol for sample variance is s2, which is essentially the average of
the squared deviations, except for dividing by n − 1 instead of n. This is a
measure of spread, because the bigger the deviations from the mean, the
bigger the variance gets. (In most cases, squaring is better than taking the
absolute value because it puts special emphasis on highly deviant values.)
Because of the square, variances are always non-negative, and they
have the somewhat unusual property of having squared units compared to
the original data. So if the random variable of interest is a temperature in
degrees, the variance has units “degrees squared”, and if the variable is
area in square kilometers, the variance is in units of “kilometers to the
fourth power”.
The standard deviation is simply the square root of the variance.
Therefore, it has the same units as the original data, which helps make it
more interpretable. The sample standard deviation is usually represented
by the symbol s.
The variance and standard deviation are two useful measures of
spread. The variance is the mean of the squares of the individual
deviations. The standard deviation is the square root of the variance. For
Normally distributed data, approximately 95% of the values lie within 2 sd
of the mean.
A third measure of spread is the interquartile range. To define IQR,
we first need to define the concepts of quartiles. The quartiles of a
population or a sample are the three values which divide the distribution or
observed data into even fourths. So one quarter of the data fall below the
first quartile, usually written Q1; one half fall below the second quartile
(Q2); and three fourths fall below the third quartile (Q3). The astute reader
will realize that half of the values fall above Q2, one quarter fall above Q3,
and also that Q2 is a synonym for the median. Once the quartiles are
defined, it is easy to define the IQR as IQR = Q3 − Q1. By definition, half of
the values (and specifically the middle half) fall within an interval whose
width equals the IQR. If the data are more spread out, then the IQR tends
to increase, and vice versa.
The IQR is a more robust measure of spread than the variance or
standard deviation. Any number of values in the top or bottom quarters of
the data can be moved any distance from the median without affecting the
IQR at all.
In contrast to the IQR, the range of the data is not very robust at all.
The range of a sample is the distance from the minimum value to the
maximum value: range = maximum - minimum. If you collect repeated
samples from a population, the minimum, maximum and range tend to
change drastically from sample to sample, while the variance and standard
deviation change less, and the IQR least of all.
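
A minimal sketch comparing these measures of spread, assuming NumPy is available; the data values are hypothetical and include one extreme value:

    import numpy as np

    data = np.array([8, 12, 10, 7, 7, 11, 8, 9, 10, 35])   # hypothetical values

    variance = data.var(ddof=1)             # sample variance (n - 1 in the denominator)
    std_dev = data.std(ddof=1)              # same units as the data
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1                           # robust: little affected by the extreme value
    data_range = data.max() - data.min()    # not robust at all

    print(variance, std_dev, iqr, data_range)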

Skewness and kurtosis


Two additional useful univariate descriptors are the skewness and
kurtosis of a distribution. Skewness is a measure of asymmetry. Kurtosis is
a measure of “peakedness” relative to a Gaussian shape. Sample
estimates of skewness and kurtosis are taken as estimates of the
corresponding population parameters. If the sample skewness and kurtosis
are calculated along with their standard errors, we can roughly make
conclusions according to the following table where e is an estimate of
skewness and u is an estimate of kurtosis, and SE(e) and SE(u) are the
corresponding standard errors.
Skewness (e) or kurtosis (u)       Conclusion
−2SE(e) < e < 2SE(e)               not skewed
e ≤ −2SE(e)                        negative skew
e ≥ 2SE(e)                         positive skew
−2SE(u) < u < 2SE(u)               not kurtotic
u ≤ −2SE(u)                        negative kurtosis
u ≥ 2SE(u)                         positive kurtosis
For a positive skew, values far above the mode are more common
than values far below, and the reverse is true for a negative skew. When a
sample (or distribution) has positive kurtosis, then compared to a Gaussian
distribution with the same variance or standard deviation, values far from
the mean (or median or mode) are more likely, and the shape of the
histogram is peaked in the middle, but with fatter tails. For a negative
kurtosis, the peak is sometimes described as having “broader shoulders”
than a Gaussian shape, and the tails are thinner, so that extreme values
are less likely.
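
A minimal sketch, assuming SciPy and NumPy are available; note that the rough standard errors sqrt(6/n) for skewness and sqrt(24/n) for kurtosis are common large-sample approximations and are not taken from the text above:

    from math import sqrt
    import numpy as np
    from scipy.stats import skew, kurtosis

    rng = np.random.default_rng(1)
    data = rng.exponential(scale=2.0, size=200)   # hypothetical right-skewed data

    n = len(data)
    e, u = skew(data), kurtosis(data)             # kurtosis() returns excess kurtosis
    se_e, se_u = sqrt(6 / n), sqrt(24 / n)        # approximate standard errors

    print("positive skew" if e >= 2 * se_e else "not clearly skewed")
    print("positive kurtosis" if u >= 2 * se_u else "not clearly kurtotic")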

The Gaussian shape – the graph of a Gaussian is a characteristic
symmetric "bell curve" shape.

Univariate graphical EDA


While the non-graphical methods are quantitative and objective, they
do not give a full picture of the data; therefore, graphical methods, which
are more qualitative and involve a degree of subjective analysis, are also
required.
Histograms
The only one of these techniques that makes sense for categorical
data is the histogram (basically just a barplot of the tabulation of the data).
A pie chart is equivalent, but not often used. The concepts of central
tendency, spread and skew have no meaning for nominal categorical data.
For ordinal categorical data, it sometimes makes sense to treat the data as
quantitative for EDA purposes; you need to use your judgment here.
The most basic graph is the histogram, which is a barplot in which
each bar represents the frequency (count) or proportion (count/total count)
of cases for a range of values. Typically the bars run vertically with the
count (or proportion) axis running vertically. To manually construct a
histogram, define the range of data for each bar (called a bin), count how
many cases fall in each bin, and draw the bars high enough to indicate the
count.
Besides getting the general impression of the shape of the
distribution, you can read off facts like “there are two cases with data
values between 1 and 2” and “there are 9 cases with data values between
2 and 3”. Generally values that fall exactly on the boundary between two
bins are put in the lower bin, but this rule is not always followed.
Generally you will choose between about 5 and 30 bins, depending
on the amount of data and the shape of the distribution. Of course you
need to see the histogram to know the shape of the distribution, so this
may be an iterative process. It is often worthwhile to try a few different bin
sizes/numbers because, especially with small samples, there may
sometimes be a different shape to the histogram when the bin size
changes. But usually the difference is small. Histograms are one of the best
ways to quickly learn a lot about your data, including central tendency,
spread, modality, shape and outliers.
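
A minimal sketch of a histogram, assuming NumPy and Matplotlib are available; the 100 "fish lengths" are simulated rather than real data:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    lengths = rng.normal(loc=32, scale=10, size=100)   # simulated fish lengths (cm)

    plt.hist(lengths, bins=10)   # try a few different bin counts
    plt.xlabel("Length (cm)")
    plt.ylabel("Frequency (count)")
    plt.show()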
Stem-and-leaf plots
A simple substitute for a histogram is a stem and leaf plot. A stem
and leaf plot is sometimes easier to make by hand than a histogram, and it
tends not to hide any information. Nevertheless, a histogram is generally
considered better for appreciating the shape of a sample distribution than is
the stem and leaf plot. A stem and leaf plot shows all data values and the
shape of the distribution.

Boxplots
Another very useful univariate graphical technique is the boxplot.
Boxplots are very good at presenting information about the central
tendency, symmetry and skew, as well as outliers, although they can be
misleading about some aspects. One of the best uses of boxplots is in the
form of side-by-side boxplots.

Important: The term “outlier” is not well defined in statistics, and the
definition varies depending on the purpose and situation. The “outliers”
identified by a boxplot, which could be called “boxplot outliers” are defined
as any points more than 1.5 IQRs above Q3 or more than 1.5 IQRs below
Q1. This does not by itself indicate a problem with those data points.
Boxplots are an exploratory technique, and you should consider
designation as a boxplot outlier as just a suggestion that the points might
be mistakes or otherwise unusual. Also, points not designated as boxplot
outliers may also be mistakes. It is also important to realize that the number
of boxplot outliers depends strongly on the size of the sample. For data that
is perfectly Normally distributed, we expect 0.70 percent (or about 1 in 150
cases) to be “boxplot outliers”, with approximately half in either direction.
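
The 1.5 IQR rule described above can be applied directly; here is a minimal sketch, assuming NumPy is available and using hypothetical data with one extreme value:

    import numpy as np

    data = np.array([8, 9, 10, 10, 11, 11, 12, 13, 13, 14, 40])   # hypothetical values

    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    outliers = data[(data < lower_fence) | (data > upper_fence)]
    print(outliers)   # flags the extreme value as a "boxplot outlier"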

The term fat tails is used to describe the situation where a histogram
has a lot of values far from the mean relative to a Gaussian distribution.
This corresponds to positive kurtosis. In a boxplot, many outliers (more
than the 1/150 expected for a Normal distribution) suggests fat tails
(positive kurtosis), or possibly many data entry errors. Boxplots are
excellent EDA plots because they rely on robust statistics like median and
IQR rather than more sensitive ones such as mean and standard deviation.
With boxplots it is easy to compare distributions with a high degree of
reliability because of the use of these robust statistics.

Quantile-normal plots

The final univariate graphical EDA technique is called the quantile-
normal or QN plot or, more generally, the quantile-quantile or QQ plot. It is
used to see how well a particular sample follows a particular theoretical
distribution. Many statistical tests have the assumption that the outcome for
any fixed set of values of the explanatory variables is approximately
normally distributed, and that is why QN plots are useful: if the assumption
is grossly violated, the p-value and confidence intervals of those tests are
wrong. As we will see in the ANOVA and regression chapters, the most
important situation where we use a QN plot is not for EDA, but for
examining something called “residuals”.
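
A minimal sketch of a quantile-normal plot, assuming SciPy and Matplotlib are available; the data are simulated from a Normal distribution:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(3)
    data = rng.normal(loc=0, scale=1, size=100)   # simulated, approximately Normal data

    stats.probplot(data, dist="norm", plot=plt)   # points close to the line suggest normality
    plt.show()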
Multivariate non-graphical EDA

Multivariate non-graphical EDA techniques generally show the
relationship between two or more variables in the form of either cross-
tabulation or statistics.

Cross-tabulation
For categorical data (and quantitative data with only a few different
values) an extension of tabulation called cross-tabulation is very useful. For
two variables, cross-tabulation is performed by making a two-way table with
column headings that match the levels of one variable and row headings
that match the levels of the other variable, then filling in the counts of all
subjects that share a pair of levels. The two variables might be both
explanatory, both outcome, or one of each. Depending on the goals, row
percentages (which add to 100% for each row), column percentages (which
add to 100% for each column) and/or cell percentages (which add to 100%
over all cells) are also useful. Cross-tabulation is the basic bivariate non-
graphical EDA technique.
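
A minimal sketch of a cross-tabulation, assuming the pandas library is available; the categorical values are hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "sex":     ["M", "F", "F", "M", "F", "M", "F", "M"],
        "outcome": ["healthy", "sick", "healthy", "healthy",
                    "sick", "sick", "healthy", "healthy"],
    })

    counts = pd.crosstab(df["sex"], df["outcome"])                             # counts per pair of levels
    row_pct = pd.crosstab(df["sex"], df["outcome"], normalize="index") * 100   # row percentages
    print(counts)
    print(row_pct)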

Correlation and covariance


For two quantitative variables, the basic statistics of interest are the
sample covariance and/or sample correlation. The sample covariance is a
measure of how much two variables “co-vary”, i.e., how much (and in what
direction) should we expect one variable to change when the other
changes. Positive covariance values suggest that when one measurement
is above the mean the other will probably also be above the mean, and vice
versa. Negative covariances suggest that when one variable is above its
mean, the other is below its mean. And covariances near zero suggest that
the two variables vary independently of each other.
Covariances tend to be hard to interpret, so we often use correlation
instead. The correlation has the nice property that it is always between -1
and +1, with -1 being a “perfect” negative linear correlation, +1 being a
perfect positive linear correlation and 0 indicating that X and Y are
uncorrelated. The correlation between two random variables is a number
that runs from -1 through 0 to +1 and indicates a strong inverse
relationship, no relationship, and a strong direct relationship, respectively.
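
A minimal sketch, assuming NumPy is available; the paired measurements are hypothetical:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])   # roughly increases with x

    covariance = np.cov(x, y)[0, 1]        # the sign gives the direction of co-variation
    correlation = np.corrcoef(x, y)[0, 1]  # always between -1 and +1
    print(covariance, correlation)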

Multivariate graphical EDA

There are few useful techniques for graphical EDA of two categorical
random variables. The only one used commonly is a grouped barplot with
each group representing one level of one of the variables and each bar
within a group representing the levels of the other variable.

Scatterplots
For two quantitative variables, the basic graphical EDA technique is
the scatterplot which has one variable on the x-axis, one on the y-axis and
a point for each case in your dataset. If one variable is explanatory and the
other is outcome, it is a very, very strong convention to put the outcome on
the y (vertical) axis.
One or two additional categorical variables can be accommodated on
the scatterplot by encoding the additional information in the symbol type
and/or color.

You should always perform appropriate EDA before further analysis of your data.
Perform whatever steps are necessary to become more familiar with your data, and
check for obvious mistakes. EDA is not an exact science – it is a very important
art!
Parametric tests and non-parametric tests

Parametric tests involve fairly restrictive assumptions, usually that the


populations being compared have Normal distributions and equal
variances. They include t-tests, ANOVA and regression. However,
because these conditions are often met by the kinds of populations studied
in environmental and biological sciences, they are the most frequently
used.
Non-parametric tests, also known as distribution-free tests, have
been devised to cope with situations where the populations are not
Normally distributed, or where their distribution is not known. Despite the name, most of
them have some assumptions about the shape of the distribution, but they
are less stringent, e.g. one test requires that the distributions must be the
same shape but not necessarily Normal.
It might seem sensible to use non-parametric tests all the time to
avoid the risk of making an incorrect assumption. However, just as you or I
would make use of the information that two samples came from Normal
distributions, when comparing them, so does a parametric test. Non-
parametric tests have only the values in the samples themselves to go on
and generally produce less conclusive results. Parametric tests are
therefore more powerful than their non-parametric equivalents and should
be used whenever the assumptions of the test are valid.

Common assumptions

Independent random samples


These are assumptions about the way you have collected the data. In
effect, the tests assume you have collected samples that are truly
representative of the population.

Independent measurements or observations


It is assumed that each individual in the sample would behave in the
same way, regardless of how the other members of the sample behave. If
we want to know how long it takes an average sheep to find its way out of a
maze, we might choose to study 10 sheep, i.e. 10 replicates. We would
need to test each sheep on its own because if we put them all in together
they will behave as a flock. In that case, if any one of the 10 found its way
out, the rest would probably all follow. Statistical tests assume that no
member of a sample is influenced in its behaviour by the others. If we want to
study the behaviour of sheep in flocks, we need to have independent
replicate flocks, not treat the individual sheep within a flock as replicates.

Random sampling
The individuals or individual points in a sample should be selected by some
random process (e.g. a series of random numbers from a computer) in
such a way that every individual or point in the population has an equal
chance of being selected. Even if we do this, we might, purely by chance, get mostly
unusually large or unusually small values in the sample. The tests assume samples
have been selected in this way, and the probabilities given by the tests allow for
this possibility.

Normal distributions
The Normal distributions assumption relates to the distributions of the
populations being studied, not the samples themselves. For us to accept this
assumption, it must be reasonable on theoretical grounds, i.e. we must expect that
values will be concentrated symmetrically round some mean value, and any
previous research should not contradict this.
The distribution of values in the samples should also appear approximately
Normal. Many computer packages will draw a frequency distribution for you (the
function is sometimes called Histogram). An advantage of this is that it gives you a
picture with which to argue your case for accepting or rejecting the idea that the
data come from a Normal distribution.

Equal variance
Equal variance is also sometimes referred to as homogeneous
variance, stable variance, constant variance, or homoscedasticity. What it
means is that to give accurate results, statistical tests often require the
'spread', technically the variance, of individual values to be the same in
each of the populations we are comparing. As for the assumption of
Normality, if we are to accept this assumption it must be reasonable on
theoretical grounds and not contradicted by the data in the samples.
Unequal variance occurs quite commonly because groups with high values
tend to have more spread in their values than groups with low values. For
example, unfertilized plants might range in height from 10 to 15 cm,
whereas in a fertilized treatment most are between 20 and 30 cm. Not only
is the mean greater in the fertilized treatment, but also the spread of
values.

Learning Task:
1. What is hypothesis testing?
2. Discuss at least two examples of exploratory data analysis and
presentation.
3. Explain some of the common assumptions in statistical tests.
III- Statistical Methods

Learning Objectives:
1. Evaluate selected tests for nominal variables including exact test of
goodness-of-fit, power analysis, Chi-square tests of goodness-of-fit,
and Fisher’s exact test.
2. Evaluate some examples of descriptive statistics including measures
of central tendency, dispersion, standard error and confidence limits.
3. Evaluate selected tests for one measurement variables including one-
sample t-test, two-sample t-test, independence, normality.

The main goal of a statistical test is to answer the question, “What is the
probability of getting a result like my observed data, if the null hypothesis
were true?” If it is very unlikely to get the observed data under the null
hypothesis, you reject the null hypothesis. Most statistical tests take the
following form:

1. Collect the data.

2. Calculate a number, the test statistic, that measures how far the
observed data deviate from the expectation under the null hypothesis.

3. Use a mathematical function to estimate the probability of getting a test


statistic as extreme as the one you observed, if the null hypothesis were
true. This is the P value.

The Null Hypothesis

• For a two-tailed test, which is what you almost always should use:

• The null hypothesis is that the number of observations in each
category is equal to that predicted by a biological theory.

• The alternative hypothesis is that the observed data are
different from the expected.

• For a one-tailed test:

• The null hypothesis is that the observed number for one
category is equal to or less than the expected.

• The alternative hypothesis is that the observed number in that
category is greater than expected.

Exact Test of Goodness-Of-Fit

• Exact tests, such as the exact test of goodness-of-fit, are different.


There is no test statistic.

• Directly calculate the probability of obtaining the observed data under


the null hypothesis. This is because the predictions of the null
hypothesis are so simple that the probabilities can easily be
calculated.

• You use the exact test of goodness-of-fit when you have one nominal
variable, you want to see whether the number of observations in each
category fits a theoretical expectation, and the sample size is small.

Exact Binomial Test

The most common use is a nominal variable with only two values (such as
male or female, left or right, green or yellow), in which case the test may be
called the exact binomial test.

You compare the observed data with the expected data, which are some
kind of theoretical expectation (such as a 1:1 sex ratio or a 3:1 ratio in a
genetic cross) that you determined before you collected the data.

Exact Binomial Test Study/Example

• Yukilevich and True (2008) mixed 30 male and 30 female Drosophila


melanogaster from Alabama with 30 male and 30 females from
Grand Bahama Island.
• They observed 246 matings; 140 were homotypic (male and female
from the same location), while 106 were heterotypic (male and female
from different locations).

• The null hypothesis is that the flies mate at random, so that there
should be equal numbers of homotypic and heterotypic matings.

• There were significantly more homotypic matings (exact binomial test,


P=0.035) than heterotypic.
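In R, the same analysis can be reproduced with the built-in binom.test() function (140 homotypic matings out of 246, with a null proportion of 0.5).

binom.test(x = 140, n = 246, p = 0.5)
# The two-tailed p-value should come out at about 0.035, as reported above.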

Sign Test

• One common application of the exact binomial test is known as the


sign test.

• You use the sign test when there are two nominal variables and one
measurement variable. One of the nominal variables has only two
values, such as “before” and “after” or “left” and “right,” and the other
nominal variable identifies the pairs of observations.

Ex: a. In a study of a hair-growth ointment, “amount of hair” would be the


measurement variable.

b. “before” and “after” would be the values of one nominal variable.

c. “Arnold,” “Bob,” “Charles” would be values of the second nominal


variable.

• The data for a sign test usually could be analyzed using a paired
t–test if the null hypothesis is that the mean or median difference
between pairs of observations is zero.

• However, sometimes you’re not interested in the size of the


difference, just the direction. In the hair-growth example, you might
have decided that you didn’t care how much hair the men grew or
lost, you just wanted to know whether more than half of the men grew
hair.

• In that case, you count the number of differences in one direction,


count the number of differences in the opposite direction, and use the
exact binomial test to see whether the numbers are different from a
1:1 ratio.

Sign Test Study/Example

• As an example of the sign test, Farrell et al. (2001) estimated the


evolutionary tree of two subfamilies of beetles that burrow inside
trees as adults.

• They found ten pairs of sister groups in which one group of related
species, or “clade,” fed on angiosperms and one fed on
gymnosperms, and they counted the number of species in each clade

• There are two nominal variables, food source (angiosperms or


gymnosperms) and pair of clades (Corthylina vs. Pityophthorus, etc.)

• One measurement variable, which is the number of species per


clade.

• Applying a sign test, there are 10 pairs of clades in which the


angiosperm-specialized clade has more species, and 0 pairs with
more species in the gymnosperm-specialized clade; this is
significantly different from the null expectation (P=0.002), and you
can reject the null hypothesis and conclude that in these beetles,
clades that feed on angiosperms tend to have more species than
clades that feed on gymnosperms.
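Because the sign test reduces to an exact binomial test on the counts of positive and negative differences, the Farrell et al. (2001) result can be checked in R with binom.test().

binom.test(x = 10, n = 10, p = 0.5)   # 10 of 10 pairs favor the angiosperm clade
# Two-tailed p = 2 * (0.5^10), about 0.002, matching the value reported above.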

Exact multinomial test

• While the most common use of exact tests of goodness-of-fit is the


exact binomial test, it is also possible to perform exact multinomial
tests when there are more than two values of the nominal variable.

• The basic procedure is the same as for the exact binomial test: you
calculate the probabilities of the observed result and all more extreme
possible results and add them together.
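A minimal sketch of this procedure in R, using hypothetical counts for three categories and a 1:1:1 null ratio, is shown below; it enumerates every possible outcome with dmultinom() and sums the probabilities of those no more likely than the observed one.

observed <- c(8, 3, 1)        # hypothetical counts in three categories
n  <- sum(observed)
p0 <- rep(1/3, 3)             # null-hypothesis proportions (1:1:1)

grid <- expand.grid(x1 = 0:n, x2 = 0:n)      # enumerate all possible outcomes
grid$x3 <- n - grid$x1 - grid$x2
grid <- grid[grid$x3 >= 0, ]

probs <- apply(grid, 1, function(x) dmultinom(x, prob = p0))  # null probability of each outcome

p_obs <- dmultinom(observed, prob = p0)
sum(probs[probs <= p_obs + 1e-12])           # exact multinomial p-value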
Assumptions

Goodness-of-fit tests assume that the individual observations are


independent, meaning that the value of one observation does not influence
the value of other observations.

To give an example, let’s say you want to know what color of flowers that
bees like. You plant four plots of flowers: one purple, one red, one blue,
and one white. You get a bee, put it in a dark jar, carry it to a point
equidistant from the four plots of flowers, and release it.

You record which color flower it goes to first, then re-capture it and hold it
prisoner until the experiment is done.

You do this again and again for 100 bees.

In this case, the observations are independent; the fact that bee #1 went to
a blue flower has no influence on where bee #2 goes.

This is a good experiment; if significantly more than 1/4 of the bees go to


the blue flowers, it would be good evidence that the bees prefer blue
flowers.

Now let’s say that you put a beehive at the point equidistant from the four
plots of flowers, and you record where the first 100 bees go.

If the first bee happens to go to the plot of blue flowers, it will go back to the
hive and do its bee-butt-wiggling dance that tells the other bees, “Go 15
meters southwest, there’s a bunch of yummy nectar there!”

Then some more bees will fly to the blue flowers, and when they return to
the hive, they’ll do the same bee-butt-wiggling dance.

The observations are NOT independent; where bee #2 goes is strongly


influenced by where bee #1 happened to go.
Power analysis

• Before you do an experiment, you should do a power analysis to


estimate the sample size you’ll need.

• When you are designing an experiment, it is a good idea to estimate


the sample size you’ll need. This is especially true if you’re proposing
to do something painful to vertebrates, where it is particularly
important to minimize the number of individuals (without making the
sample size so small that the whole experiment is a waste of time
and suffering), or if you’re planning a very time-consuming or
expensive experiment.

• Methods have been developed for many statistical tests to estimate


the sample size needed to detect a particular effect, or to estimate
the size of the effect that can be detected with a particular sample
size.

• In order to do a power analysis, you need to specify an effect size.

• Effect size- the size of the difference between your null


hypothesis and the alternative hypothesis that you hope to
detect.

• For applied and clinical biological research, there may be a very


definite effect size that you want to detect.

• For example, if you’re testing a new dog shampoo, the marketing


department at your company may tell you that producing the new
shampoo would only be worthwhile if it made dogs’ coats at least
25% shinier, on average. That would be your effect size, and you
would use it when deciding how many dogs you would need to put
through the canine reflectometer.

Parameters

There are four or five numbers involved in a power analysis. You


must choose the values for each one before you do the analysis. If you
don’t have a good reason for using a particular value, you can try different
values and look at the effect on sample size.

Effect size

The effect size is the minimum deviation from the null hypothesis that you
hope to detect.

Example:

• If you are treating hens with something that you hope will change the
sex ratio of their chicks, you might decide that the minimum change in
the proportion of sexes that you’re looking for is 10%.

• You would then say that your effect size is 10%. If you’re testing
something to make the hens lay more eggs, the effect size might be 2
eggs per month.

Alpha

• Alpha is the significance level of the test (the P value), the probability
of rejecting the null hypothesis even though it is true (a false positive).

• The usual value is alpha=0.05.

• Some power calculators use the one-tailed alpha, which is confusing,


since the two-tailed alpha is much more common.

Beta or power

Beta, in a power analysis, is the probability of accepting the null


hypothesis, even though it is false (a false negative), when the real
difference is equal to the minimum effect size.

The power of a test is the probability of rejecting the null hypothesis


(getting a significant result) when the real difference is equal to the
minimum effect size.
Standard deviation

• For measurement variables, you also need an estimate of the


standard deviation. As standard deviation gets bigger, it gets harder
to detect a significant difference, so you’ll need a bigger sample size.

• Your estimate of the standard deviation can come from pilot


experiments or from similar experiments in the published literature.

• For nominal variables, the standard deviation is a simple function of


the sample size, so you don’t need to estimate it separately.

Example

• You plan to cross peas that are heterozygotes for Yellow/green pea
color, where Yellow is dominant. The expected ratio in the offspring is
3 Yellow: 1 green.

• You want to know whether yellow peas are actually more or less fit,
which might show up as a different proportion of yellow peas than
expected. You arbitrarily decide that you want a sample size that will
detect a significant (P<0.05) difference if there are 3% more or fewer
yellow peas than expected, with a power of 90%.

You will test the data using the exact binomial test of goodness-of-fit if the
sample size is small enough, or a G-test of goodness-of-fit if the sample
size is larger. The power analysis is the same for both tests.

Conclusion of the Test

• Using G*Power for the exact test of goodness-of-fit, the result is that
it would take 2190 pea plants if you want to get a significant (P<0.05)
result 90% of the time, if the true proportion of yellow peas is 78% or 72%.

• That’s a lot of peas, but you’re reassured to see that it’s not a
ridiculous number. If you want to detect a difference of 0.1% between
the expected and observed numbers of yellow peas, you can
calculate that you’ll need 1,970,142 peas.
• If that’s what you need to detect, the sample size analysis tells you
that you’re going to have to include a pea-sorting robot in your
budget.

Note: G*Power (www.gpower.hhu.de/) is an excellent free program,


available for Mac and Windows, that will do power analyses for a large
variety of tests.
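If you prefer to stay in R, a similar calculation can be sketched with the pwr package (an add-on package that must be installed first); because it uses a normal approximation rather than the exact test, the sample size it returns is close to, but not exactly, the 2190 reported by G*Power.

library(pwr)            # install.packages("pwr") if not yet installed

h <- ES.h(0.78, 0.75)   # effect size for a proportion of 0.78 vs the null 0.75
pwr.p.test(h = h, sig.level = 0.05, power = 0.90, alternative = "two.sided")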

The power of a statistical test of a null hypothesis is the probability that it


will lead to the rejection of the null hypothesis, i.e., the probability that it will
result in the conclusion that the phenomenon exists.

Given the characteristics of a specific statistical test of the null


hypothesis and the state of affairs in the population, the power of the test
can be determined. It clearly represents a vital piece of information about a
statistical test applied to research data (cf. Cohen, 1962).

For example, the discovery, during the planning phase of an


investigation, that the power of the eventual statistical test is low should
lead to a revision in the plans.

As another example, consider a completed experiment which led to


nonrejection of the null hypothesis.

An analysis which finds that the power was low should lead one to
regard the negative results as ambiguous, since failure to reject the null
hypothesis cannot have much substantive meaning when, even though the
phenomenon exists (to some given degree), the probability of rejecting the
null hypothesis was low.

The Chi-squared Test

The chi-squared test is used to compare the distribution of a categorical


variable in a sample or a group with the distribution in another one. If the
distribution of the categorical variable is not much different over different
groups, we can conclude the distribution of the categorical variable is not
related to the variable of groups. Or we can say the categorical variable
and groups are independent. For example, if men have a specific condition
more than women, there is bigger chance to find a person with the
condition among men than among women. We don't think gender is
independent from the condition. If there is equal chance of having the
condition among men and women, we will find the chance of observing the
condition is the same regardless of gender and can conclude their
relationship as independent.

The chi-squared test performs an independency test under following


null and alternative hypotheses, H0 and H1, respectively.
H0: Independent (no association)
H1: Not independent (association)
The test statistic of the chi-squared test is

χ² = Σ (O − E)² / E

with degrees of freedom (r − 1)(c − 1), where O and E represent the observed
and expected frequencies, and r and c are the numbers of rows and columns of
the contingency table.

For a 2 x 2 contingency table, the chi-square statistic is calculated by the
formula:

χ² = (ad − bc)² (a + b + c + d) / [(a + b)(c + d)(b + d)(a + c)]

Note: notice that the four terms in the denominator are the four totals
from the table columns and rows.

We reject the null hypothesis of independence if the calculated chi-


squared statistic is larger than the critical value from the chi-squared
distribution. In the chi-squared distribution, the critical values are 3.84,
5.99, 7.82, and 9.49, with corresponding degrees of freedom of 1, 2, 3, and
4, respectively, at an alpha level of 0.05.
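In R, the chi-squared test of independence is available through chisq.test(); the 2 x 2 table below (condition by sex) is hypothetical, and the continuity correction is switched off so that the statistic matches the formula given above.

tab <- matrix(c(30, 15,     # condition present: men, women (hypothetical counts)
                20, 35),    # condition absent:  men, women
              nrow = 2, byrow = TRUE,
              dimnames = list(condition = c("yes", "no"), sex = c("men", "women")))

chisq.test(tab, correct = FALSE)
# Compare the reported X-squared with the critical value 3.84 (df = 1, alpha = 0.05).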

FISHER’S EXACT TEST

Chi-square test is not suitable when the sample is small. For studies
with small samples, the best method to apply is the Fisher’s exact test.
Independence tests are used to determine if there is a significant
relationship between two categorical variables. There are two different
types of independence test:

 the Chi-square test (the most common)


 the Fisher’s exact test

The Chi-square test is used when the sample is large enough (in this
case the p-value is an approximation that becomes exact when the sample
becomes infinite, which is the case for many statistical tests). On the other
hand, the Fisher’s exact test is used when the sample is small (and in this
case the p-value is exact and is not an approximation).

The p-value of the Fisher’s exact test is the sum of hypergeometric


probabilities for outcomes at least as favorable to the alternative as the
observed outcome.

The statistical hypothesis that can be formulated for Fisher exact test
is exactly the same as that for chi-square test. If the computed probability
value of the Fisher exact test is less than the standard cut-off p-value of
0.05, then we reject the null hypothesis and conclude that there is an
association between the column variable and the row variable or the
proportion with the characteristic of interest is not the same in both
populations.

The Fisher’s exact test calculates the exact probability of the table of
observed cell frequencies given the following assumptions:

 The null hypothesis of independence is true.

 The marginal totals of the observed table are fixed.

 Calculation of the probability of the observed cell frequencies uses the
factorial mathematical operation. Factorial is notated by !, which means
multiply the number by all positive integers smaller than it.
Example: 7! = 7*6*5*4*3*2*1 = 5040.

Fisher’s Exact Test: Calculation

Consider a 2 x 2 table with cells a, b, c, d, row totals (a+b) and (c+d),
column totals (a+c) and (b+d), and grand total n:

a      b      a+b
c      d      c+d
a+c    b+d    n

If the margins of the table are fixed, the exact probability of the table is

p = [(a+b)! * (c+d)! * (a+c)! * (b+d)!] / [n! * a! * b! * c! * d!]

Fisher’s Exact Test: Calculation Example

1      8      9
4      5      9
5     13     18

The exact probability of this table is

p = (9! * 9! * 5! * 13!) / (18! * 1! * 8! * 4! * 5!)
  = 136,080 / 1,028,160
  = 0.132

The p-value for the Fisher’s exact test is calculated by summing all
probabilities less than or equal to the probability of the observed table.

 The probability is smallest for the tables that are least likely to occur
by chance if the null hypothesis of independence is true.

Hypotheses

The hypotheses of the Fisher’s exact test are the same as those for the Chi-
square test, that is:
 H0: the variables are independent, there is no relationship between
the two categorical variables. Knowing the value of one variable does
not help to predict the value of the other variable
 H1: the variables are dependent, there is a relationship between the
two categorical variables. Knowing the value of one variable helps to
predict the value of the other variable

Remember that the Fisher’s exact test is used when there is at least one
cell in the contingency table with an expected frequency below 5. Fisher's
exact test is applied in practice mainly to the analysis of small samples, but
it is actually valid for all sample sizes. While the chi-squared test relies on
an approximation, Fisher's exact test is an exact test. In particular, when
more than 20% of cells have expected frequencies < 5, we need to use
Fisher's exact test because the approximation method is inadequate.

Fisher's exact test assesses the null hypothesis of independence


applying hypergeometric distribution of the numbers in the cells of the
table. Because of the complicated computation for the Fisher exact test, we
rely on statistical software to do the computation such as the SPSS
statistical package.
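In R, the worked 2 x 2 example above can be analyzed directly with fisher.test(), which performs the hypergeometric calculation for you.

tab <- matrix(c(1, 8,
                4, 5), nrow = 2, byrow = TRUE)

fisher.test(tab)
# The two-sided p-value sums the probabilities of all tables (with the same margins)
# that are no more likely than the observed one; here it comes to roughly 0.29,
# so there is no evidence against independence.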

Central Tendency

How well did my students do on the last test? What is the average
price of gasoline in the Phoenix metropolitan area? What is the mean
number of home runs hit in the National League? These questions are
asking for a statistic that describes a large set of data. In this section we
will study the mean, median, and mode. These three statistics describe an
average or center of a distribution of numbers.

Sigma notation Σ

• The sigma notation is a shorthand notation used to sum up a large


number of terms.
• Σx = x1+x2+x3+ … +xn
• One uses this notation because it is more convenient to write the sum
in this fashion.
Definition of the mean

• Given a sample of n data points, x1, x2, x3, … xn, the formula for the
mean or average is given below.

x̄ = (Σx) / n, i.e., the sum of the data points divided by the number of data points.

Find the mean.

• My 5 test scores for Calculus I are 95, 83, 92, 81, 75. What is the
mean?
• ANSWER: sum up all the tests and divide by the total number of
tests.
• Test mean = (95+83+92+81+75)/5 = 85.2

Example with a range of data

When you are given a range of data, you need to find midpoints. To find a
midpoint, sum the two endpoints on the range and divide by 2. Example
14≤x<18. The midpoint (14+18)/2=16. The total number of students is
5,542,000.

Age of males      Number of students
14≤x<18           94,000
18≤x<20           1,551,000
20≤x<22           1,420,000
22≤x<25           1,091,000
25≤x<30           865,000
30≤x<35           521,000
Total             5,542,000
• What we need to do is find the midpoints of the ranges and then
multiply them by the frequency. So that we can compute the mean.
• The midpoints are 16, 19, 21, 23.5, 27.5, 32.5.
• The mean is [16(94,000) + 19(1,551,000) + 21(1,420,000) + 23.5(1,091,000) +
27.5(865,000) + 32.5(521,000)] / 5,542,000 = 22.94.
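The same grouped mean can be obtained in R as a weighted mean of the class midpoints, weighted by the number of students in each class.

midpoints <- c(16, 19, 21, 23.5, 27.5, 32.5)
students  <- c(94000, 1551000, 1420000, 1091000, 865000, 521000)

weighted.mean(midpoints, w = students)   # about 22.94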

The median

The median is the middle value of a distribution of data.

How do you find the median? First, if possible or feasible, arrange the data
from smallest value to largest value. The location of the median can be
calculated using this formula: (n+1)/2. If (n+1)/2 is a whole number then
that value gives the location. Just report the value of that location as the
median.

If (n+1)/2 is not a whole number then the first whole number less than the
location value and the first whole number greater than the location value
will be used to calculate the median. Take the data located at those 2
values and calculate the average, this is the median.

Find the median.

• Here are a bunch of 10 point quizzes from MAT117:


• 9, 6, 7, 10, 9, 4, 9, 2, 9, 10, 7, 7, 5, 6, 7
• As you can see there are 15 data points.
• Now arrange the data points in order from smallest to largest.
• 2, 4, 5, 6, 6, 7, 7, 7, 7, 9, 9, 9, 9, 10, 10
• Calculate the location of the median: (15+1)/2=8. The eighth piece of
data is the median. Thus the median is 7.
• By the way, what is the mean? It’s 7.13…

The mode

• The mode is the most frequent number in a collection of data.


• Example A: 3, 10, 8, 8, 7, 8, 10, 3, 3, 3
• The mode of the above example is 3, because 3 has a frequency of
4.
• Example B: 2, 5, 1, 5, 1, 2
• This example has no mode because 1, 2, and 5 have a frequency of
2.
• Example C: 5, 7, 9, 1, 7, 5, 0, 4
• This example has two modes 5 and 7. This is said to be bimodal.
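In R, mean() and median() are built in, but there is no built-in mode for data (the mode() function reports an object's storage type instead), so a small helper function is sketched below and applied to the quiz scores used in the median example above.

quiz <- c(9, 6, 7, 10, 9, 4, 9, 2, 9, 10, 7, 7, 5, 6, 7)

mean(quiz)      # about 7.13
median(quiz)    # 7

stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])   # may return more than one mode
}
stat_mode(quiz)   # 7 and 9 each appear four times, so these data are bimodal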

Dispersion

Summarizing data from a measurement variable requires a number that


represents the "middle" of a set of numbers (known as a "statistic of central
tendency" or "statistic of location"), along with a measure of the "spread" of
the numbers (known as a "statistic of dispersion"). You use a statistic of
dispersion to give a single number that describes how compact or spread
out a set of observations is. Although statistics of dispersion are usually not
very interesting by themselves, they form the basis of most statistical tests
used on measurement variables. A statistical tool or software is used to
calculate the following.

Range: This is simply the difference between the largest and smallest
observations. This is the statistic of dispersion that people use in everyday
conversation; if you were telling your Uncle about your research on the
giant deep-sea isopod Bathynomus giganteus, you wouldn't blather about
means and standard deviations, you'd say they ranged from 4.4 to 36.5 cm
long (Briones-Fourzán and Lozano-Alvarez 1991). Then you'd explain that
isopods are roly-polies, and 36.5 cm is about 14 American inches, and
Uncle would finally be impressed, because a roly-poly that's over a foot
long is pretty impressive.

Range is not very informative for statistical purposes. The range


depends only on the largest and smallest values, so that two sets of data
with very different distributions could have the same range, or two samples
from the same population could have very different ranges, purely by
chance. In addition, the range increases as the sample size increases; the
more observations you make, the greater the chance that you'll sample a
very large or very small value.

Sum of squares: This is not really a statistic of dispersion by itself, but it


forms the basis of the variance and standard deviation. Subtract the mean
from an observation and square this "deviate". Squaring the deviates
makes all of the squared deviates positive and has other statistical
advantages. Do this for each observation, then sum these squared
deviates. This sum of the squared deviates from the mean is known as the
sum of squares.
Sample variance: You almost always have a sample of observations
that you are using to estimate a population parameter. To get an unbiased
estimate of the population variance, divide the sum of squares by n−1, not
by n.

Standard deviation: In addition to being more understandable than the


variance as a measure of the amount of variation in the data, the standard
deviation summarizes how close observations are to the mean in an
understandable way. Many variables in biology fit the normal probability
distribution well.

Coefficient of variation. Coefficient of variation is the standard


deviation divided by the mean; it summarizes the amount of variation as a
percentage or proportion of the total. It is useful when comparing the
amount of variation for one variable among groups with different means, or
among different measurement variables.

How to calculate the statistics

There are a number of web pages that calculate range, variance, and
standard deviation, along with other descriptive statistics. Some of them
are given below.

Salvatore Mangiafico's R Companion has a sample R program for


calculating range, sample variance, standard deviation, and coefficient of
variation.

SAS
PROC UNIVARIATE will calculate the range, variance, standard
deviation and coefficient of variation. It calculates the sample variance and
sample standard deviation.
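Base R can also compute these statistics directly; the isopod lengths below are hypothetical values used only to show the functions.

lengths <- c(4.4, 12.0, 18.3, 25.1, 30.7, 36.5)   # hypothetical lengths (cm)

diff(range(lengths))                  # range: largest minus smallest
sum((lengths - mean(lengths))^2)      # sum of squares
var(lengths)                          # sample variance (divides by n - 1)
sd(lengths)                           # sample standard deviation
100 * sd(lengths) / mean(lengths)     # coefficient of variation, as a percentage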
Standard Error

Standard error of the mean (SEM):

The standard error (to be more precise, the standard error of the
mean) is a property of our estimate of the mean. The SEM is equal to the
SD divided by the square root of n.

This quantity tells us how our estimate of the mean will vary from
sample to sample (these are theoretical samples, if we could redo our
exact study many times and compute the sample mean over and over
again and look at how it varies). It is a summary of how precise our
estimate is (as we expect, as sample size increases, our ability to estimate
the mean precisely improves, so the SEM decreases). See the difference?
Standard Deviation SD is concerned with the scatter of individual data
points in the population, while the SEM is concerned with the variability of
our estimate of the mean.

It is clear that the SEM will always be smaller than the SD (which is
not a function of sample size).

Standard Error Calculation Procedure:

Step 1: Calculate the mean (Total of all samples divided by the


number of samples).

Step 2: Calculate each measurement's deviation from the mean


(Mean minus the individual measurement).

Step 3: Square each deviation from mean. Squared negatives


become positive.

Step 4: Sum the squared deviations (Add up the numbers from step
3).

Step 5: Divide that sum from step 4 by one less than the sample size
(n-1, that is, the number of measurements minus one)
Step 6: Take the square root of the number in step 5. That gives you
the "standard deviation (S.D.)."

Step 7: Divide the standard deviation by the square root of the


sample size (n). That gives you the “standard error”.

Step 8: Subtract the standard error from the mean and record that
number. Then add the standard error to the mean and record that number.
You have plotted mean± 1 standard error (S. E.), the distance from 1
standard error below the mean to 1 standard error above the mean.
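The eight steps above collapse to a couple of lines in R, since sd() already performs steps 2 to 6; the data below are hypothetical.

x <- c(12, 15, 11, 14, 13, 16, 12, 15)   # hypothetical measurements

m  <- mean(x)                        # step 1: the mean
se <- sd(x) / sqrt(length(x))        # steps 2-7: SD divided by the square root of n
c(lower = m - se, upper = m + se)    # step 8: mean +/- 1 standard error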

Confidence Limits for the Mean

Confidence limits for the mean (Snedecor and Cochran, 1989) are an
interval estimate for the mean. Interval estimates are often desirable
because the estimate of the mean varies from sample to sample. Instead of
a single estimate for the mean, a confidence interval generates a lower and
upper limit for the mean. The interval estimate gives an indication of how
much uncertainty there is in our estimate of the true mean. The narrower
the interval, the more precise is our estimate.

Confidence limits are expressed in terms of a confidence coefficient.


Although the choice of confidence coefficient is somewhat arbitrary, in
practice 90 %, 95 %, and 99 % intervals are often used, with 95 % being
the most commonly used.

As a technical note, a 95 % confidence interval does not mean that


there is a 95 % probability that the interval contains the true mean. The
interval computed from a given sample either contains the true mean or it
does not. Instead, the level of confidence is associated with the method of
calculating the interval. The confidence coefficient is simply the proportion
of samples of a given size that may be expected to contain the true mean.
That is, for a 95 % confidence interval, if many samples are collected and
the confidence interval computed, in the long run about 95 % of these
intervals would contain the true mean.

Confidence limits are defined as:

Y¯ ± t1-α/2, N-1 · (s / √N)

where Y¯ is the sample mean, s is the sample standard deviation, N is the
sample size, α is the desired significance level, and t1-α/2, N-1 is the
100(1-α/2) percentile of the t distribution with N - 1 degrees of freedom.
Note that the confidence coefficient is 1 - α.

From the formula, it is clear that the width of the interval is controlled by two
factors:

1. As N increases, the interval gets narrower from the √N term.

That is, one way to obtain more precise estimates for the mean is to
increase the sample size.

2. The larger the sample standard deviation, the larger the confidence
interval. This simply means that noisy data, i.e., data with a large
standard deviation, are going to generate wider intervals than data
with a smaller standard deviation.

Confidence limits for the mean can be used to answer the following
questions:

1. What is a reasonable estimate for the mean?


2. How much variability is there in the estimate of the mean?
3. Does a given target value fall within the confidence limits?

Confidence limits for the mean are available in just about all general-
purpose statistical software programs. Both Dataplot code and R code can
be used to generate the analyses in this section.
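As a sketch of the R route, the 95 % confidence limits can be computed either from the formula above or taken directly from t.test(); the data are hypothetical.

x <- c(12, 15, 11, 14, 13, 16, 12, 15)   # hypothetical sample

n     <- length(x)
alpha <- 0.05
se    <- sd(x) / sqrt(n)
mean(x) + c(-1, 1) * qt(1 - alpha/2, df = n - 1) * se   # limits from the formula

t.test(x, conf.level = 0.95)$conf.int                   # the same interval from t.test()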

Learning Task:

1. When do you use exact test of goodness of fit, chi-square, and


fisher’s test?
2. The systolic blood pressure (in mmHg) of 15 consecutive patients
entering the surgical ICU at a large urban hospital are as follows:
100, 120, 140, 110, 90, 130, 90, 60, 90, 130, 90, 130, 150, 90, and
105. Find the: a) mean b) median c) mode
3. Give an example where a standard error is significant to use.

One Sample T-Test

This procedure provides several reports for making inference about a


population mean based on a single sample.
For the one-sample situation, the typical concern in research is
examining a measure of central tendency (location) for the population of
interest. The best-known measures of location are the mean and
median. For a one-sample situation, we might want to know if the
average waiting time in a doctor’s office is greater than one hour, if the
average refund on a 1040 tax return is different from $500, if the
average assessment for similar residential properties is less than
$120,000, or if the average growth of roses is 4 inches or more after two
weeks of treatment with a certain fertilizer. One early concern should be
whether the data are normally distributed. If normality can safely be
assumed, then the one-sample t-test is the best choice for assessing
whether the measure of central tendency, the mean, is different from a
hypothesized value. On the other hand, if normality is not valid, one of
the nonparametric tests, such as the Wilcoxon Signed Rank test or the
quantile test, can be applied.

Data Structure
For this procedure, the data are entered as a single column and
specified as a response variable. Multiple columns can be analyzed
individually during the same run if multiple response variables are
specified.
Weight
159
155
157
125
103
122
101
82
228
199
195
110
191
151
119
119
112
87
190
87

Null and Alternative Hypotheses


The basic null hypothesis is that the population mean is equal to a
hypothesized value,
𝐻0: 𝜇 = Hypothesized value
with three common alternative hypotheses,
𝐻a: 𝜇 ≠ Hypothesized value
𝐻a: 𝜇 < Hypothesized value
𝐻a: 𝜇 > Hypothesized value,
one of which is chosen according to the nature of the experiment or study.

One-Sample T-Test Assumptions


The assumptions of the one-sample t-test are:
1. The data are continuous (not discrete).
2. The data follow the normal probability distribution.
3. The sample is a simple random sample from its population. Each
individual in the population has an equal probability of being selected in the
sample.

Running a One-Sample Analysis

Using the NCSS Statistical Software, this section presents an example of


how to run a one-sample analysis. The data are the Weight data shown
above and found in the Weight dataset. The data can be found under the
column labeled Weight. A researcher wishes to test whether the mean
weight of the population of interest is different from 130.

Setup
To run this example, complete the following steps:
1. Open the Weight example dataset
• From the File menu of the NCSS Data window, select Open Example
Data.
• Select Weight and click OK.
2. Specify the One-Sample T-Test procedure options
• Find and open the One-Sample T-Test procedure using the menus or the
Procedure Navigator.
• The settings for this example are listed below and are stored in the
Example 1 settings template. To load this template, click Open Example
Template in the Help Center or File menu.

Option Value

Variables Tab Response

Variable(s).............................................................Weight

Reports Tab

Descriptive Statistics ………………………………………….Checked


Confidence Level....................................................95

Limits .....................................................................Two-Sided

Confidence Interval of μ with σ Unknown.............. Checked

Confidence Interval of μ with σ Known.................. Checked

σ.............................................................................. 40

Confidence Interval of the Median......................... Checked

Bootstrap Confidence Intervals ............................. Checked

Sub-Options......................................................... Default Values

Confidence Interval of σ ........................................ Checked


Alpha...................................................................... 0.05

H0 μ = .................................................................... 130

Ha ..........................................................................Two-Sided and One-


Sided (Usually a single alternative hypothesis is chosen, but all three
alternatives are shown in this example to exhibit all the reporting options.)

T-Test .................................................................... Checked

Power Report for T-Test ........................................ Checked

Z-Test .................................................................... Checked


σ........................................................................... 40

Randomization Test............................................... Checked

Monte Carlo Samples .......................................... 10000

Quantile (Sign) Test............................................... Checked

Quantile Test Proportion...................................... 0.5

Wilcoxon Signed-Rank Test .................................. Checked

Sub-Options......................................................... Default Values


Tests of Assumptions ............................................ Checked

Assumptions Alpha .................................................. 0.05

Plots Tab

All Plots.................................................................. Checked

3. Run the procedure.

• Click the Run button to perform the calculations and generate the output

T-Test Section

This section presents the results of the traditional one-sample T-test. Here,
reports for all three alternative hypotheses are shown, but a researcher
would typically choose one of the three before generating the output. All
three tests are shown here for the purpose of exhibiting all the output
options available.

Mean - This is the average of the data values.

Standard Error - This is the estimated standard deviation of the distribution


of sample means for an infinite population.

T-Statistic

The T-Statistic is the value used to produce the p-value (Prob Level) based
on the T distribution.

d.f.

The degrees of freedom define the T distribution upon which the probability
values are based. The formula for the degrees of freedom is df = n − 1.

Prob Level

The probability level, also known as the p-value or significance level, is the
probability that the test statistic will take a value at least as extreme as the
observed value, assuming that the null hypothesis is true. If the p-value is
less than the prescribed α, in this case 0.05, the null hypothesis is rejected
in favor of the alternative hypothesis. Otherwise, there is not sufficient
evidence to reject the null hypothesis.

Reject H0 at α = (0.050)

This column indicates whether or not the null hypothesis is rejected, in


favor of the alternative hypothesis, based on the p-value and chosen α. A
test in which the null hypothesis is rejected is sometimes called significant.

 Statisticians use statistical packages/software such as SPSS or NCSS
that do the work of computing the one-sample t-test.
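If NCSS is not available, the same one-sample test can be run in R (which is used for the two-sample examples later in this module), using the Weight data listed above and the hypothesized mean of 130.

weight <- c(159, 155, 157, 125, 103, 122, 101, 82, 228, 199,
            195, 110, 191, 151, 119, 119, 112, 87, 190, 87)

t.test(weight, mu = 130)    # two-sided test of H0: mu = 130
# Setting alternative = "less" or "greater" gives the one-sided versions.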

Two Sample T-Test

What is the two-sample t-test?

The two-sample t-test (also known as the independent samples t-test)


is a method used to test whether the unknown population means of two
groups are equal or not.

When can I use the test?

You can use the test when your data values are independent, are
randomly sampled from two normal populations and the two independent
groups have equal variances.

What if I have more than two groups?

Use a multiple comparison method. Analysis of variance (ANOVA) is


one such method. Other multiple comparison methods include the Tukey-
Kramer test of all pairwise differences, analysis of means (ANOM) to
compare group means to the overall mean or Dunnett’s test to compare
each group mean to a control mean.

What if the variances for my two groups are not equal?


You can still use the two-sample t-test. You use a different estimate
of the standard deviation.

Using the two-sample t-test

What do we need?

For the two-sample t-test, we need two variables. One variable


defines the two groups. The second variable is the measurement of
interest. We also have an idea, or hypothesis, that the means of the
underlying populations for the two groups are different.

Example:

 We measure the grams of protein in two different brands of energy


bars. Our two groups are the two brands. Our measurement is the
grams of protein for each energy bar. Our idea is that the mean
grams of protein for the underlying populations for the two brands
may be different. We want to know if we have evidence that the mean
grams of protein for the two brands of energy bars is different or not.

Two sample t-test assumptions


To conduct a valid test:

 Data values must be independent. Measurements for one


observation do not affect measurements for any other observation.
 Data in each group must be obtained via a random sample from the
population.
 Data in each group are normally distributed.
 Data values are continuous.
 The variances for the two independent groups are equal.
For very small groups of data, it can be hard to test these requirements.
Below, we'll discuss how to check the requirements using software and
what to do when a requirement isn’t met.

Two-sample t-test example

One way to measure a person’s fitness is to measure their body fat


percentage. Average body fat percentages vary by age, but according to
some guidelines, the normal range for men is 15-20% body fat, and the
normal range for women is 20-25% body fat.

Our sample data is from a group of men and women who did
workouts at a gym three times a week for a year. Then, their trainer
measured the body fat. The table below shows the data.

Table 1: Body fat percentage data grouped by gender

Group      Body Fat Percentages

Men        13.3  6.0   20.0  8.0   14.0  19.0  18.0  25.0  16.0  24.0  15.0  1.0   15.0

Women      22.0  16.0  21.7  21.0  30.0  26.0  12.0  23.2  28.0  23.0

You can clearly see some overlap in the body fat measurements for
the men and women in our sample, but also some differences. Just by
looking at the data, it's hard to draw any solid conclusions about whether
the underlying populations of men and women at the gym have the same
mean body fat. That is the value of statistical tests – they provide a
common, statistically valid way to make decisions, so that everyone makes
the same decision on the same set of data values.

Checking the data

Let’s start by answering: Is the two-sample t-test an appropriate method to


evaluate the difference in body fat between men and women?

 The data values are independent. The body fat for any one person
does not depend on the body fat for another person.
 We assume the people measured represent a simple random sample
from the population of members of the gym.
 We assume the data are normally distributed, and we can check this
assumption.
 The data values are body fat measurements. The measurements are
continuous.
 We assume the variances for men and women are equal, and we can
check this assumption.
Figure 1: Histogram and summary statistics for the body fat data

The two histograms are on the same scale. From a quick look, we
can see that there are no very unusual points, or outliers. The data look
roughly bell-shaped, so our initial idea of a normal distribution seems
reasonable.

Examining the summary statistics, we see that the standard


deviations are similar. This supports the idea of equal variances. We can
also check this using a test for variances.
Based on these observations, the two-sample t-test appears to be an
appropriate method to test for a difference in means.
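As a sketch, the same comparison can be run in R with t.test(), assuming equal variances as argued above (the original example was analyzed in other software, so treat this as an illustration of the mechanics rather than a reproduction of its output).

men   <- c(13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0)
women <- c(22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0)

t.test(men, women, var.equal = TRUE)   # pooled (equal-variance) two-sample t-test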

Data format

There are two ways to enter and store the data for two-sample t-test.
The first is very similar to one-sample t-tests, except you have two columns
instead of one: one column for each sample. This first method is simpler to
use with the function t.test(). However, it often confuses students when
they think about whether the predictor and response variables are
continuous or categorical. The second is more typical of how researchers
store data. There are also two columns, but one column specifies the
sample (e.g., group 1 or group 2) and the other specifies the data you will
analyze. This format makes it clear that we have two variables, one is
categorical and the other is continuous. However, it requires a little more
effort to pull out the data for each group. Below are examples of both.
You can enter your data directly into R or enter into Excel and use the
function read.csv to get it into R. Below I have entered the data directly into
R with the function c() and then used the function data.frame() to create a
data.frame. I named the columns of the data Sample1 and Sample2.
#First format
s1 <- c(23, 45, 34, 37, 29, 44, 40, 34)
s2 <- c(12, 20, 19, 18, 22, 14, 17, 17)
(twoSampleData <- data.frame(Sample1 = s1, Sample2 = s2))

## Sample1 Sample2
## 1 23 12
## 2 45 20
## 3 34 19
## 4 37 18
## 5 29 22
## 6 44 14
## 7 40 17
## 8 34 17
#Second format
sample <- rep(c("s1", "s2"), each = 8)
data <- c(23, 45, 34, 37, 29, 44, 40, 34, 12, 20, 19, 18, 22, 14, 17, 17)
(twoSampleData2 <- data.frame(Sample = sample, Data = data))

## Sample Data
## 1 s1 23
## 2 s1 45
## 3 s1 34
## 4 s1 37
## 5 s1 29
## 6 s1 44
## 7 s1 40
## 8 s1 34
## 9 s2 12
## 10 s2 20
## 11 s2 19
## 12 s2 18
## 13 s2 22
## 14 s2 14
## 15 s2 17
## 16 s2 17

Both formats have exactly the same data, and you can convert one to the
other.
Below is how to pull data from each format.
#Pull data from the first format
twoSampleData$Sample1

## [1] 23 45 34 37 29 44 40 34
twoSampleData$Sample2

## [1] 12 20 19 18 22 14 17 17

#Pull data from the second format


twoSampleData2$Data[twoSampleData2$Sample == "s1"]

## [1] 23 45 34 37 29 44 40 34

twoSampleData2$Data[twoSampleData2$Sample == "s2"]

## [1] 12 20 19 18 22 14 17 17

P-value
We look up the P-value exactly the same way as for the one-sample t-test.
We need to know whether the test is a 1- or 2-tailed test. Then we can
calculate the probability of getting our t-value or something more extreme
for the appropriate degrees of freedom. Let us assume that we are
interested in just a difference between our samples or groups, and thus the
test should be 2-tailed (because the biological hypotheses are two-sided).
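The object tTwoSample used in the next line is the two-sample t statistic; its calculation does not appear in this excerpt, so here is a minimal sketch, assuming a pooled (equal-variance) standard error, of how it can be computed from twoSampleData.

x1 <- twoSampleData$Sample1
x2 <- twoSampleData$Sample2
n1 <- length(x1); n2 <- length(x2)
pooledVar   <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)
(tTwoSample <- (mean(x1) - mean(x2)) / sqrt(pooledVar * (1/n1 + 1/n2)))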
2*pt(-tTwoSample, length(twoSampleData$Sample1) +
length(twoSampleData$Sample2) - 2)

## [1] 1.612091e-05

Two sample, paired T-Test


The two-sample, paired t-test is very similar to the one-sample t-test
because you analyze the difference between each pair and thus have just
one variable (the difference between your two variables for each pair). After
you calculate the difference for each pair, everything is the same as the
one-sample t-test.
Let us assume that the data we just analyzed is paired and appropriate for
a two-sample, paired t-test. Also let’s assume that we expect sample 1 will
be less than sample 2. Therefore we expect the difference will be negative,
i.e., less than the hypothesized value of zero. So, we are interested in the left
(negative) tail and should
perform a one-tailed test. First calculate the difference for each pair and
then calculate the t-value with the calculation for a one-sample t-test.
dif <- twoSampleData$Sample1 - twoSampleData$Sample2
meanDif <- mean(dif)
seDif <- sd(dif)/sqrt(length(dif))
(tDif <- (meanDif - 0)/seDif)

## [1] 6.893631

Now we need to calculate the P-value. Remember we want the left-hand


tail because we expected a negative difference between Sample 1 and
Sample 2 (i.e., Sample 1 - Sample 2 < 0).
pt(tDif, length(dif)-1)

## [1] 0.9998837

Do we reject or fail to reject the null hypothesis? Remember the null


hypothesis is that the difference between the samples is equal to or greater
than zero. The alternative is the difference is less than zero.
Now let’s use the function t.test(). There is an argument paired which, as you
might have guessed, toggles between a paired and an unpaired t-test. When the
argument is set to TRUE the test is paired, and when it is FALSE the test is
unpaired. The default value is FALSE.
t.test(twoSampleData$Sample1, twoSampleData$Sample2, paired = T, alternative
= "less")

##
## Paired t-test
##
## data: twoSampleData$Sample1 and twoSampleData$Sample2
## t = 6.8936, df = 7, p-value = 0.9999
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 23.42501
## sample estimates:
## mean of the differences
## 18.375

We didn’t need to worry about the argument var.equal because the test is
really a one-sample test: there is only one variable (the difference).
We could also analyze the difference with a one-sample t-test, which will
give us the same answer as above.
t.test(dif, alternative = "less")

##
## One Sample t-test
##
## data: dif
## t = 6.8936, df = 7, p-value = 0.9999
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
## -Inf 23.42501
## sample estimates:
## mean of x
## 18.375

The P-value is very large because it is calculated from the left-hand


(negative) tail, but the t-value is positive.

Draw your conclusion:


 If the absolute value of your calculated t value is greater than the critical
t-value from the table, you can conclude that the difference between the means
for the two groups is statistically significant. We reject the null hypothesis
in favor of the alternative hypothesis.

 If the absolute value of your calculated t value is lower than the critical
t-value from the table, you can conclude that the difference between the means
for the two groups is NOT statistically significant. We fail to reject the null
hypothesis.

Independence

Most statistical tests assume that you have a sample of independent


observations, meaning that the value of one observation does not affect the
value of other observations. Non-independent observations can make your
statistical test give too many false positives.

A dependence is a connection between your data. For example, how


much you earn depends upon how many hours you
work. Independence means there isn’t a connection. For example, how
much you earn isn’t connected to what you ate for breakfast. The
assumption of independence means that your data isn’t connected in any
way (at least, in ways that you haven’t accounted for in your model).

 The assumption of independence is used for T Tests, in ANOVA


tests, and in several other statistical tests. It’s essential to getting
results from your sample that reflect what you would find in
a population. Even the smallest dependence in your data can turn
into heavily biased results (which may be undetectable) if you violate
this assumption.

Measurement variables

One of the assumptions of most tests is that the observations are


independent of each other. This assumption is violated when the value of
one observation tends to be too similar to the values of other observations.
For example, let's say you wanted to know whether calico cats had a
different mean weight than black cats. You get five calico cats, five black
cats, weigh them, and compare the mean weights with a two-sample t–test.
If the five calico cats are all from one litter, and the five black cats are all
from a second litter, then the measurements are not independent. Some
cat parents have small offspring, while some have large; so if Josie the
calico cat is small, her sisters Valerie and Melody are not independent
samples of all calico cats, they are instead also likely to be small. Even if
the null hypothesis (that calico and black cats have the same mean weight)
is true, your chance of getting a P value less than 0.05 could be much
greater than 5%.
A common source of non-independence is that observations are close
together in space or time. For example, let's say you wanted to know
whether tigers in a zoo were more active in the morning or the evening. As
a measure of activity, you put a pedometer on Sally the tiger and count the
number of steps she takes in a one-minute period. If you treat the number
of steps Sally takes between 10:00 and 10:01 a.m. as one observation, and
the number of steps between 10:01 and 10:02 a.m. as a separate
observation, these observations are not independent. If Sally is sleeping
from 10:00 to 10:01, she's probably still sleeping from 10:01 to 10:02; if
she's pacing back and forth between 10:00 and 10:01, she's probably still
pacing between 10:01 and 10:02. If you take five observations between
10:00 and 10:05 and compare them with five observations you take
between 3:00 and 3:05 with a two-sample t–test, there's a good chance you'll
get five low-activity measurements in the morning and five high-activity
measurements in the afternoon, or vice-versa. This increases your chance
of a false positive; if the null hypothesis is true, lack of independence can
give you a significant P value much more than 5% of the time.
There are other ways you could get lack of independence in your tiger
study. For example, you might put pedometers on four other tigers—Bob,
Janet, Ralph, and Loretta—in the same enclosure as Sally, measure the
activity of all five of them between 10:00 and 10:01, and treat that as five
separate observations. However, it may be that when one tiger gets up and
starts walking around, the other tigers are likely to follow it around and see
what it's doing, while at other times all five tigers are likely to be resting.
That would mean that Bob's amount of activity is not independent of Sally's;
when Sally is more active, Bob is likely to be more active.

Regression and correlation also assume that the observations are independent of each other: the value of one data point does not depend on the value of any other data point. The most common
violation of this assumption in regression and correlation is in time series
data, where some Y variable has been measured at different times. For
example, biologists have counted the number of moose on Isle Royale, a
large island in Lake Superior, every year. Moose live a long time, so the
number of moose in one year is not independent of the number of moose in
the previous year, it is highly dependent on it; if the number of moose in
one year is high, the number in the next year will probably be pretty high,
and if the number of moose is low one year, the number will probably be
low the next year as well. This kind of non-independence, or
"autocorrelation," can give you a "significant" regression or correlation
much more often than 5% of the time, even when the null hypothesis of no
relationship between time and Y is true. If both X and Y are time series—for
example, you analyze the number of wolves and the number of moose on
Isle Royale—you can also get a "significant" relationship between them
much too often.

If one of the measurement variables is time, or if the two variables are measured at different times, the data are often non-independent. As another example, if I wanted to know whether I was losing weight, I could weigh myself every day and then do a regression of weight vs. day. However, my
weight on one day is very similar to my weight on the next day. Even if the
null hypothesis is true that I'm not gaining or losing weight, the non-
independence will make the probability of getting a P value less than 0.05
much greater than 5%.
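
To make the weight-vs.-day problem concrete, here is a minimal R sketch using simulated, hypothetical numbers (not real measurements). It builds a random walk with no true trend and regresses it on day; the strong day-to-day dependence is what often fools the test into reporting a small P value.

    # Hypothetical daily weights: a random walk around 70 kg with no true trend.
    # Because each day's value depends strongly on the previous day's value,
    # a regression of weight on day will often report a "significant" slope
    # even though the null hypothesis of no trend is true.
    set.seed(7)
    day    <- 1:100
    weight <- 70 + cumsum(rnorm(100, mean = 0, sd = 0.2))
    summary(lm(weight ~ day))   # the P value here should not be trusted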

Nominal variables

Tests of nominal variables (independence or goodness-of-fit) also assume that individual observations are independent of each other. To illustrate this, let's say I want to know whether my statistics class is more boring than my evolution class. I set up a video camera observing the students in one lecture of each class, then count the number of students who yawn at least once. In statistics, 28 students yawn and 15 don't yawn; in evolution, 6 yawn and 50 don't yawn. The observations within each class
are not independent of each other. Yawning is contagious (so contagious
that you're probably yawning right now, aren't you?), which means that if
one person near the front of the room in statistics happens to yawn, other
people who can see the yawner are likely to yawn as well. So the
probability that Ashley in statistics yawns is not independent of whether Sid
yawns; once Sid yawns, Ashley will probably yawn as well, and then
Megan will yawn, and then Dave will yawn.
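
As a rough illustration only, here is how the comparison would be run in R if the students really were independent observations; the point is that the chi-square machinery itself cannot detect the contagious-yawning problem, so the assumption has to be checked by thinking about the biology, not by the software.

    # Yawning counts from the statistics vs. evolution example above
    yawns <- matrix(c(28, 15,
                       6, 50),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(class  = c("statistics", "evolution"),
                                    yawned = c("yes", "no")))
    chisq.test(yawns)
    # Because yawning is contagious, the students are not independent
    # observations, so the small P value this test reports cannot be trusted.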

Solutions for lack of independence

It is not easy to look at your data and see whether the data are non-
independent. You need to understand the biology of your organisms and
carefully design your experiment so that the observations will be
independent. For your comparison of the weights of calico cats vs. black
cats, you should know that cats from the same litter are likely to be similar
in weight; you could therefore make sure to sample only one cat from each
of many litters. You could also sample multiple cats from each litter but
treat "litter" as a second nominal variable and analyze the data
using nested anova. For Sally the tiger, you might know from previous
research that bouts of activity or inactivity in tigers last for 5 to 10 minutes,
so that you could treat one-minute observations made an hour apart as
independent. Or you might know from previous research that the activity of
one tiger has no effect on other tigers, so measuring activity of five tigers at
the same time would actually be okay.
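
For the cat comparison, a minimal R sketch of the "one value per litter" solution might look like the following; the weights and litter labels are made up for illustration. Averaging within litters before the t-test is one simple alternative to a full nested anova.

    # Hypothetical kitten weights (kg), two kittens from each of six litters
    cats <- data.frame(
      color  = rep(c("calico", "black"), each = 6),
      litter = factor(c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)),
      weight = c(3.1, 3.0, 3.6, 3.5, 2.9, 3.0, 3.4, 3.3, 3.8, 3.9, 3.2, 3.1)
    )

    # Wrong: treats all 12 kittens as independent observations
    t.test(weight ~ color, data = cats)

    # Better: collapse to one mean weight per litter, then compare litters
    litter_means <- aggregate(weight ~ litter + color, data = cats, FUN = mean)
    t.test(weight ~ color, data = litter_means)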
For regression and correlation analyses of data collected over a length
of time, there are statistical tests developed for time series.

Normality

Most tests for measurement variables assume that data are normally
distributed (fit a bell-shaped curve).

[Figure: Histogram of dry weights of the amphipod crustacean Platorchestia platensis.]

A probability distribution specifies the probability of getting an observation in a particular range of values; the normal distribution is the
familiar bell-shaped curve, with a high probability of getting an observation
near the middle and lower probabilities as you get further from the middle.
A normal distribution can be completely described by just two numbers, or
parameters, the mean and the standard deviation; all normal distributions
with the same mean and same standard deviation will be exactly the same
shape. One of the assumptions of an anova and other tests for
measurement variables is that the data fit the normal probability
distribution. Because these tests assume that the data can be described by
two parameters, the mean and standard deviation, they are called
parametric tests.
When you plot a frequency histogram of measurement data, the
frequencies should approximate the bell-shaped normal distribution. For
example, the figure shown above is a histogram of dry weights of newly
hatched amphipods (Platorchestia platensis). It fits the normal distribution
pretty well.
Many biological variables fit the normal distribution quite well. This is a
result of the central limit theorem, which says that when you take a large
number of random numbers, the means of those numbers are
approximately normally distributed. If you think of a variable like weight as
resulting from the effects of a bunch of other variables averaged
together—age, nutrition, disease exposure, the genotype of several genes,
etc.—it's not surprising that it would be normally distributed.
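
A quick way to see the central limit theorem in action is to simulate it. The R sketch below (using arbitrary simulated numbers) builds each "measurement" as the mean of many random inputs and plots the resulting histogram, which comes out roughly bell-shaped.

    # Each simulated "weight" is the mean of 20 uniform random numbers,
    # standing in for the many small factors (age, nutrition, genes, ...)
    # that combine to produce a real biological measurement.
    set.seed(42)
    simulated_weights <- replicate(5000, mean(runif(20)))
    hist(simulated_weights, breaks = 40,
         main = "Means of 20 random numbers are roughly normal")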
[Figure: Two non-normal histograms.]

Other data sets don't fit the normal distribution very well. The histogram
on the top is the level of sulphate in Maryland streams (data from
the Maryland Biological Stream Survey). It doesn't fit the normal curve very
well, because there are a small number of streams with very high levels of
sulphate. The histogram on the bottom is the number of egg masses laid by
individuals of the lentago host race of the
treehopper Enchenopa (unpublished data courtesy of Michael Cast). The
curve is bimodal, with one peak at around 14 egg masses and the other at
zero.
Parametric tests assume that your data fit the normal distribution. If your
measurement variable is not normally distributed, you may be increasing
your chance of a false positive result if you analyze the data with a test that
assumes normality.

What to do about non-normality


Once you have collected a set of measurement data, you should look at
the frequency histogram to see if it looks non-normal. There are statistical
tests of the goodness-of-fit of a data set to the normal distribution, but I
don't recommend them, because many data sets that are significantly non-
normal would be perfectly appropriate for an anova or other parametric
test. Fortunately, an anova is not very sensitive to moderate deviations
from normality; simulation studies, using a variety of non-normal
distributions, have shown that the false positive rate is not affected very
much by this violation of the assumption (Glass et al. 1972, Harwell et al.
1992, Lix et al. 1996). This is another result of the central limit theorem,
which says that when you take a large number of random samples from a
population, the means of those samples are approximately normally
distributed even when the population is not normal.
Because parametric tests are not very sensitive to deviations from
normality, I recommend that you don't worry about it unless your data
appear very, very non-normal to you. This is a subjective judgement on
your part, but there don't seem to be any objective rules on how much non-
normality is too much for a parametric test. You should look at what other
people in your field do; if everyone transforms the kind of data you're
collecting, or uses a non-parametric test, you should consider doing what
everyone else does even if the non-normality doesn't seem that bad to you.
If your histogram looks like a normal distribution that has been pushed to
one side, like the sulphate data above, you should try different data
transformations to see if any of them make the histogram look more
normal. It's best if you collect some data, check the normality, and decide
on a transformation before you run your actual experiment; you don't want
cynical people to think that you tried different transformations until you
found one that gave you a significant result for your experiment.
If your data still look severely non-normal no matter what transformation
you apply, it's probably still okay to analyze the data using a parametric
test; they're just not that sensitive to non-normality. However, you may want
to analyze your data using a non-parametric test. Just about every
parametric statistical test has a non-parametric substitute, such as
the Kruskal–Wallis test instead of a one-way anova, Wilcoxon signed-rank
test instead of a paired t–test, and Spearman rank correlation instead of
linear regression/correlation. These non-parametric tests do not assume
that the data fit the normal distribution. They do assume that the data in
different groups have the same distribution as each other, however; if
different groups have different shaped distributions (for example, one is
skewed to the left, another is skewed to the right), a non-parametric test will
not be any better than a parametric one.
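
As a practical sketch of the workflow described above, the R lines below (using simulated, hypothetical right-skewed values in place of real measurements) plot the raw histogram next to a log-transformed one, so you can judge by eye whether the transformation helps.

    # Hypothetical right-skewed data, similar in shape to the sulphate example
    set.seed(1)
    sulphate <- rlnorm(50, meanlog = 2, sdlog = 0.6)

    par(mfrow = c(1, 2))               # two plots side by side
    hist(sulphate, main = "Raw data")
    hist(log(sulphate), main = "Log-transformed")
    par(mfrow = c(1, 1))               # reset the plotting layout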

Skewness and kurtosis


[Figure: Graphs illustrating skewness and kurtosis.]

A histogram with a long tail on the right side, such as the sulphate data
above, is said to be skewed to the right; a histogram with a long tail on the
left side is said to be skewed to the left. There is a statistic that describes skewness, g1, but there is no rule of thumb saying that you shouldn't do a parametric test once g1 exceeds some cutoff value.
Another way in which data can deviate from the normal distribution is
kurtosis. A histogram that has a high peak in the middle and long tails on
either side is leptokurtic; a histogram with a broad, flat middle and short
tails is platykurtic. The statistic to describe kurtosis is g2.
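
For readers who want numbers to go with the pictures, here is a small R sketch that computes simple moment-based estimates of skewness and excess kurtosis; statistical packages often apply small-sample corrections to g1 and g2, so their values may differ slightly from these.

    # Moment-based skewness and excess kurtosis
    g1 <- function(x) {
      m <- mean(x); n <- length(x)
      (sum((x - m)^3) / n) / (sum((x - m)^2) / n)^(3 / 2)
    }
    g2 <- function(x) {
      m <- mean(x); n <- length(x)
      (sum((x - m)^4) / n) / (sum((x - m)^2) / n)^2 - 3
    }

    x <- rlnorm(200)   # an example right-skewed variable
    g1(x)              # positive: skewed to the right
    g2(x)              # positive: more peaked and long-tailed than a normal curve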

One-Way and Two-Way ANOVA

A key statistical test in research fields including biology, economics and psychology, Analysis of Variance (ANOVA) is very useful for analyzing datasets. It allows comparisons to be made between three or more groups of data. There are two types of ANOVA that are commonly used, the One-Way ANOVA and the Two-Way ANOVA. This section explores both types and summarizes the key differences between them, including the assumptions and hypotheses that apply to each.

What is a One-Way ANOVA?


A one-way ANOVA is a type of statistical test that compares the
variance in the group means within a sample whilst considering only one
independent variable or factor. It is a hypothesis-based test, meaning that it
aims to evaluate multiple mutually exclusive theories about our data.
Before we can generate a hypothesis, we need to have a question about
our data that we want an answer to. For example, adventurous researchers
studying a population of walruses might ask “Do our walruses weigh more
in early or late mating season?” Here, the independent variable or factor
(the two terms mean the same thing) is “month of mating season”. In an
ANOVA, our independent variables are organized in categorical groups.
For example, if the researchers looked at walrus weight in December,
January, February and March, there would be four months analyzed, and
therefore four groups to the analysis.
A one-way ANOVA compares three or more than three categorical
groups to establish whether there is a difference between them. Within
each group there should be three or more observations (here, this means
walruses), and the means of the samples are compared.

What are the hypotheses of a One-Way ANOVA?


In a one-way ANOVA there are two possible hypotheses.

 The null hypothesis (H0) is that there is no difference between the groups and the group means are equal. (Walruses weigh the same in different months)
 The alternative hypothesis (H1) is that there is a difference between the group means. (Walruses have different weights in different months)

What are the assumptions of a One-Way ANOVA?

 Normality – That each sample is taken from a normally distributed population.
 Sample independence – that each sample has been drawn
independently of the other samples.
 Variance Equality – That the variance of data in the different groups
should be the same.
 Your dependent variable – here, “weight”, should be continuous –
that is, measured on a scale which can be subdivided using
increments (i.e. grams, milligrams)
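
Under these assumptions, a one-way ANOVA is straightforward to run. The following is a minimal R sketch using made-up, hypothetical walrus weights for four months; aov() fits the model and summary() reports the F-test.

    # Hypothetical walrus weights (kg) in four months of the mating season
    walrus <- data.frame(
      month  = factor(rep(c("Dec", "Jan", "Feb", "Mar"), each = 5)),
      weight = c(880, 905, 870, 910, 895,
                 920, 940, 915, 930, 925,
                 960, 945, 970, 955, 950,
                 935, 925, 940, 950, 945)
    )

    fit <- aov(weight ~ month, data = walrus)
    summary(fit)    # F-test of the null hypothesis that all month means are equal
    TukeyHSD(fit)   # if significant, shows which pairs of months differ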

What is a Two-Way ANOVA?

A two-way ANOVA is, like a one-way ANOVA, a hypothesis-based test. However, in the two-way ANOVA each sample is defined in two ways, and as a result is put into two categorical groups. Thinking again of our
walruses, researchers might use a two-way ANOVA if their question is:
“Are walruses heavier in early or late mating season and does that depend
on the gender of the walrus?” In this example, both “month in mating
season” and “gender of walrus” are factors – meaning in total, there are two
factors. Once again, each factor’s number of groups must be considered –
for “gender” there will be only two groups, “male” and “female”.
The two-way ANOVA therefore examines the effect of two factors (month
and gender) on a dependent variable – in this case weight, and also
examines whether the two factors affect each other to influence the
continuous variable.

What are the assumptions of a Two-Way ANOVA?

 Your dependent variable – here, “weight”, should be continuous – that is, measured on a scale which can be subdivided using increments (i.e. grams, milligrams)
 Your two independent variables – here, “month” and “gender”, should
be in categorical, independent groups.
 Sample independence – that each sample has been drawn
independently of the other samples.
 Variance Equality – That the variance of data in the different groups
should be the same.
 Normality – That each sample is taken from a normally distributed
population.

What are the hypotheses of a Two-Way ANOVA?

Because the two-way ANOVA considers the effect of two categorical factors, and the effect of the categorical factors on each other, there are
three pairs of null or alternative hypotheses for the two-way ANOVA. Here,
we present them for our walrus experiment, where month of mating season
and gender are the two independent variables.

 H0: The means of all month groups are equal.


 H1: The mean of at least one month group is different.

 H0: The means of the gender groups are equal.


 H1: The means of the gender groups are different.

 H0: There is no interaction between the month and gender.


 H1: There is interaction between the month and gender.
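
A minimal R sketch of this design, using made-up, hypothetical weights and treating month (early or late) and gender as the two factors, tests all three pairs of hypotheses at once; the month:gender term produced by the * operator is the interaction.

    # Hypothetical walrus weights (kg) by season (early/late) and gender
    walrus2 <- data.frame(
      month  = factor(rep(c("early", "late"), each = 10)),
      gender = factor(rep(rep(c("male", "female"), each = 5), times = 2)),
      weight = c(1250, 1230, 1280, 1260, 1240,  830, 820, 845, 835, 825,
                 1300, 1320, 1290, 1310, 1305,  850, 840, 860, 855, 845)
    )

    # month * gender expands to month + gender + month:gender (the interaction)
    summary(aov(weight ~ month * gender, data = walrus2))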

Summary: Differences Between One-Way and Two-Way ANOVA

The key differences between one-way and two-way ANOVA are summarized clearly below.

1. A one-way ANOVA is primarily designed to enable the equality testing between three or more means. A two-way ANOVA is designed to assess the interrelationship of two independent variables on a dependent variable.

2. A one-way ANOVA only involves one factor or independent variable, whereas there are two independent variables in a two-way ANOVA.

3. In a one-way ANOVA, the one factor or independent variable analyzed has three or more categorical groups. A two-way ANOVA instead compares multiple groups of two factors.
4. A one-way ANOVA needs to satisfy only two principles of the design of experiments, i.e. replication and randomization, whereas a two-way ANOVA meets all three principles of the design of experiments: replication, randomization, and local control.

One-Way vs Two-Way ANOVA Differences Chart

Definition
  One-Way ANOVA: A test that allows one to make comparisons between the means of three or more groups of data.
  Two-Way ANOVA: A test that allows one to make comparisons between the means of three or more groups of data, where two independent variables are considered.

Number of Independent Variables
  One-Way ANOVA: One.
  Two-Way ANOVA: Two.

What is Being Compared?
  One-Way ANOVA: The means of three or more groups of an independent variable on a dependent variable.
  Two-Way ANOVA: The effect of multiple groups of two independent variables on a dependent variable and on each other.

Number of Groups of Samples
  One-Way ANOVA: Three or more.
  Two-Way ANOVA: Each variable should have multiple samples.

Use one-way anova when you have one nominal variable and one
measurement variable; the nominal variable divides the measurements into
two or more groups. It tests whether the means of the measurement
variable are the same for the different groups.
Use two-way anova when you have one measurement variable and two
nominal variables, and each value of one nominal variable is found in
combination with each value of the other nominal variable. It tests three null
hypotheses: that the means of the measurement variable are equal for
different values of the first nominal variable; that the means are equal for
different values of the second nominal variable; and that there is no
interaction (the effects of one nominal variable don't depend on the value of
the other nominal variable).
Salvatore Mangiafico's R Companion has a sample R program for one-
way and two-way anova.

Correlation and linear regression

The statistical tools used for hypothesis testing, describing the closeness of the association, and drawing a line through the points, are
correlation and linear regression. Use linear regression or correlation when
you want to know whether one measurement variable is associated with
another measurement variable; you want to measure the strength of the
association (r2); or you want an equation that describes the relationship and
can be used to predict unknown values.
Independent vs. dependent variables

When you are testing a cause-and-effect relationship, the variable that causes the relationship is called the independent variable and you plot it on the X axis, while the effect is called the dependent variable and you plot it on the Y axis. In some experiments you set the independent variable to values that you have chosen; for example, if you're interested in the effect of temperature on calling rate of frogs, you might put frogs in temperature chambers set to 10°C, 15°C, 20°C, etc. In other cases, both variables exhibit natural variation, but any cause-and-effect relationship would be in one way; if you measure the air temperature and frog calling rate at a pond on several different nights, both the air temperature and the calling rate would display natural variation, but if there's a cause-and-effect relationship, it's temperature affecting calling rate; the rate at which frogs call does not affect the air temperature.

Introduction to correlation and linear regression

One of the most common graphs in science plots one measurement variable on the x (horizontal) axis vs. another on the y (vertical) axis. For example, for the graph described here I dusted off the elliptical machine in our basement and measured my pulse after one minute of ellipticizing at various speeds; the data are shown below.

Speed (kph)   Pulse (bpm)
0             57
1.6           69
3.1           78
4             80
5             85
6             87
6.9           90
7.7           92
8.7           97
12.4          108
15.3          119
There are three things you can do with this kind of data. One is a
hypothesis test, to see if there is an association between the two variables;
in other words, as the X variable goes up, does the Y variable tend to
change (up or down). For the exercise data, you'd want to know whether
pulse rate was significantly higher with higher speeds. The P value is
1.3×10⁻⁸, but the relationship is so obvious from the graph, and so
biologically unsurprising (of course my pulse rate goes up when I exercise
harder!), that the hypothesis test wouldn't be a very interesting part of the
analysis.
The second goal is to describe how tightly the two variables are
associated. This is usually expressed with r, which ranges from −1 to 1,
or r2, which ranges from 0 to 1. For the exercise data, there's a very tight
relationship; this means that if you knew my speed on the elliptical
machine, you'd be able to predict my pulse quite accurately.
The final goal is to determine the equation of a line that goes through
the cloud of points. The equation of a line is given in the form Ŷ=a+bX,
where Ŷ is the value of Y predicted for a given value of X, a is
the Y intercept (the value of Y when X is zero), and b is the slope of the line
(the change in Ŷ for a change in X of one unit). For the exercise data, the
equation is Ŷ=63.5+3.75X; this predicts that my pulse would be 63.5 when
the speed of the elliptical machine is 0 kph, and my pulse would go up by
3.75 beats per minute for every 1 kph increase in speed. This is probably
the most useful part of the analysis for the exercise data; if I wanted to
exercise with a particular level of effort, as measured by pulse rate, I could
use the equation to predict the speed I should use.
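
Since the speed and pulse values appear in the table above, all three goals can be illustrated with a few lines of R; this is a sketch only, and lm() here is ordinary least-squares regression.

    # Speed (kph) and pulse (bpm) from the elliptical-machine table above
    speed <- c(0, 1.6, 3.1, 4, 5, 6, 6.9, 7.7, 8.7, 12.4, 15.3)
    pulse <- c(57, 69, 78, 80, 85, 87, 90, 92, 97, 108, 119)

    fit <- lm(pulse ~ speed)
    summary(fit)                            # intercept, slope, r-squared, P value
    predict(fit, data.frame(speed = 10))    # predicted pulse at 10 kph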
When to use them
Use correlation/linear regression when you have two measurement
variables, such as food intake and weight, drug dosage and blood
pressure, air temperature and metabolic rate, etc.
There's also one nominal variable that keeps the two measurements
together in pairs, such as the name of an individual organism, experimental
trial, or location. I'm not aware that anyone else considers this nominal
variable to be part of correlation and regression, and it's not something you
need to know the value of—you could indicate that a food intake
measurement and weight measurement came from the same rat by putting
both numbers on the same line, without ever giving the rat a name. For that
reason, I'll call it a "hidden" nominal variable.
The main value of the hidden nominal variable is that it lets me make
the blanket statement that any time you have two or more measurements
from a single individual (organism, experimental trial, location, etc.), the
identity of that individual is a nominal variable; if you only have one
measurement from an individual, the individual is not a nominal variable.

There are three main goals for correlation and regression in biology.
One is to see whether two measurement variables are associated with
each other; whether as one variable increases, the other tends to increase
(or decrease). You summarize this test of association with the P value. In
some cases, this addresses a biological question about cause-and-effect
relationships; a significant association means that different values of the
independent variable cause different values of the dependent. An example
would be giving people different amounts of a drug and measuring their
blood pressure. The null hypothesis would be that there was no relationship
between the amount of drug and the blood pressure. If you reject the null
hypothesis, you would conclude that the amount of drug causes the
changes in blood pressure. In this kind of experiment, you determine the
values of the independent variable; for example, you decide what dose of
the drug each person gets. The exercise and pulse data are an example of
this, as I determined the speed on the elliptical machine, then measured
the effect on pulse rate.

In other cases, you want to know whether two variables are


associated, without necessarily inferring a cause-and-effect relationship. In
this case, you don't determine either variable ahead of time; both are
naturally variable and you measure both of them. If you find an association,
you infer that variation in X may cause variation in Y, or variation in Y may
cause variation in X, or variation in some other factor may affect
both Y and X. An example would be measuring the amount of a particular
protein on the surface of some cells and the pH of the cytoplasm of those
cells. If the protein amount and pH are correlated, it may be that the
amount of protein affects the internal pH; or the internal pH affects the
amount of protein; or some other factor, such as oxygen concentration,
affects both protein concentration and pH. Often, a significant correlation
suggests further experiments to test for a cause and effect relationship; if
protein concentration and pH were correlated, you might want to
manipulate protein concentration and see what happens to pH, or
manipulate pH and measure protein, or manipulate oxygen and see what
happens to both.
The second goal of correlation and regression is estimating the
strength of the relationship between two variables; in other words, how
close the points on the graph are to the regression line. You summarize
this with the r2 value. For example, let's say you've measured air
temperature (ranging from 15 to 30°C) and running speed in the
lizard Agama savignyi, and you find a significant relationship: warmer
lizards run faster. You would also want to know whether there's a tight
relationship (high r2), which would tell you that air temperature is the main
factor affecting running speed; if the r2 is low, it would tell you that other
factors besides air temperature are also important, and you might want to
do more experiments to look for them. You might also want to know how
the r2 for Agama savignyi compared to that for other lizard species, or
for Agama savignyi under different conditions.

The third goal of correlation and regression is finding the equation of


a line that fits the cloud of points. You can then use this equation for
prediction. For example, if you have given volunteers diets with 500 to 2500
mg of salt per day, and then measured their blood pressure, you could use
the regression line to estimate how much a person's blood pressure would
go down if they ate 500 mg less salt per day.

Correlation versus linear regression

The main difference between correlation and regression is that in correlation, you sample both measurement variables randomly from a
population, while in regression you choose the values of the independent
(X) variable. For example, let's say you're a forensic anthropologist,
interested in the relationship between foot length and body height in
humans. If you find a severed foot at a crime scene, you'd like to be able to
estimate the height of the person it was severed from. You measure the
foot length and body height of a random sample of humans, get a
significant P value, and calculate r2 to be 0.72. This is a correlation,
because you took measurements of both variables on a random sample of
people.

As an example of regression, let's say you're interested in the effect of air temperature on running speed in lizards. You put some lizards in a
temperature chamber set to 10°C, chase them, and record how fast they
run. You do the same for 10 different temperatures, ranging up to 30°C.
This is a regression, because you decided which temperatures to use.

If you are mainly interested in using the P value for hypothesis testing, to see whether there is a relationship between the two variables, it
doesn't matter whether you call the statistical test a regression or
correlation. If you are interested in comparing the strength of the
relationship (r2) to the strength of other relationships, you are doing a
correlation and should design your experiment so that you
measure X and Y on a random sample of individuals. If you determine
the X values before you do the experiment, you are doing a regression and
shouldn't interpret the r2 as an estimate of something general about the
population you've observed.

How to do the test


Salvatore Mangiafico's R Companion has a sample R program for
correlation and linear regression.

Spearman rank correlation

Use Spearman rank correlation to test the association between two ranked variables, or one ranked variable and one measurement variable.
You can also use Spearman rank correlation instead of linear
regression/correlation for two measurement variables if you're worried
about non-normality, but this is not usually necessary.

Use Spearman rank correlation when you have two ranked variables,
and you want to see whether the two variables covary; whether, as one
variable increases, the other variable tends to increase or decrease. You
also use Spearman rank correlation if you have one measurement
variable and one ranked variable; in this case, you convert the
measurement variable to ranks and use Spearman rank correlation on the
two sets of ranks.

For example, Melfi and Poyser (2007) observed the behavior of 6 male
colobus monkeys (Colobus guereza) in a zoo. By seeing which monkeys
pushed other monkeys out of their way, they were able to rank the
monkeys in a dominance hierarchy, from most dominant to least dominant.
This is a ranked variable; while the researchers know that Erroll is
dominant over Milo because Erroll pushes Milo out of his way, and Milo is
dominant over Fraiser, they don't know whether the difference in
dominance between Erroll and Milo is larger or smaller than the difference
in dominance between Milo and Fraiser. After determining the dominance
rankings, Melfi and Poyser (2007) counted eggs of Trichuris nematodes per
gram of monkey feces, a measurement variable. They wanted to know
whether social dominance was associated with the number of nematode
eggs, so they converted eggs per gram of feces to ranks and used
Spearman rank correlation. For the Colobus monkey example, Spearman's
ρ is 0.943, and the P value from the table is less than 0.025, so the
association between social dominance and nematode eggs is significant.

Monkey name   Dominance rank   Eggs per gram   Eggs per gram (rank)
Erroll        1                5777            1
Milo          2                4225            2
Fraiser       3                2674            3
Fergus        4                1249            4
Kabul         5                749             6
Hope          6                870             5

Some people use Spearman rank correlation as a non-parametric alternative to linear regression and correlation when they have two measurement variables and one or both of them may not be normally distributed; this requires converting both measurements to ranks. Linear regression and correlation assume that the data are normally distributed, while Spearman rank correlation does not make this assumption, so people think that Spearman correlation is better.

It's not incorrect to use Spearman rank correlation for two measurement variables, but linear regression and correlation are much
more commonly used and are familiar to more people, so I recommend
using linear regression and correlation any time you have two
measurement variables, even if they look non-normal.

Null hypothesis
The null hypothesis is that the Spearman correlation coefficient, ρ
("rho"), is 0. A ρ of 0 means that the ranks of one variable do not covary
with the ranks of the other variable; in other words, as the ranks of one
variable increase, the ranks of the other variable do not increase (or
decrease).

Assumption
When you use Spearman rank correlation on one or two measurement
variables converted to ranks, it does not assume that the measurements
are normal or homoscedastic. It also doesn't assume the relationship is
linear; you can use Spearman rank correlation even if the association
between the variables is curved, as long as the underlying relationship is
monotonic (as X gets larger, Y keeps getting larger, or keeps getting
smaller). If you have a non-monotonic relationship (as X gets larger, Y gets
larger and then gets smaller, or Y gets smaller and then gets larger, or
something more complicated), you shouldn't use Spearman rank
correlation.

Like linear regression and correlation, Spearman rank correlation assumes that the observations are independent.

How the test works

Spearman rank correlation calculates the P value the same way as linear regression and correlation, except that you do it on ranks, not
measurements. To convert a measurement variable to ranks, make the
largest value 1, second largest 2, etc. Use the average ranks for ties; for
example, if two observations are tied for the second-highest rank, give
them a rank of 2.5 (the average of 2 and 3).
If you have 10 or fewer observations, the P value calculated from the t-
distribution is somewhat inaccurate. In that case, you should look up
the P value in a table of Spearman t-statistics for your sample size.
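
For the colobus monkey data in the table above, a minimal R sketch might look like this; eggs per gram are re-ranked so that the largest count gets rank 1, matching the convention used in the table.

    # Dominance rank and nematode eggs per gram for the six monkeys above
    dominance <- c(1, 2, 3, 4, 5, 6)
    eggs      <- c(5777, 4225, 2674, 1249, 749, 870)

    # rank(-eggs) gives the largest count rank 1, as in the table
    eggs_rank <- rank(-eggs)
    cor.test(dominance, eggs_rank, method = "spearman")   # rho = 0.94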

Example

[Figure: Magnificent frigatebird, Fregata magnificens.]

Males of the magnificent frigatebird (Fregata magnificens) have a large red throat pouch. They visually display this pouch and use it to make a drumming sound when seeking mates. Madsen et al. (2004) wanted to know whether females, who presumably choose mates based on their pouch size, could use the pitch of the drumming sound as an indicator of pouch size. The authors estimated the volume of the pouch and the fundamental frequency of the drumming sound in 18 males:

Volume (cm3)   Frequency (Hz)
1760           529
2040           566
2440           473
2550           461
2730           465
2740           532
3010           484
3080           527
3370           488
3740           485
4910           478
5090           434
5090           468
5380           449
5850           425
6730           389
6990           421
7960           416
There are two measurement variables, pouch size and pitch. The
authors analyzed the data using Spearman rank correlation, which converts
the measurement variables to ranks, and the relationship between the
variables is significant (Spearman's rho=-0.76, 16 d.f., P=0.0002).
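
As a check, this analysis can be reproduced from the table above with one R call; R will warn that an exact P value can't be computed because two males share the same volume (a tie), but the reported rho should agree with the value quoted here.

    # Pouch volume (cm3) and drumming frequency (Hz) for the 18 males above
    volume <- c(1760, 2040, 2440, 2550, 2730, 2740, 3010, 3080, 3370, 3740,
                4910, 5090, 5090, 5380, 5850, 6730, 6990, 7960)
    freq   <- c(529, 566, 473, 461, 465, 532, 484, 527, 488, 485,
                478, 434, 468, 449, 425, 389, 421, 416)

    cor.test(volume, freq, method = "spearman")   # rho is about -0.76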

Salvatore Mangiafico's R Companion has a sample R program for Spearman rank correlation.
