Learning Module
in
STATISTICAL BIOLOGY
Consolidated by:
Ms. Evangeline Joyce D. Jungay
Introduction
Students may take ownership of their learning by studying this material at their
own convenient time outside of the class schedule, as long as they complete all
of the learning tasks assigned during the term.
Objectives
Lesson Proper
Learning Tasks
References
Intended Learning Outcomes
Learning Objectives:
What is statistics?
• How confident can I be that the sample of individuals I have studied was
like the group as a whole?
1. Specify the biological question you are asking. (Ex. "Do the amino
acid polymorphisms at the Pgm locus have an effect on glycogen
content?" The biological question is usually something about
biological processes, often in the form "Does changing X cause a
change in Y?" You might want to know whether a drug changes blood
pressure; whether soil pH affects the growth of blueberry bushes).
2. Put the question in the form of a biological null hypothesis and
alternate hypothesis. (The biological null hypothesis is "Different
amino acid sequences do not affect the biochemical properties of
PGM, so glycogen content is not affected by PGM sequence." The
biological alternative hypothesis is "Different amino acid sequences
do affect the biochemical properties of PGM, so glycogen content is
affected by PGM sequence." By thinking about the biological null and
alternative hypotheses, you are making sure that your experiment will
give different results for different answers to your biological question).
3. Put the question in the form of a statistical null hypothesis and
alternate hypothesis. (The statistical null hypothesis is "Flies with
different sequences of the PGM enzyme have the same average
glycogen content." The alternate hypothesis is "Flies with different
sequences of PGM have different average glycogen contents." While
the biological null and alternative hypotheses are about biological
processes, the statistical null and alternative hypotheses are all about
the numbers; in this case, the glycogen contents are either the same
or different).
4. Determine which variables are relevant to the question.
5. Determine what kind of variable each one is.
6. Design an experiment that controls or randomizes the confounding
variables.
7. Based on the number of variables, the kinds of variables, the
expected fit to the parametric assumptions, and the hypothesis to be
tested, choose the best statistical test to use.
8. Do the experiment.
9. Examine the data to see if it meets the assumptions of the statistical
test you chose. If it doesn't, choose a more appropriate test.
10. Apply the statistical test you chose and interpret the results.
One important point for you to remember: "do the experiment" is step 8,
not step 1. You should do a lot of thinking, planning, and decision-
making before you do an experiment. If you do this, you'll have an
experiment that is easy to understand, easy to analyze and interpret,
answers the questions you're trying to answer, and is neither too big nor
too small. If you just slap together an experiment without thinking about
how you're going to do the statistics, you may end up needing more
complicated and obscure statistical tests, getting results that are difficult to
interpret and explain to others, and maybe using too many subjects (thus
wasting your resources) or too few subjects (thus wasting the whole
experiment).
1. What could result when one does not follow the steps in analyzing
data?
2. What are the different kinds of biological variables?
3. What is the significance of using probability in analyzing data?
Learning Objectives:
Very few environmental or biological studies are done solely for the
interest of the researcher; they are carried out to inform others. Therefore,
it is not just important that our results convince us, it is vital that they
convince others too, otherwise we have wasted our time. Following
accepted statistical procedures for design and analysis of experiments will
help us to achieve this.
Variability
Think of a group you might want to study, e.g. the lengths of fish in a large
lake. If all of these fish were the same length, you would only need to
measure one. You can probably accept that they are not all the same
length, just as people are not all the same height, not all volcanic lava flows
are the same temperature, and not all carrots have the same sugar
content. In fact, most characteristics we might want to study vary between
individuals.
In the first example, the population is real but we are unlikely to be able to
study all of the whales in practice. Populations in the statistical sense,
however, need not be finite, or even exist in real life. In the second
example, the light intensity could be measured at any moment, but the
number of moments is infinite, so we could never obtain measurements at
every moment. In the third example, the population is just conceptual. We
really want to know about how rice plants of this variety in general would
grow under these conditions but we would have to infer this by growing a
limited number of rice plants under the specified conditions. Although the
few plants in our sample may be the only rice plants ever to be grown in
these conditions, we still consider them to be a sample representing rice
plants of this variety in general growing in these conditions.
One way to characterize how spread out the values in a sample are would
be to calculate the difference between each measurement and the sample
mean, and then to calculate the mean of these differences. Here's an
example:
When we take a random sample it may or may not include the largest and
smallest values in the population, yet these would both contribute the
largest squares of differences from the mean. Since they are not present in
all samples, on average the mean of the squares of differences is less
when it is calculated from a sample than if it was calculated for the
population as a whole.
This figure - the corrected mean of the squared differences - is called the
variance and is an unbiased estimate of the spread of values in the
population, calculated from a sample. Variance has units, e.g. if the
measurements had been in grams (g), the variance would be in units of
square grams (g2). The square root of the variance is called the standard
deviation. The standard deviation in the above example is √4.0 = 2.0, i.e.
standard deviation is an alternative measure of the spread of values.
Standard deviation has the same units as the actual measurements, e.g. if
the measurements had been in grams, the standard deviation would also
be in grams.
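The original worked example is not reproduced above, but the idea can be sketched in R with made-up numbers: the uncorrected mean of squared differences divides by n, while the sample variance divides by n − 1.

x <- c(3, 5, 8, 9, 10)               # hypothetical sample
devs <- x - mean(x)                  # differences from the sample mean
mean(devs^2)                         # mean of the squared differences (divides by n)
sum(devs^2) / (length(x) - 1)        # corrected version: the sample variance, identical to var(x)
sqrt(var(x))                         # standard deviation, in the same units as the data (same as sd(x))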
Hypothesis Testing
For example, if we were to test the hypothesis that college freshmen study
20 hours per week, we would express our null hypothesis as:
H0 : µ = 20
Ha : µ ≠ 20
Example A
Solution
H0 : µ = 14
Ha : µ ≠14
Our null hypothesis states that the population has a mean equal to 14
milligrams. Our alternative hypothesis states that the population has a
mean that is different than 14 milligrams.
In a two-tailed test, you will reject the null hypothesis if your sample mean
falls in either tail of the distribution. For this reason, the alpha level (let’s
assume .05) is split across the two tails. The curve below shows the critical
regions for a two-tailed test. These are the regions under the normal curve
that, together, sum to a probability of 0.05. Each tail has a probability of
0.025. The z-scores that designate the start of the critical region are called
the critical values.
If the sample mean taken from the population falls within these critical
regions, or "rejection regions," we would conclude that there was too much
of a difference and we would reject the null hypothesis. However, if the
mean from the sample falls in the middle of the distribution (in between the
critical regions) we would fail to reject the null hypothesis.
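The critical values for a two-tailed test at α = 0.05 can be checked in R with qnorm(); a minimal sketch:

alpha <- 0.05
qnorm(c(alpha/2, 1 - alpha/2))   # critical z-values, approximately -1.96 and +1.96
# A sample mean whose z-score falls beyond either value lands in a rejection region.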
One-Tailed Hypothesis Test
We would use a single-tail hypothesis test when the direction of the results
is anticipated or we are only interested in one direction of the results. For
example, a single-tail hypothesis test may be used when evaluating
whether or not to adopt a new textbook. We would only decide to adopt the
textbook if it improved student achievement relative to the old textbook.
H0 : µ ≤ 1100
Ha : µ > 1100
In this scenario, our null hypothesis states that the mean SAT scores would
be less than or equal to 1,100 while the alternate hypothesis states that the
SAT scores would be greater than 1,100. A single-tail hypothesis test also
means that we have only one critical region because we put the entire
critical region into just one side of the distribution. When the alternative
hypothesis is that the sample mean is greater, the critical region is on the
right side of the distribution (see below). When the alternative hypothesis is
that the sample is smaller, the critical region is on the left side of the
distribution.
Remember that there will be some sample means that are extremes – that
is going to happen about 5% of the time, since 95% of all sample means
fall within about two standard deviations of the mean. What happens if we
run a hypothesis test and we get an extreme sample mean? It won’t look
like our hypothesized mean, even if it comes from that distribution. We
would be likely to reject the null hypothesis. But we would be wrong.
When we decide to reject or not reject the null hypothesis, we have four
possible scenarios:
You should be able to recognize what each type of error looks like in a
particular hypothesis test. For example, suppose you are testing whether
listening to rock music helps you improve your memory of 30 random
objects. Assume further that it doesn’t. A Type I error would be concluding
that listening to rock music did help memory (but you are wrong). A Type I
error will only occur when your null hypothesis is true. Let’s assume that
listening to rock music does improve memory. In this scenario, if you
concluded that it didn’t, you would be wrong again. But this time you would
be making a Type II error — failing to find a significant difference when one
in fact exists.
It is also important that you realize that the chance of making a Type I error
is under our direct control. Often we establish the alpha level based on the
severity of the consequences of making a Type I error. If the consequences
are not that serious, we could set an alpha level at 0.10 or 0.20. In other
words, we are comfortable making a decision where we could falsely reject
the null hypothesis 10 to 20% of the time. However, in a field like medical
research, we would set the alpha level very low (at 0.001 for example) if
there was potential bodily harm to patients.
Confounding Variables
Another example is the relationship between the force applied to a ball and
the distance the ball travels. The natural prediction would be that the ball
given the most force would travel furthest. However, if the confounding
variable is a downward slanted piece of wood to help propel the ball, the
results would be dramatically different. The slanted wood is the
confounding variable that changes the outcome of the experiment.
Typical data format and the types of EDA
The data from an experiment are
generally collected into a rectangular array (e.g., spreadsheet or database),
most commonly with one row per experimental subject and one column for
each subject identifier, outcome variable, and explanatory variable. Each
column contains the numeric values for a particular quantitative variable or
the levels for a categorical variable. (Some more complicated experiments
require a more complex data layout.)
People are not very good at looking at a column of numbers or a
whole spreadsheet and then determining important characteristics of the
data. They find looking at numbers to be tedious, boring, and/or
overwhelming. Exploratory data analysis techniques have been devised as
an aid in this situation. Most of these techniques work in part by hiding
certain aspects of the data while making other aspects more clear.
Exploratory data analysis is generally cross-classified in two ways.
First, each method is either non-graphical or graphical. And second, each
method is either univariate or multivariate (usually just bivariate).
Non-graphical methods generally involve calculation of summary
statistics, while graphical methods obviously summarize the data in a
diagrammatic or pictorial way. Univariate methods look at one variable
(data column) at a time, while multivariate methods look at two or more
variables at a time to explore relationships. Usually our multivariate EDA
will be bivariate (looking at exactly two variables), but occasionally it will
involve three or more variables. It is almost always a good idea to perform
univariate EDA on each of the components of a multivariate EDA before
performing the multivariate EDA.
Beyond the four categories created by the above cross-classification,
each of the categories of EDA have further divisions based on the role
(outcome or explanatory) and type (categorical or quantitative) of the
variable(s) being examined.
The four types of EDA are univariate non-graphical, multivariate
nongraphical, univariate graphical, and multivariate graphical.
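As a minimal illustration in R (the numbers below are invented), univariate EDA of one quantitative column can be as simple as:

x <- c(12, 15, 11, 19, 14, 22, 13, 16)   # one data column
summary(x)    # non-graphical: minimum, quartiles, median, mean, maximum
hist(x)       # graphical: frequency distribution of the single variable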
Categorical data
The characteristics of interest for a categorical variable are simply the
range of values and the frequency (or relative frequency) of occurrence for
each value. (For ordinal variables it is sometimes appropriate to treat them
as quantitative variables using the techniques in the second part of this
section.) Therefore the only useful univariate non-graphical technique for
categorical variables is some form of tabulation of the frequencies, usually
along with calculation of the fraction (or percent) of data that falls in each
category.
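For example, in R the tabulation and the fractions can be produced with table() and prop.table(); the flower colors below are made up purely for illustration:

color <- c("purple", "red", "blue", "blue", "white", "red", "blue", "purple")
table(color)               # frequency of each category
prop.table(table(color))   # fraction of the data falling in each category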
Central tendency
The central tendency or “location” of a distribution has to do with
typical or middle values. The common, useful measures of central tendency
are the statistics called (arithmetic) mean, median, and sometimes mode.
The arithmetic mean is simply the sum of all of the data values
divided by the number of values. It can be thought of as how much each
subject gets in a “fair” re-division of whatever the data are measuring. For
instance, the mean amount of money that a group of people have is the
amount each would get if all of the money were put in one “pot”, and then
the money was redistributed to all people evenly. For any symmetrically
shaped distribution (i.e., one with a symmetric histogram or pdf or pmf) the
mean is the point around which the symmetry holds. For non-symmetric
distributions, the mean is the “balance point”.
The median is another measure of central tendency. The sample
median is the middle value after all of the values are put in an ordered list.
If there are an even number of values, take the average of the two middle
values. (If there are ties at the middle, some special adjustments are made
by the statistical software we will use. In unusual situations for discrete
random variables, there may not be a unique median.)
For symmetric distributions, the mean and the median coincide. For
unimodal skewed (asymmetric) distributions, the mean is farther in the
direction of the “pulled out tail” of the distribution than the median is.
Therefore, for many cases of skewed distributions, the median is preferred
as a measure of central tendency. For example, according to the US
Census Bureau 2004 Economic Survey, the median income of US families,
which represents the income above and below which half of families fall,
was $43,318. This seems a better measure of central tendency than the
mean of $60,828, which indicates how much each family would have if we
all shared equally. And the difference between these two numbers is quite
substantial. Nevertheless, both numbers are “correct”, as long as you
understand their meanings.
The median has a very special property called robustness. A sample
statistic is “robust” if moving some data tends not to change the value of
the statistic. The median is highly robust, because you can move nearly all
of the upper half and/or lower half of the data values any distance away
from the median without changing the median. More practically, a few very
high values or very low values usually have no effect on the median.
A rarely used measure of central tendency is the mode, which is the
most likely or frequently occurring value. More commonly we simply use
the term “mode” when describing whether a distribution has a single peak
(unimodal) or two or more peaks (bimodal or multi-modal). In symmetric,
unimodal distributions, the mode equals both the mean and the median. In
unimodal, skewed distributions the mode is on the other side of the median
from the mean. In multi-modal distributions there is either no unique highest
mode, or the highest mode may well be unrepresentative of the central
tendency.
Spread
Several statistics are commonly used as a measure of the spread of
a distribution, including variance, standard deviation, and interquartile
range. Spread is an indicator of how far away from the center we are still
likely to find data values.
The variance is a standard measure of spread. It is calculated for a
list of numbers, e.g., the n observations of a particular measurement
labeled x1 through xn, based on the n sample deviations (or just
“deviations”). The variance of a population is defined as the mean squared
deviation. The sample formula for the variance of observed data
conventionally has n−1 in the denominator instead of n to achieve the
property of “unbiasedness”, which roughly means that when calculated for
many different random samples from the same population, the average
should match the corresponding population quantity. The most commonly
used symbol for sample variance is s2, which is essentially the average of
the squared deviations, except for dividing by n − 1 instead of n. This is a
measure of spread, because the bigger the deviations from the mean, the
bigger the variance gets. (In most cases, squaring is better than taking the
absolute value because it puts special emphasis on highly deviant values.)
Because of the square, variances are always non-negative, and they
have the somewhat unusual property of having squared units compared to
the original data. So if the random variable of interest is a temperature in
degrees, the variance has units “degrees squared”, and if the variable is
area in square kilometers, the variance is in units of “kilometers to the
fourth power”.
The standard deviation is simply the square root of the variance.
Therefore, it has the same units as the original data, which helps make it
more interpretable. The sample standard deviation is usually represented
by the symbol s.
The variance and standard deviation are two useful measures of
spread. The variance is the mean of the squares of the individual
deviations. The standard deviation is the square root of the variance. For
Normally distributed data, approximately 95% of the values lie within 2 sd
of the mean.
A third measure of spread is the interquartile range. To define IQR,
we first need to define the concepts of quartiles. The quartiles of a
population or a sample are the three values which divide the distribution or
observed data into even fourths. So one quarter of the data fall below the
first quartile, usually written Q1; one half fall below the second quartile
(Q2); and three fourths fall below the third quartile (Q3). The astute reader
will realize that half of the values fall above Q2, one quarter fall above Q3,
and also that Q2 is a synonym for the median. Once the quartiles are
defined, it is easy to define the IQR as IQR = Q3 − Q1. By definition, half of
the values (and specifically the middle half) fall within an interval whose
width equals the IQR. If the data are more spread out, then the IQR tends
to increase, and vice versa.
The IQR is a more robust measure of spread than the variance or
standard deviation. Any number of values in the top or bottom quarters of
the data can be moved any distance from the median without affecting the
IQR at all.
In contrast to the IQR, the range of the data is not very robust at all.
The range of a sample is the distance from the minimum value to the
maximum value: range = maximum - minimum. If you collect repeated
samples from a population, the minimum, maximum and range tend to
change drastically from sample to sample, while the variance and standard
deviation change less, and the IQR least of all.
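All of these measures of spread are built into R; a sketch with an arbitrary sample:

x <- c(4, 7, 8, 10, 12, 15, 21)
var(x)             # sample variance (divides by n - 1)
sd(x)              # standard deviation, the square root of the variance
IQR(x)             # interquartile range, Q3 - Q1
max(x) - min(x)    # range, the least robust of these measures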
Boxplots
Another very useful univariate graphical technique is the boxplot.
Boxplots are very good at presenting information about the central
tendency, symmetry and skew, as well as outliers, although they can be
misleading about some aspects. One of the best uses of boxplots is in the
form of side-by-side boxplots.
Important: The term “outlier” is not well defined in statistics, and the
definition varies depending on the purpose and situation. The “outliers”
identified by a boxplot, which could be called “boxplot outliers” are defined
as any points more than 1.5 IQRs above Q3 or more than 1.5 IQRs below
Q1. This does not by itself indicate a problem with those data points.
Boxplots are an exploratory technique, and you should consider
designation as a boxplot outlier as just a suggestion that the points might
be mistakes or otherwise unusual. Also, points not designated as boxplot
outliers may also be mistakes. It is also important to realize that the number
of boxplot outliers depends strongly on the size of the sample. For data that
is perfectly Normally distributed, we expect 0.70 percent (or about 1 in 150
cases) to be “boxplot outliers”, with approximately half in either direction.
The term fat tails is used to describe the situation where a histogram
has a lot of values far from the mean relative to a Gaussian distribution.
This corresponds to positive kurtosis. In a boxplot, many outliers (more
than the 1/150 expected for a Normal distribution) suggests fat tails
(positive kurtosis), or possibly many data entry errors. Boxplots are
excellent EDA plots because they rely on robust statistics like median and
IQR rather than more sensitive ones such as mean and standard deviation.
With boxplots it is easy to compare distributions with a high degree of
reliability because of the use of these robust statistics.
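A side-by-side boxplot needs only a grouping variable; in the R sketch below both groups are simulated purely to show the call:

set.seed(1)
group <- rep(c("control", "treatment"), each = 30)
value <- c(rnorm(30, mean = 10, sd = 2), rnorm(30, mean = 13, sd = 2))
boxplot(value ~ group)   # boxes show the median and IQR; points beyond 1.5 IQRs are drawn as outliers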
Quantile-normal plots
Cross-tabulation
For categorical data (and quantitative data with only a few different
values) an extension of tabulation called cross-tabulation is very useful. For
two variables, cross-tabulation is performed by making a two-way table with
column headings that match the levels of one variable and row headings
that match the levels of the other variable, then filling in the counts of all
subjects that share a pair of levels. The two variables might be both
explanatory, both outcome, or one of each. Depending on the goals, row
percentages (which add to 100% for each row), column percentages (which
add to 100% for each column) and/or cell percentages (which add to 100%
over all cells) are also useful. Cross-tabulation is the basic bivariate non-
graphical EDA technique.
There are few useful techniques for graphical EDA of two categorical
random variables. The only one used commonly is a grouped barplot with
each group representing one level of one of the variables and each bar
within a group representing the levels of the other variable.
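A sketch of cross-tabulation and a grouped barplot in R, with two invented categorical variables:

sex <- c("M", "F", "M", "F", "F", "M", "F", "M", "F", "F")
outcome <- c("yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes")
tab <- table(sex, outcome)                         # two-way table of counts
prop.table(tab, margin = 1)                        # row proportions (each row sums to 1)
barplot(tab, beside = TRUE, legend.text = TRUE)    # grouped barplot of the same table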
Scatterplots
For two quantitative variables, the basic graphical EDA technique is
the scatterplot which has one variable on the x-axis, one on the y-axis and
a point for each case in your dataset. If one variable is explanatory and the
other is outcome, it is a very, very strong convention to put the outcome on
the y (vertical) axis.
One or two additional categorical variables can be accommodated on
the scatterplot by encoding the additional information in the symbol type
and/or color.
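In R this is a single plot() call; the dose and response values below are placeholders for your own explanatory and outcome variables:

dose <- c(1, 2, 3, 4, 5, 6)                     # explanatory variable on the x-axis
response <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.3)   # outcome variable on the y-axis
plot(dose, response, xlab = "Dose", ylab = "Response")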
You should always perform appropriate EDA before further analysis of your data.
Perform whatever steps are necessary to become more familiar with your data, and
check for obvious mistakes. EDA is not an exact science – it is a very important
art!
Parametric tests and non-parametric tests
Common assumptions
Random sampling
The individuals or individual points in a sample should be selected by some
random process (e.g. a series of random numbers from a computer) in
such a way that every individual or point in the population has an equal
chance of being selected. Even if we do this, we might by chance get mostly
unusually large or unusually small values in the sample. The tests assume samples have
been selected in this way and the probabilities given by the tests allow for
this.
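For instance, a random sample can be drawn in R with sample(); the population vector below is just a stand-in for a list of individuals:

population <- 1:5000                     # identifiers for every individual in the population
chosen <- sample(population, size = 50)  # 50 individuals, each with an equal chance of selection
chosen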
Normal distributions
The Normal distributions assumption relates to the distributions of the
populations being studied, not the samples themselves. For us to accept this
assumption, it must be reasonable on theoretical grounds, i.e. we must expect that
values will be concentrated symmetrically round some mean value, and any
previous research should not contradict this.
The distribution of values in the samples should also appear approximately
Normal. Many computer packages will draw a frequency distribution for you (the
function is sometimes called Histogram). An advantage of this is that it gives you a
figure with which to argue your case for accepting or rejecting that the data come
from a Normal distribution.
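In R, a quick look at a sample's distribution might use a histogram and a quantile-normal plot; the data below are simulated only to show the calls:

set.seed(2)
x <- rnorm(50, mean = 20, sd = 3)   # stand-in for your own sample
hist(x)       # should look roughly bell-shaped if the Normality assumption is reasonable
qqnorm(x)     # quantile-normal plot: points close to a straight line suggest Normality
qqline(x)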
Equal variance
Equal variance is also sometimes referred to as homogeneous
variance, stable variance, constant variance, or homoscedasticity. What it
means is that to give accurate results, statistical tests often require the
'spread', technically the variance, of individual values to be the same in
each of the populations we are comparing. As for the assumption of
Normality, if we are to accept this assumption it must be reasonable on
theoretical grounds and not contradicted by the data in the samples.
Unequal variance occurs quite commonly because groups with high values
tend to have more spread in their values than groups with low values. For
example unfertilized plants might range in height from 10 to 15 cm,
whereas in a fertilized treatment most are between 20 and 30 cm. Not only
is the mean greater in the fertilized treatment, but also the spread of
values.
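A rough check of this assumption in R, using invented plant heights consistent with the ranges described above:

unfertilized <- c(10, 12, 13, 14, 15, 11)   # hypothetical heights, cm
fertilized <- c(20, 24, 27, 30, 22, 29)     # hypothetical heights, cm
var(unfertilized)                            # spread of the unfertilized group
var(fertilized)                              # noticeably larger spread in the fertilized group
var.test(fertilized, unfertilized)           # F test comparing the two variances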
Learning Task:
1. What is hypothesis testing?
2. Discuss at least two examples of exploratory data analysis and
presentation.
3. Explain some of the common assumptions in statistical tests.
III- Statistical Methods
Learning Objectives:
1. Evaluate selected tests for nominal variables including exact test of
goodness-of-fit, power analysis, Chi-square tests of goodness-of-fit,
and Fisher’s exact test.
2. Evaluate some examples of descriptive statistics including measures
of central tendency, dispersion, standard error and confidence limits.
3. Evaluate selected tests for one measurement variable, including the one-
sample t-test, the two-sample t-test, independence, and normality.
The main goal of a statistical test is to answer the question, “What is the
probability of getting a result like my observed data, if the null hypothesis
were true?” If it is very unlikely to get the observed data under the null
hypothesis, you reject the null hypothesis. Most statistical tests take the
following form:
2. Calculate a number, the test statistic, that measures how far the
observed data deviate from the expectation under the null hypothesis.
• For a two-tailed test, which is what you almost always should use:
• You use the exact test of goodness-of-fit when you have one nominal
variable, you want to see whether the number of observations in each
category fits a theoretical expectation, and the sample size is small.
The most common use is a nominal variable with only two values (such as
male or female, left or right, green or yellow), in which case the test may be
called the exact binomial test.
You compare the observed data with the expected data, which are some
kind of theoretical expectation (such as a 1:1 sex ratio or a 3:1 ratio in a
genetic cross) that you determined before you collected the data.
• The null hypothesis is that the flies mate at random, so that there
should be equal numbers of homotypic and heterotypic matings.
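In R the exact binomial test is binom.test(). A sketch of the fly-mating example, using hypothetical counts of 17 homotypic and 8 heterotypic matings:

# Null hypothesis: homotypic and heterotypic matings are equally likely (p = 0.5)
binom.test(x = 17, n = 25, p = 0.5)   # two-tailed exact test of goodness-of-fit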
Sign Test
• You use the sign test when there are two nominal variables and one
measurement variable. One of the nominal variables has only two
values, such as “before” and “after” or “left” and “right,” and the other
nominal variable identifies the pairs of observations.
• The data for a sign test usually could be analyzed using a paired
t–test if the null hypothesis is that the mean or median difference
between pairs of observations is zero.
• They found ten pairs of sister groups in which one group of related
species, or “clade,” fed on angiosperms and one fed on
gymnosperms, and they counted the number of species in each clade
• The basic procedure is the same as for the exact binomial test: you
calculate the probabilities of the observed result and all more extreme
possible results and add them together.
Assumptions
To give an example, let’s say you want to know what color of flowers that
bees like. You plant four plots of flowers: one purple, one red, one blue,
and one white. You get a bee, put it in a dark jar, carry it to a point
equidistant from the four plots of flowers, and release it.
You record which color flower it goes to first, then re-capture it and hold it
prisoner until the experiment is done.
In this case, the observations are independent; the fact that bee #1 went to
a blue flower has no influence on where bee #2 goes.
Now let’s say that you put a beehive at the point equidistant from the four
plots of flowers, and you record where the first 100 bees go.
If the first bee happens to go to the plot of blue flowers, it will go back to the
hive and do its bee-butt-wiggling dance that tells the other bees, “Go 15
meters southwest, there’s a bunch of yummy nectar there!”
Then some more bees will fly to the blue flowers, and when they return to
the hive, they’ll do the same bee-butt-wiggling dance.
Parameters
Effect size
The effect size is the minimum deviation from the null hypothesis that you
hope to detect.
Example:
• If you are treating hens with something that you hope will change the
sex ratio of their chicks, you might decide that the minimum change in
the proportion of sexes that you’re looking for is 10%.
• You would then say that your effect size is 10%. If you’re testing
something to make the hens lay more eggs, the effect size might be 2
eggs per month.
Alpha
• Alpha is the significance level of the test (the P value), the probability
of rejecting the null hypothesis even though it is true (a false positive).
Beta or power
Example
• You plan to cross peas that are heterozygotes for Yellow/green pea
color, where Yellow is dominant. The expected ratio in the offspring is
3 Yellow: 1 green.
• You want to know whether yellow peas are actually more or less fit,
which might show up as a different proportion of yellow peas than
expected. You arbitrarily decide that you want a sample size that will
detect a significant (P<0.05) difference if there are 3% more or fewer
yellow peas than expected, with a power of 90%.
You will test the data using the exact binomial test of goodness-of-fit if the
sample size is small enough, or a G–test of goodness-of-fit if the sample
size is larger. The power analysis is the same for both tests.
• Using G*Power for the exact test of goodness-of-fit, the result is that
it would take 2190 pea plants if you want to get a significant (P<0.05)
result 90% of the time, if the true proportion of yellow peas is 78% or 72%.
• That’s a lot of peas, but you’re reassured to see that it’s not a
ridiculous number. If you want to detect a difference of 0.1% between
the expected and observed numbers of yellow peas, you can
calculate that you’ll need 1,970,142 peas.
• If that’s what you need to detect, the sample size analysis tells you
that you’re going to have to include a pea-sorting robot in your
budget.
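G*Power is a stand-alone program; in R a comparable calculation can be sketched with the add-on pwr package (assuming it is installed). Because pwr uses an arcsine effect-size approximation rather than the exact test, the sample size it returns will be in the same ballpark as, but not identical to, the figure quoted above.

library(pwr)   # add-on package; install.packages("pwr") if needed
# Detect 78% (or 72%) yellow peas when 75% is expected, at alpha = 0.05 and 90% power
h <- ES.h(0.78, 0.75)   # effect size for a one-proportion test
pwr.p.test(h = h, sig.level = 0.05, power = 0.90, alternative = "two.sided")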
An analysis which finds that the power was low should lead one to
regard the negative results as ambiguous, since failure to reject the null
hypothesis cannot have much substantive meaning when, even though the
phenomenon exists (to some given degree), the probability of rejecting the
null hypothesis was low.
χ² = (ad − bc)² (a + b + c + d) / [(a + b)(c + d)(b + d)(a + c)]
Chi-square test is not suitable when the sample is small. For studies
with small samples, the best method to apply is the Fisher’s exact test.
Independence tests are used to determine if there is a significant
relationship between two categorical variables. There exists two different
types of independence test:
The Chi-square test is used when the sample is large enough (in this
case the p-value is an approximation that becomes exact when the sample
becomes infinite, which is the case for many statistical tests). On the other
hand, the Fisher’s exact test is used when the sample is small (and in this
case the p-value is exact and is not an approximation).
The statistical hypothesis that can be formulated for Fisher exact test
is exactly the same as that for chi-square test. If the computed probability
value of the Fisher exact test is less than the standard cut-off p-value of
0.05, then we reject the null hypothesis and conclude that there is an
association between the column variable and the row variable or the
proportion with the characteristic of interest is not the same in both
populations.
The Fisher’s exact test calculates the exact probability of the observed table
of cell frequencies. For a 2 × 2 table with cells a, b, c, d and marginal totals:

a      b      a+b
c      d      c+d
a+c    b+d    n
If the margins of the table are fixed, the exact probability of a table with cells
a, b, c, d and marginal totals (a+b), (c+d), (a+c), (b+d) is:

p = [(a+b)! × (c+d)! × (a+c)! × (b+d)!] / [n! × a! × b! × c! × d!]
1      8     9
4      5     9
5     13    18

p = (9! × 9! × 5! × 13!) / (18! × 1! × 8! × 4! × 5!) ≈ 0.132
The p-value for the Fisher’s exact test is calculated by summing all
probabilities less than or equal to the probability of the observed table.
The probability is smallest for the tables that are least likely to occur
by chance if the null hypothesis of independence is true.
Hypotheses
The hypotheses of the Fisher’s exact test are the same as for the Chi-
square test, that is:
H0: the variables are independent, there is no relationship between
the two categorical variables. Knowing the value of one variable does
not help to predict the value of the other variable
H1: the variables are dependent, there is a relationship between the
two categorical variables. Knowing the value of one variable helps to
predict the value of the other variable
Remember that the Fisher’s exact test is used when at least one cell in the
contingency table has an expected frequency below 5. Fisher's exact test is
applied in practice mainly to small samples, but it is actually valid for all
sample sizes. While the chi-squared test relies on an approximation, Fisher's
exact test is an exact test. In particular, when more than 20% of cells have
expected frequencies < 5, we need to use Fisher's exact test because the
approximation method is inadequate.
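Both tests are one line each in R; here they are applied to the 2 × 2 table shown earlier (cell counts 1, 8, 4, 5):

tab <- matrix(c(1, 8, 4, 5), nrow = 2, byrow = TRUE)
chisq.test(tab)    # chi-square test; R warns that the approximation may be poor for such small counts
fisher.test(tab)   # Fisher's exact test, appropriate for this small sample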
Central Tendency
How well did my students do on the last test? What is the average
price of gasoline in the Phoenix metropolitan area? What is the mean
number of home runs hit in the National League? These questions are
asking for a statistic that describes a large set of data. In this section we
will study the mean, median, and mode. These three statistics describe an
average or center of a distribution of numbers.
Sigma notation Σ
• Given a sample of n data points, x1, x2, x3, …, xn, the mean or average is
x̄ = (x1 + x2 + … + xn) / n = (1/n) Σ xi
• My 5 test scores for Calculus I are 95, 83, 92, 81, 75. What is the
mean?
• ANSWER: sum up all the tests and divide by the total number of
tests.
• Test mean = (95+83+92+81+75)/5 = 85.2
When you are given a range of data, you need to find midpoints. To find a
midpoint, sum the two endpoints on the range and divide by 2. Example
14≤x<18. The midpoint (14+18)/2=16. The total number of students is
5,542,000.
The median
How do you find the median? First, if possible or feasible, arrange the data
from smallest value to largest value. The location of the median can be
calculated using this formula: (n+1)/2. If (n+1)/2 is a whole number then
that value gives the location. Just report the value of that location as the
median.
If (n+1)/2 is not a whole number then the first whole number less than the
location value and the first whole number greater than the location value
will be used to calculate the median. Take the data located at those two
positions and calculate their average; this is the median.
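Using the five Calculus test scores from the example above, the same calculations in R:

scores <- c(95, 83, 92, 81, 75)
mean(scores)     # (95 + 83 + 92 + 81 + 75) / 5 = 85.2
sort(scores)     # 75 81 83 92 95; the median's location is (5 + 1) / 2 = 3
median(scores)   # 83, the third value in the ordered list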
The mode
Dispersion
Range: This is simply the difference between the largest and smallest
observations. This is the statistic of dispersion that people use in everyday
conversation; if you were telling your Uncle about your research on the
giant deep-sea isopod Bathynomus giganteus, you wouldn't blather about
means and standard deviations, you'd say they ranged from 4.4 to 36.5 cm
long (Briones-Fourzán and Lozano-Alvarez 1991). Then you'd explain that
isopods are roly-polies, and 36.5 cm is about 14 American inches, and
Uncle would finally be impressed, because a roly-poly that's over a foot
long is pretty impressive.
There are a number of web pages that calculate range, variance, and
standard deviation, along with other descriptive statistics. Some of them
are given below.
SAS
PROC UNIVARIATE will calculate the range, variance, standard
deviation and coefficient of variation. It calculates the sample variance and
sample standard deviation.
Standard Error
The standard error (to be more precise, the standard error of the
mean) is a property of our estimate of the mean. The SEM is equal to the
SD divided by the square root of n.
This quantity tells us how our estimate of the mean will vary from
sample to sample (these are theoretical samples, if we could redo our
exact study many times and compute the sample mean over and over
again and look at how it varies). It is a summary of how precise our
estimate is (as we expect, as sample size increases, our ability to estimate
the mean precisely improves, so the SEM decreases). See the difference?
Standard Deviation SD is concerned with the scatter of individual data
points in the population, while the SEM is concerned with the variability of
our estimate of the mean.
It is clear that the SEM will always be smaller than the SD (which is
not a function of sample size).
Step 4: Sum the squared deviations (Add up the numbers from step
3).
Step 5: Divide that sum from step 4 by one less than the sample size
(n-1, that is, the number of measurements minus one)
Step 6: Take the square root of the number in step 5. That gives you
the "standard deviation (S.D.)."
Step 7: Divide the standard deviation by the square root of the sample size
(the number of measurements); this gives you the standard error (S.E.).
Step 8: Subtract the standard error from the mean and record that
number. Then add the standard error to the mean and record that number.
You have plotted mean ± 1 standard error (S.E.), the distance from 1
standard error below the mean to 1 standard error above the mean.
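The step-by-step calculation above collapses to a few lines in R (illustrated with an arbitrary sample):

x <- c(12, 15, 11, 19, 14)
n <- length(x)
s <- sqrt(sum((x - mean(x))^2) / (n - 1))   # steps 4-6: sum of squared deviations, divided by n - 1, square-rooted (same as sd(x))
se <- s / sqrt(n)                           # step 7: the standard error of the mean
c(mean(x) - se, mean(x) + se)               # step 8: mean minus and plus 1 standard error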
Confidence limits for the mean (Snedecor and Cochran, 1989) are an
interval estimate for the mean. Interval estimates are often desirable
because the estimate of the mean varies from sample to sample. Instead of
a single estimate for the mean, a confidence interval generates a lower and
upper limit for the mean. The interval estimate gives an indication of how
much uncertainty there is in our estimate of the true mean. The narrower
the interval, the more precise is our estimate.
From the formula, it is clear that the width of the interval is controlled by two
factors:
1. The larger the sample size, the narrower the confidence interval. That is,
one way to obtain more precise estimates for the mean is to increase the
sample size.
2. The larger the sample standard deviation, the larger the confidence
interval. This simply means that noisy data, i.e., data with a large
standard deviation, are going to generate wider intervals than data
with a smaller standard deviation.
Confidence limits for the mean can be used to answer the following
questions:
Confidence limits for the mean are available in just about all general-
purpose statistical software programs. Both Dataplot code and R code can
be used to generate the analyses in this section.
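In R, confidence limits for the mean come with t.test(); a sketch with an arbitrary sample:

x <- c(4.1, 5.3, 6.0, 4.8, 5.5, 6.2, 5.0, 4.6)
t.test(x)$conf.int                      # default 95% confidence limits for the mean
t.test(x, conf.level = 0.99)$conf.int   # a higher confidence level gives a wider interval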
Learning Task:
Data Structure
For this procedure, the data are entered as a single column and
specified as a response variable. Multiple columns can be analyzed
individually during the same run if multiple response variables are
specified.
Weight
159
155
157
125
103
122
101
82
228
199
195
110
191
151
119
119
112
87
190
87
Setup
To run this example, complete the following steps:
1. Open the Weight example dataset
• From the File menu of the NCSS Data window, select Open Example
Data.
• Select Weight and click OK.
2. Specify the One-Sample T-Test procedure options
• Find and open the One-Sample T-Test procedure using the menus or the
Procedure Navigator.
• The settings for this example are listed below and are stored in the
Example 1 settings template. To load this template, click Open Example
Template in the Help Center or File menu.
Option Value
Variable(s).............................................................Weight
Reports Tab
Limits .....................................................................Two-Sided
σ.............................................................................. 40
H0 μ = .................................................................... 130
Plots Tab
• Click the Run button to perform the calculations and generate the output
T-Test Section
This section presents the results of the traditional one-sample T-test. Here,
reports for all three alternative hypotheses are shown, but a researcher
would typically choose one of the three before generating the output. All
three tests are shown here for the purpose of exhibiting all the output
options available.
T-Statistic
The T-Statistic is the value used to produce the p-value (Prob Level) based
on the T distribution.
d.f.
The degrees of freedom define the T distribution upon which the probability
values are based. The formula for the degrees of freedom is: df = n − 1
Prob Level
The probability level, also known as the p-value or significance level, is the
probability that the test statistic will take a value at least as extreme as the
observed value, assuming that the null hypothesis is true. If the p-value is
less than the prescribed α, in this case 0.05, the null hypothesis is rejected
in favor of the alternative hypothesis. Otherwise, there is not sufficient
evidence to reject the null hypothesis.
Reject H0 at α = (0.050)
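For readers working in R rather than NCSS, a sketch of the same one-sample test on the 20 weights listed above (this mirrors, but does not reproduce, the NCSS report):

weight <- c(159, 155, 157, 125, 103, 122, 101, 82, 228, 199,
            195, 110, 191, 151, 119, 119, 112, 87, 190, 87)
t.test(weight, mu = 130)                            # two-sided test of H0: mu = 130, df = n - 1 = 19
t.test(weight, mu = 130, alternative = "greater")   # one of the one-sided alternatives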
You can use the test when your data values are independent, are
randomly sampled from two normal populations and the two independent
groups have equal variances.
What do we need?
Example:
Our sample data is from a group of men and women who did
workouts at a gym three times a week for a year. Then, their trainer
measured the body fat. The table below shows the data.
The data values are independent. The body fat for any one person
does not depend on the body fat for another person.
We assume the people measured represent a simple random sample
from the population of members of the gym.
We assume the data are normally distributed, and we can check this
assumption.
The data values are body fat measurements. The measurements are
continuous.
We assume the variances for men and women are equal, and we can
check this assumption.
Figure 1: Histogram and summary statistics for the body fat data
The two histograms are on the same scale. From a quick look, we
can see that there are no very unusual points, or outliers. The data look
roughly bell-shaped, so our initial idea of a normal distribution seems
reasonable.
Data format
There are two ways to enter and store the data for two-sample t-test.
The first is very similar to one-sample t-tests, except you have two columns
instead of one: one column for each sample. This first method is simpler to
use with the function t.test(). However, it often confuses students when
they think about whether the predictor and response variables are
continuous or categorical. The second is more typical of how researchers
store data. There are also two columns, but one column specifies the
sample (e.g., group 1 or group 2) and the other specifies the data you will
analyze. This format makes it clear that we have two variables, one is
categorical and the other is continuous. However, it requires a little more
effort to pull out the data for each group. Below are examples of both.
You can enter your data directly into R or enter into Excel and use the
function read.csv to get it into R. Below I have entered the data directly into
R with the function c() and then used the function data.frame() to create a
data.frame. I named the columns of the data Sample1 and Sample2.
First format
s1 <- c(23, 45, 34, 37, 29, 44, 40, 34)
s2 <- c(12, 20, 19, 18, 22, 14, 17, 17)
(twoSampleData <- data.frame(Sample1 = s1, Sample2 = s2))
## Sample1 Sample2
## 1 23 12
## 2 45 20
## 3 34 19
## 4 37 18
## 5 29 22
## 6 44 14
## 7 40 17
## 8 34 17
#Second format
sample <- rep(c("s1", "s2"), each = 8)
data <- c(23, 45, 34, 37, 29, 44, 40, 34, 12, 20, 19, 18, 22, 14, 17, 17)
(twoSampleData2 <- data.frame(Sample = sample, Data = data))
## Sample Data
## 1 s1 23
## 2 s1 45
## 3 s1 34
## 4 s1 37
## 5 s1 29
## 6 s1 44
## 7 s1 40
## 8 s1 34
## 9 s2 12
## 10 s2 20
## 11 s2 19
## 12 s2 18
## 13 s2 22
## 14 s2 14
## 15 s2 17
## 16 s2 17
Both formats have exactly the same data, and you can convert one to the
other.
Below is how to pull data from each format.
#Pull data from the first format
twoSampleData$Sample1
## [1] 23 45 34 37 29 44 40 34
twoSampleData$Sample2
## [1] 12 20 19 18 22 14 17 17
#Pull data from the second format
twoSampleData2$Data[twoSampleData2$Sample == "s1"]
## [1] 23 45 34 37 29 44 40 34
P-value
We look up the P-value exactly the same way as for the one-sample t-test.
We need to know whether the test is a 1- or 2-tailed test. Then we can
calculate the probability of getting our t-value or something more extreme
for the appropriate degrees of freedom. Let us assume that we are
interested in just a difference between our samples or groups, and thus the
test should be 2-tailed (because the biological hypotheses are two-sided).
2*pt(-tTwoSample, length(twoSampleData$Sample1) +
length(twoSampleData$Sample2) - 2)
## [1] 1.612091e-05
## [1] 6.893631
## [1] 0.9998837
t.test(twoSampleData$Sample1, twoSampleData$Sample2, paired = TRUE, alternative = "less")
##
## Paired t-test
##
## data: twoSampleData$Sample1 and twoSampleData$Sample2
## t = 6.8936, df = 7, p-value = 0.9999
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 23.42501
## sample estimates:
## mean of the differences
## 18.375
We didn’t need to worry about the argument var.equal because the test is
really a one-sample test; there is really only one variable (the difference).
We could also analyze the difference with a one-sample t-test, which will
give us the same answer as above.
dif <- twoSampleData$Sample1 - twoSampleData$Sample2
t.test(dif, alternative = "less")
##
## One Sample t-test
##
## data: dif
## t = 6.8936, df = 7, p-value = 0.9999
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
## -Inf 23.42501
## sample estimates:
## mean of x
## 18.375
If your calculated t value is greater than the critical t-value from the
table, you can conclude that the difference between the means for
the two groups is significant. We reject the null hypothesis
and conclude that the alternative hypothesis is correct.
If your calculated t value is lower than the critical t-value from the
table, you can conclude that the difference between the means for
the two groups is NOT significant. We fail to reject the null
hypothesis.
Independence
Measurement variables
Nominal variables
It is not easy to look at your data and see whether the data are non-
independent. You need to understand the biology of your organisms and
carefully design your experiment so that the observations will be
independent. For your comparison of the weights of calico cats vs. black
cats, you should know that cats from the same litter are likely to be similar
in weight; you could therefore make sure to sample only one cat from each
of many litters. You could also sample multiple cats from each litter but
treat "litter" as a second nominal variable and analyze the data
using nested anova. For Sally the tiger, you might know from previous
research that bouts of activity or inactivity in tigers last for 5 to 10 minutes,
so that you could treat one-minute observations made an hour apart as
independent. Or you might know from previous research that the activity of
one tiger has no effect on other tigers, so measuring activity of five tigers at
the same time would actually be okay.
For regression and correlation analyses of data collected over a length
of time, there are statistical tests developed for time series.
Normality
Most tests for measurement variables assume that data are normally
distributed (fit a bell-shaped curve).
Other data sets don't fit the normal distribution very well. The histogram
on the top is the level of sulphate in Maryland streams (data from
the Maryland Biological Stream Survey). It doesn't fit the normal curve very
well, because there are a small number of streams with very high levels of
sulphate. The histogram on the bottom is the number of egg masses laid by
individuals of the lentago host race of the
treehopper Enchenopa (unpublished data courtesy of Michael Cast). The
curve is bimodal, with one peak at around 14 egg masses and the other at
zero.
Parametric tests assume that your data fit the normal distribution. If your
measurement variable is not normally distributed, you may be increasing
your chance of a false positive result if you analyze the data with a test that
assumes normality.
A histogram with a long tail on the right side, such as the sulphate data
above, is said to be skewed to the right; a histogram with a long tail on the
left side is said to be skewed to the left. There is a statistic to describe
skewness, g1; there is no rule of thumb that you shouldn't do a parametric
test if g1 is greater than some cutoff value.
Another way in which data can deviate from the normal distribution is
kurtosis. A histogram that has a high peak in the middle and long tails on
either side is leptokurtic; a histogram with a broad, flat middle and short
tails is platykurtic. The statistic to describe kurtosis is g2.
Definition
One-way anova: a test that allows one to make comparisons between the
means of three or more groups of data.
Two-way anova: a test that allows one to make comparisons between the
means of three or more groups of data, where two independent variables
are considered.
Use one-way anova when you have one nominal variable and one
measurement variable; the nominal variable divides the measurements into
two or more groups. It tests whether the means of the measurement
variable are the same for the different groups.
Use two-way anova when you have one measurement variable and two
nominal variables, and each value of one nominal variable is found in
combination with each value of the other nominal variable. It tests three null
hypotheses: that the means of the measurement variable are equal for
different values of the first nominal variable; that the means are equal for
different values of the second nominal variable; and that there is no
interaction (the effects of one nominal variable don't depend on the value of
the other nominal variable).
Salvatore Mangiafico's R Companion has a sample R program for one-
way and two-way anova.
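A rough sketch of both designs with R's built-in aov() (this is not the R Companion program; the data frame, factor names, and values are invented purely to show the model formulas):

growth <- data.frame(
  height = c(12, 14, 11, 18, 20, 17, 16, 15, 17, 25, 23, 26),   # measurement variable
  fertilizer = rep(c("none", "added"), each = 6),               # first nominal variable
  water = rep(rep(c("low", "high"), each = 3), times = 2)       # second nominal variable
)
summary(aov(height ~ fertilizer, data = growth))           # one-way anova
summary(aov(height ~ fertilizer * water, data = growth))   # two-way anova with interaction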
Introduction to correlation and linear regression

Speed (kph)    Pulse (beats per minute)
3.1            78
6.0            87
6.9            90
7.7            92
8.7            97
12.4           108
15.3           119
There are three things you can do with this kind of data. One is a
hypothesis test, to see if there is an association between the two variables;
in other words, as the X variable goes up, does the Y variable tend to
change (up or down). For the exercise data, you'd want to know whether
pulse rate was significantly higher with higher speeds. The P value is
1.3×10−8, but the relationship is so obvious from the graph, and so
biologically unsurprising (of course my pulse rate goes up when I exercise
harder!), that the hypothesis test wouldn't be a very interesting part of the
analysis.
The second goal is to describe how tightly the two variables are
associated. This is usually expressed with r, which ranges from −1 to 1,
or r2, which ranges from 0 to 1. For the exercise data, there's a very tight
relationship; this means that if you knew my speed on the elliptical
machine, you'd be able to predict my pulse quite accurately.
The final goal is to determine the equation of a line that goes through
the cloud of points. The equation of a line is given in the form Ŷ=a+bX,
where Ŷ is the value of Y predicted for a given value of X, a is
the Y intercept (the value of Y when X is zero), and b is the slope of the line
(the change in Ŷ for a change in X of one unit). For the exercise data, the
equation is Ŷ=63.5+3.75X; this predicts that my pulse would be 63.5 when
the speed of the elliptical machine is 0 kph, and my pulse would go up by
3.75 beats per minute for every 1 kph increase in speed. This is probably
the most useful part of the analysis for the exercise data; if I wanted to
exercise with a particular level of effort, as measured by pulse rate, I could
use the equation to predict the speed I should use.
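The speed and pulse values tabulated above can be analyzed the same way in R. Note that only part of the original dataset appears in the table, so the fitted intercept, slope, and P value will not exactly match the full-data numbers quoted in the text.

speed <- c(3.1, 6, 6.9, 7.7, 8.7, 12.4, 15.3)   # kph
pulse <- c(78, 87, 90, 92, 97, 108, 119)        # beats per minute
cor.test(speed, pulse)       # hypothesis test of the association, and the correlation coefficient r
summary(lm(pulse ~ speed))   # intercept a and slope b of the fitted line Y-hat = a + bX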
When to use them
Use correlation/linear regression when you have two measurement
variables, such as food intake and weight, drug dosage and blood
pressure, air temperature and metabolic rate, etc.
There's also one nominal variable that keeps the two measurements
together in pairs, such as the name of an individual organism, experimental
trial, or location. I'm not aware that anyone else considers this nominal
variable to be part of correlation and regression, and it's not something you
need to know the value of—you could indicate that a food intake
measurement and weight measurement came from the same rat by putting
both numbers on the same line, without ever giving the rat a name. For that
reason, I'll call it a "hidden" nominal variable.
The main value of the hidden nominal variable is that it lets me make
the blanket statement that any time you have two or more measurements
from a single individual (organism, experimental trial, location, etc.), the
identity of that individual is a nominal variable; if you only have one
measurement from an individual, the individual is not a nominal variable.
There are three main goals for correlation and regression in biology.
One is to see whether two measurement variables are associated with
each other; whether as one variable increases, the other tends to increase
(or decrease). You summarize this test of association with the P value. In
some cases, this addresses a biological question about cause-and-effect
relationships; a significant association means that different values of the
independent variable cause different values of the dependent. An example
would be giving people different amounts of a drug and measuring their
blood pressure. The null hypothesis would be that there was no relationship
between the amount of drug and the blood pressure. If you reject the null
hypothesis, you would conclude that the amount of drug causes the
changes in blood pressure. In this kind of experiment, you determine the
values of the independent variable; for example, you decide what dose of
the drug each person gets. The exercise and pulse data are an example of
this, as I determined the speed on the elliptical machine, then measured
the effect on pulse rate.
Use Spearman rank correlation when you have two ranked variables,
and you want to see whether the two variables covary; whether, as one
variable increases, the other variable tends to increase or decrease. You
also use Spearman rank correlation if you have one measurement
variable and one ranked variable; in this case, you convert the
measurement variable to ranks and use Spearman rank correlation on the
two sets of ranks.
For example, Melfi and Poyser (2007) observed the behavior of 6 male
colobus monkeys (Colobus guereza) in a zoo. By seeing which monkeys
pushed other monkeys out of their way, they were able to rank the
monkeys in a dominance hierarchy, from most dominant to least dominant.
This is a ranked variable; while the researchers know that Erroll is
dominant over Milo because Erroll pushes Milo out of his way, and Milo is
dominant over Fraiser, they don't know whether the difference in
dominance between Erroll and Milo is larger or smaller than the difference
in dominance between Milo and Fraiser. After determining the dominance
rankings, Melfi and Poyser (2007) counted eggs of Trichuris nematodes per
gram of monkey feces, a measurement variable. They wanted to know
whether social dominance was associated with the number of nematode
eggs, so they converted eggs per gram of feces to ranks and used
Spearman rank correlation. For the Colobus monkey example, Spearman's
ρ is 0.943, and the P value from the table is less than 0.025, so the
association between social dominance and nematode eggs is significant.
Monkey     Dominance rank    Eggs per gram    Rank of eggs per gram
Erroll     1                 5777             1
Milo       2                 4225             2
Fraiser    3                 2674             3
Fergus     4                 1249             4
Kabul      5                 749              6
Hope       6                 870              5
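Using the dominance ranks and the egg-count ranks from the table above, the Spearman rank correlation can be reproduced in R:

dominance_rank <- c(1, 2, 3, 4, 5, 6)   # Erroll through Hope
egg_rank <- c(1, 2, 3, 4, 6, 5)         # ranks of eggs per gram of feces
cor.test(dominance_rank, egg_rank, method = "spearman")   # Spearman's rho = 0.943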
Null hypothesis
The null hypothesis is that the Spearman correlation coefficient, ρ
("rho"), is 0. A ρ of 0 means that the ranks of one variable do not covary
with the ranks of the other variable; in other words, as the ranks of one
variable increase, the ranks of the other variable do not increase (or
decrease).
Assumption
When you use Spearman rank correlation on one or two measurement
variables converted to ranks, it does not assume that the measurements
are normal or homoscedastic. It also doesn't assume the relationship is
linear; you can use Spearman rank correlation even if the association
between the variables is curved, as long as the underlying relationship is
monotonic (as X gets larger, Y keeps getting larger, or keeps getting
smaller). If you have a non-monotonic relationship (as X gets larger, Y gets
larger and then gets smaller, or Y gets smaller and then gets larger, or
something more complicated), you shouldn't use Spearman rank
correlation.
Example
1760 529
3010 484
2040 566
3080 527
2440 473
3370 488
2550 461
3740 485
2730 465
4910 478
5090 434
5090 468
5380 449
5850 425
6730 389
6990 421
7960 416
Magnificent frigatebird, Fregata magnificens.