By Dillon Jones
Preface
Howdy! The text below provides a conceptual understanding of statistics with a specific
focus on analyzing biological data in the R coding environment. While teaching
biostatistics at the university level, I found that my students had 2 main struggles: not
understanding core statistical concepts and making simple errors in R. This text aims to
tackle the former problem. If you’re interested in solving the latter, check out my
Fundamentals of R for Biologists material.
Interestingly, I found that mathematical knowledge was rarely a barrier. This realization
surprised me, as statistics has traditionally been taught under a math-focused
curriculum: memorizing equations, plugging in variables, and keeping track of often
confusing symbols.
In my opinion, this is why statistics, despite being so critical to all our scientific
endeavors, is often the weakest skill among scientists.
This text, Fundamental Biostatistics, gives its readers the conceptual understanding
needed to actually understand statistical theory and then apply that knowledge to
biological datasets. Mathematics is largely ignored in this text in favor of letting R run
the calculations and statistical tests.
For many, understanding basic statistical concepts and being able to interpret results
using those concepts is all that is ever needed. After all, with tools such as R, online
calculators, and collaborators from the statistics department, many scientists don’t
require in-depth statistical training to run and interpret simple t-tests, correlations, and
linear regressions. This book is tailor-made for those individuals.
However, for those who do want to understand the mathematics, it is nearly impossible
to do so without understanding the concepts laid out here. If this is you, this text should
serve as a springboard into more advanced discussions of statistics.
Introduction to statistics
What is Statistics?
Statistics is a branch of mathematics that involves collecting, analyzing, and interpreting
data. It provides the tools that let us make sense of data, interpret results from
analyses, and test hypotheses. In the context of biology, statistics is used to design and
analyze experiments, draw conclusions from biological data, and communicate our
results to others.
This course covers the basics of statistics with a focus on a conceptual understanding,
rather than a mathematical understanding. We will cover common statistical tests used
in biological studies and their interpretation.
Lastly, statistics can be performed in a wide range of software and even by hand.
However, this course covers statistics using the R programming language. While we do
not have time to cover the basics of R, we will keep the explanations and code as
straightforward as possible.
Hypothesis testing
Hypothesis testing is an essential component of statistics, and a proper understanding of it is
critical to interpreting our statistical analyses. Typically, we utilize a null hypothesis (there is
no effect or difference) and an alternative hypothesis (there is an effect or difference).
Let’s show some examples. On a probability distribution, the y-axis indicates the
probability density, or how probable a given event is, and the x-axis corresponds to
some outcome.
Below is a figure displaying the number of bird species detected in 100 different plots. Some
plots had 2 species while others had 3 species or 4 species, and so on. We
can then create a probability distribution that shows the proportion of times
that each number of species was detected. In the example below, 30 out of 100 plots had 5
species, so it would have a proportional frequency of .3.
But we are not limited to single values! We can ask “What is the probability of a plot
having 5 or more bird species?”
We can actually sum the probabilities of having 5, 6, and 7 bird species together and
come up with .6 (60%)!
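A minimal sketch of these calculations in R (the counts below are assumptions chosen to match the proportions described above):

species_counts <- c(rep(2, 10), rep(3, 10), rep(4, 20), rep(5, 30), rep(6, 20), rep(7, 10)) # 100 plots
prob_dist <- table(species_counts) / length(species_counts) # proportional frequency of each outcome
prob_dist                        # 5 species has a proportional frequency of .3
sum(prob_dist[c("5", "6", "7")]) # probability of 5 or more species: .6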
The smaller the p-value, the stronger the evidence against the null hypothesis and the
more confident we can be in rejecting it.
In most biological studies, we use a p-value threshold of .05. This means that at a p-value
of .05, we would expect a false positive from random noise about 5% of the time.
Any value equal to or lower than .05 tells us that our result is significant and we are able
to reject our null hypothesis. This .05 value is linked directly to the common 95%
confidence interval. Keep in mind that this threshold is arbitrary, and many statisticians
recommend using a stricter threshold such as .01.
The p-value, and significance in general, is also the most misunderstood metric in statistics. As we’ll
discover in our next lesson, significant results do NOT mean meaningful results. Rather,
they tell us “we have enough data to detect a difference”.
Say we went and measured the toe length from 2 populations of lizards. After collecting
thousands of measurements, we find a significant difference between the two
populations.
While both results allow us to reject the null hypothesis, because they are both
significant, we would use our understanding of the biology to say that one is biologically
meaningful while the other is not.
Cohen’s d is a measure of effect size that describes the difference between two
means. Generally, a Cohen’s d of 0.3-0.5 is considered a small effect size, 0.5-0.8 a
moderate effect size, and 0.8 or higher a large effect size. Cohen’s d is calculated by
taking the difference between the means of two groups and dividing it by the pooled
standard deviation.
With that calculation, we can say that a Cohen’s d of 1 indicates that the mean
difference between populations is equal to 1 standard deviation. A Cohen’s d of .5 tells
us that the mean difference is only half a standard deviation.
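A minimal sketch of that calculation (group1 and group2 are hypothetical measurement vectors; the simulated means and standard deviations are assumptions for illustration):

set.seed(1)
group1 <- rnorm(50, mean = 10, sd = 2)
group2 <- rnorm(50, mean = 12, sd = 2)
pooled_sd <- sqrt((var(group1) + var(group2)) / 2) # pooled standard deviation (equal group sizes)
(mean(group2) - mean(group1)) / pooled_sd          # Cohen's d: the mean difference in units of SD, ~1 here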
There are many metrics that can be used for determining effect size and they depend
on which statistical test you are performing. Another common one is the correlation
coefficient. We’ll cover it in depth during our correlation section, but it is a value from -1
to 1 that tells us the direction and strength of the relationship between 2 continuous
variables. We can show that relationship according to how tightly clustered points are to
a line of best fit.
Binomial Distribution
The binomial distribution describes the frequency of “successful outcomes” across a set
of trials. True to its name, binomial distributions are binary in that
each trial has only 1 of 2 outcomes: yes-no, success-fail, presence-absence.
Along the x axis, we describe the number of successful (or unsuccessful) trials during
our study. A biological example may be the probability of survival for individuals each year.
We could have the probability of surviving 1 year, 2 years, 3 years, and so on until we
reach a known maximum. This distribution is binomial because at each year, the
outcomes are only 1 of 2 options: survive or don’t survive. Another example may be the
presence (or absence) of a particular gene in multiple populations, where on the x axis
we would have the number of individuals in each population with that genotype.
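As a small illustration of the idea, here is a minimal sketch (the 10 individuals and the survival probability of 0.7 are assumptions for illustration):

survivors <- 0:10
probs <- dbinom(survivors, size = 10, prob = 0.7)  # binomial probability of each possible number of survivors
barplot(probs, names.arg = survivors,
        xlab = "Number of survivors", ylab = "Probability")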
Normal Distribution
The normal distribution is a continuous probability distribution that is often used in
biology as well as many other fields. The normal distribution is characterized by its bell-
shaped curve, with the mean, median, and mode all being equal. Many biological
measurements, such as the height of individuals in a population or the weight of seeds
produced by a plant, can be modeled using the normal distribution. The normal
distribution is important in statistical analysis because many statistical tests rely on the
assumption that the data is normally distributed.
Normal distribution
The normal distribution assumption is a common assumption in statistical tests. As a
reminder, a normal distribution is continuous data where the mean, median, and
mode are all equal, typically displaying as a bell-shaped curve. Normally distributed data
is probably the most important assumption as many statistical tests are designed with
this distribution in mind.
Normality can be checked in R through visual checks like QQ-plots or statistical tests
like the Shapiro-Wilk test. Here we will break down how to interpret each of these tests.
A QQ-plot can be generated for continuous data in R using the following code:
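# a minimal sketch, assuming x is a numeric vector of continuous measurements
qqnorm(x) # plot the sample quantiles against theoretical normal quantiles
qqline(x) # add a reference line; points near the line suggest normality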
Below, we have normally distributed data. We’ll talk about the Shapiro Wilk Test in just a
moment, so for now look at the shape of the data on the left and the QQ-plot on the
right.
set.seed(123456)
y <- numeric(1000) # pre-allocate the vector we fill below
for(i in 1:1000){ # pull a random number and add it to i. This data will not be normal.
  y[i] <- runif(1,0,100)+i
}
Here we have our first clue as to whether our data is normal or not. The plot on the left shows
multiple peaks, and we suspect it is not normal. Looking at the QQ-plot reveals that many
points deviate away from our line, presenting as an s-shaped curve. These 2 lines of
evidence indicate that the data may not be normally distributed.
But looking at plots is rather subjective, and 2 different scientists could come to different
conclusions about the same data. Let’s talk about the Shapiro-Wilk test.
This test is very easy to run and gives us an objective way to determine if some data is
normally distributed or not. But many newcomers interpret the results from the test
incorrectly. Let’s recall our hypothesis testing and p-value interpretations.
The Null hypothesis for the Shapiro-Wilk test is that “The data is normally distributed”
The Alternative hypothesis for the Shapiro-Wilk test is that “The data deviates from a
normal distribution”
Thus, we want our p-value to NOT be significant (p >.05) if we are looking for normally
distributed data. The code below runs the Shapiro-Wilk test for each dataset, x and y,
above.
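# assuming x is the normally distributed data and y is the skewed data simulated above
shapiro.test(x) # expect p > .05 for the normal data
shapiro.test(y) # expect p < .05 for the non-normal data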
Our graphs above already contain the p-values for each dataset in the title. We can see
that the normally distributed data has a p-value of .7595, far from our significance cutoff
of .05, indicating that the test is non-significant and we fail to reject the null hypothesis.
The data is normal.
The second graph has a p-value far below .05 (<.00000000001), indicating that our
results are significant and we can reject the null hypothesis. The data is NOT normal.
Visual inspections of the data, QQ-plots, and the Shapiro-Wilk test enable us to see whether
the data is normally distributed or not. If the data is not normal, data transformation is
often the first step and is explained in a future lesson. We can also use non-parametric
tests, which are tests that do not require a normal distribution. While not covered in this
course, see the bonus material section to find more information on these types of tests.
Homogeneity of variances
Homogeneity of variances assumes the variance of the dependent variable is equal
across different groups in our data. In biostatistics, this assumption is important because
many statistical tests, such as ANOVA, require that the variance of the dependent
variable is equal across all groups being compared.
For example, let’s say we are analyzing average height for 3 different populations of Snake plant.
The ANOVA test, which compares means between groups, may be well suited for this
analysis, but it assumes that the variation in that height is equal between each
population. The mean height for each group might differ, but the variation around those
means should not.
We’ll explore homogeneity of variances by examining our data, looking at Residuals vs
Fitted plots, and running Bartlett’s test.
The code below creates normally distributed data split into 3 groups with equal
variance. The Residuals vs Fitted plot can be obtained by plotting an ANOVA output, as in the sketch below.
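A minimal sketch of that setup (the group means of roughly 30, 19, and 40 match the plots described below; the exact values, column names, and group labels are assumptions for illustration):

set.seed(123456)
groupA <- data.frame(height = rnorm(50, 30, 3), population = "A")
groupB <- data.frame(height = rnorm(50, 19, 3), population = "B")
groupC <- data.frame(height = rnorm(50, 40, 3), population = "C")
plant_equal <- rbind(groupA, groupB, groupC)
boxplot(height ~ population, data = plant_equal)              # equal spread expected in each group
plot(aov(height ~ population, data = plant_equal), which = 1) # Residuals vs Fitted plot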
Looking at the boxplot on the right, we can see that the variances for each group look
approximately equal. Specifically, we look at the whiskers of the boxplot and see that
each pair spans approximately the same distance.
The residuals vs fitted plot on the left shows a similar pattern. Each cluster of points
(seen at x = ~ 19, ~31, and ~40) relates to each of our groups of data. Those points
describe the spread of data for each group, while the red line follows the center of each
group.
Here we look for 2 things: the red line should stay roughly flat and centered near 0, and the vertical spread of points should be similar for each group.
set.seed(123456)
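# a minimal sketch of the non-homogeneous data: the group means match the previous example,
# but each group now has a different spread (the specific standard deviations are assumptions for illustration)
groupA <- data.frame(height = rnorm(50, 30, 1),  population = "A")
groupB <- data.frame(height = rnorm(50, 19, 6),  population = "B")
groupC <- data.frame(height = rnorm(50, 40, 15), population = "C")
plant_unequal <- rbind(groupA, groupB, groupC)
boxplot(height ~ population, data = plant_unequal)
plot(aov(height ~ population, data = plant_unequal), which = 1)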
We can clearly see that there are different variances between the groups in our boxplot.
We should also be aware that each of the groups have the same mean as their
counterparts in the previous graph (e.g. Group A has a mean of 30 for both
homogeneous and non-homogeneous data). The only difference between these datasets
is the variances.
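Bartlett’s test gives us a more formal check. A minimal sketch, using the two data frames from the sketches above:

bartlett.test(height ~ population, data = plant_equal)   # equal variances: expect a non-significant p-value
bartlett.test(height ~ population, data = plant_unequal) # unequal variances: expect a significant p-value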
Running these analyses over our data, we get a p-value of .8826 for our first data set,
meaning our data is not significantly different and we fail to reject the null hypothesis.
Our variances are equal between groups.
For our second data set, the p-value is <.0000001, far below our alpha cutoff of .05.
Thus, our data is significantly different and we reject the null hypothesis. Our variances
are NOT equal between groups.
Independence
Independence assumes that the values being measured or observed are not related to
one another in any systematic way. Independence usually is accounted for during the
data collection process of the study by ensuring the data collected is not dependent on
other variables. However, many metrics are by their very nature non-independent. For
example, temperature and rainfall can be correlated if you are comparing temperate
forests to deserts.
set.seed(123456)
library(tidyverse)
#Create the dataframe. Elevation and forest_cover are intentionally dependent on other independent variables
df <- data.frame(richness = 1:100)%>%
mutate(temp = (richness/2)+runif(100,20,40),
humidity = (richness/10)+runif(100,60,90),
elevation = (temp*10)+runif(100,200,400),
forest_cover = (humidity*.1)+runif(100,20,90))
cor(df$temp,df$humidity) #Significant
## [1] 0.4872549 #Below our .5 cutoff. Independent.
cor(df$temp,df$elevation) #Significant
## [1] 0.9360444 #Above our .5 cutoff. Not independent.
Based on our results, we have 2 significant correlations. One being the interaction
between temperature and humidity and the other being temperature and elevation.
However, using .5 as our cutoff value, only the interaction between temperature and
elevation is considered not independent. Note that .5 is an arbitrary cutoff and can
change depending on your study.
The other interactions (Temp x forest cover and forest cover x humidity) are not
significantly correlated, meaning they are independent of one another. Of course, for a
simple dataset with only a few variables, running the correlation for each pair is fairly
straightforward.
We can also create a correlation matrix by supplying the cor() function with all of our
variables we wish to compare. This will run a correlation for every pair of variables.
Here, the values above our .5 cutoff are bolded. Note that the center diagonal line is all
1 and the correlations are mirrored across that line. Richness is also included, but here we are mainly concerned with the correlations among our predictor variables.
cor(df)
##                richness       temp    humidity  elevation forest_cover
## richness      1.0000000  0.9106348  0.50239339  0.8731024  -0.14054310
## temp          0.9106348  1.0000000  0.48725486  0.9360444  -0.14483789
## humidity      0.5023934  0.4872549  1.00000000  0.5257696  -0.09998881
## elevation     0.8731024  0.9360444  0.52576956  1.0000000  -0.11944915
## forest_cover -0.1405431 -0.1448379 -0.09998881 -0.1194492   1.00000000
Another method is the vif() function from the car package. We supply vif() with a simple model and
from there it identifies non-independent data.
#vif() from the car package makes this process much easier. Values above 5 are considered dependent on one another.
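# a minimal sketch (the model formulas are assumptions for illustration; vif() comes from the car package)
library(car)                          # install.packages("car") if needed
full_model <- lm(richness ~ temp + humidity + elevation + forest_cover, data = df)
vif(full_model)                       # temp and elevation should both exceed 5 here
reduced_model <- lm(richness ~ temp + humidity + forest_cover, data = df)
vif(reduced_model)                    # with elevation dropped, temperature should fall below 5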
Notice how temperature is now below 5. This shows us that the variables are no longer
dependent on one another and we can continue to run our statistical tests! If our study
did want to see how elevation affects richness, we could rerun the analysis without the
temperature data.
Data Transformations
1. Test if your data needs transforming: If your data is already normally distributed, for
example, there is no need to transform it.
2. Choose a transformation: Common choices include log and square root
transformations, depending on the shape of your data.
3. Transform the data: Apply the chosen transformation to the data. This can be easily
done in R.
4. Check the assumptions: After transforming the data, check if the assumptions of
your statistical test are met. If not, try a different transformation or use a different
statistical test.
5. Perform the statistical test: Once the data meets the assumptions of the statistical
test, perform the test!
The code below shows a typical workflow for transforming data that does not meet the
requirements for normality.
set.seed(123456)
x <- rlnorm(1000) #simulate some heavily right-skewed data (rlnorm() is used here purely for illustration)
plot(density(x), xlab = "Shapiro Test P value: <0.0001", main = "Not Normal Data") #plot the data
shapiro.test(x) #Shapiro test. Remember, non-significant is what we want
In the code above, we have some data that is heavily right skewed. Naturally, our
Shapiro-Wilk test for normality returns with a highly significant p value (.0001).
Remember, the null hypothesis for this test is “Data is normally distributed”, so by
rejecting our null hypothesis we are saying there is evidence for the alternative (Data is not
normally distributed).
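A minimal sketch of the transformation step (log10() is used here because the back-transformation later in this lesson uses 10^x):

x_log <- log10(x)                    #base-10 log transform of the skewed data
plot(density(x_log), main = "Log Transformed Data")
shapiro.test(x_log)                  #should now be non-significant (p > .05)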
After log transforming the data with the code above and rerunning the Shapiro-Wilk test,
we get what appears to be a normal distribution! The data follows a bell shaped curve
and our p-value is not significant. If our statistical test requires a normal distribution, this
would be a very valid transformation to apply!
While we could report the mean of each log transformed group, it wouldn’t make much
sense! What exactly does the log transformed frog abundance mean anyway? We can
see there are MORE frogs in lower streams, but no intuitive understanding of how many
more.
This is where we would back transform the data.
To back transform, it really is as easy as reversing the transformation. For example, a
log transformation in R can be undone with the exp() function, as the log() function is a
natural log. In the table below, we do 10^x as our data transformation, as our log
transformation used base 10. In terms of abundance, this converts our mean
abundance to 17 frogs in upper streams and 21.8 frogs in lower streams. That result
makes much more sense!
#back transform our log base 10 transformed data. If done with natural log, you would use the exp() function
transformed_upper_mean <- 10^upper_mean
transformed_upper_CI_minimum <- 10^(upper_mean-upper_CI)
transformed_upper_CI_maximum <- 10^(upper_mean+upper_CI)
The code below shows the inverse of various data transformations. Note that
back-transformations are simply the reverse operations: to back transform square-root
transformed data you square the data, and to back transform squared data you take the
square root of the data.
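A minimal sketch of those inverses (assuming x is a vector of positive values):

x_log10 <- log10(x);  back_log10 <- 10^x_log10  #base-10 log and its inverse
x_ln    <- log(x);    back_ln    <- exp(x_ln)   #natural log and its inverse
x_sqrt  <- sqrt(x);   back_sqrt  <- x_sqrt^2    #square root and its inverse
x_sq    <- x^2;       back_sq    <- sqrt(x_sq)  #square and its inverse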
There are three questions you should ask before running any statistical test:
T-test
T-tests are statistical analyses that simply tell us if the differences between 2 means
are significant or not. With t-tests we could test for differences in mean limb length
between 2 rodent populations, daily temperature between 2 plots of land, or the time
spent foraging between a control and an experimental treatment.
Assumptions:
Homogeneity of variances: the variances of the groups being compared are equal
(for unpaired t-tests)
Unpaired T-test
In R, the t-test is run using the function t.test(). We’ll set paired = FALSE for the
unpaired t.test and paired = TRUE for the paired. The variable we put first
(body_mass_g1) will be considered the X variable, while the variable in the second
position (body_mass_g2) is considered Y.
set.seed(123456)
body_mass_g1 = rnorm(100,60,12)
body_mass_g2 = rnorm(100,45,12)
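t.test(body_mass_g1, body_mass_g2, paired = FALSE) #unpaired t-test; this call produces the output below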
##
## Welch Two Sample t-test
##
## data: body_mass_g1 and body_mass_g2
## t = 9.3705, df = 197.98, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 12.41382 19.03149
## sample estimates:
## mean of x mean of y
## 60.20184 44.47918
The results will show us the p-value, the confidence interval, and the means of each
group. The t-value and degrees of freedom are used to calculate the p-value.
In short, the t-value tells us how large the difference between the groups is relative to
the variation in the data. The larger its magnitude, the greater the difference between the
2 groups; the closer to 0, the smaller the difference. In conjunction with our degrees of freedom, R calculates
the p-value. Remember, we’ll use a cutoff of .05 for significance.
The confidence interval shows you the interval for the mean differences between the
samples. We can find the mean difference simply by subtracting one sample mean from
the other (in this case mean of X is 60 and mean of Y is 44). Remember, our null
hypothesis is that this difference would be equal to 0.
effsize::cohen.d(body_mass_g1,body_mass_g2)
## Cohen's d
##
## d estimate: 1.325185 (large)
## 95 percent confidence interval:
## lower upper
## 1.017207 1.633163
All in all we would interpret our results as such: “Given our low p-value we are able to
reject the null hypothesis and provide evidence for a significant difference between
mean body mass in the 2 groups. We find a large effect size provided by Cohen’s D
indicating a substantial difference between the two groups, with a mean mass of 60g for
group 1 and a mean mass of 44g for group 2. We find a mean difference of 16g with a
95% confidence interval of 12g-19g.”
Paired T-test
Now let’s do a paired t-test. Instead of body mass, we’ll measure mean foraging time of
Mule deer on 10 plots of land before and after the plots undergo a prescribed burn. As a
reminder, this is a paired t-test because we are measuring the same variable (foraging
time) on the same plots of land. Thus our test is dependent on those plots.
Our code is almost identical, however this time we have set paired = TRUE to indicate
we want to run a paired T-test.
set.seed(123456)
foraging_time_preburn = rnorm(100,10,2)
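#foraging_time_postburn is assumed to be a matching vector of post-burn foraging times (its simulation is not shown here)
t.test(foraging_time_preburn, foraging_time_postburn, paired = TRUE) #paired t-test; produces the output below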
##
## Paired t-test
##
## data: foraging_time_preburn and foraging_time_postburn
## t = -2.8538, df = 99, p-value = 0.005261
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -1.4910978 -0.2680168
## sample estimates:
## mean difference
## -0.8795573
Our results output contains largely the same information as our unpaired t-test: a
t-value, degrees of freedom, p-value, and confidence intervals. The only difference is that
we do not get the mean of each group. Rather, we get the mean difference between the
groups. We could of course get the mean for each group by running
mean(foraging_time_preburn) in R, but lets continue on with our effect size calculation.
effsize::cohen.d(foraging_time_postburn,foraging_time_preburn)
##
## Cohen's d
##
## d estimate: 0.4448012 (small)
## 95 percent confidence interval:
## lower upper
## 0.1624883 0.7271141
And now we’ve done a full t-test! We should take note of a few things though before we
make our interpretation. Our result is significant (the p-value is less than .05); however, our
effect size is considered small. Remember back to the first section of our course when
we talked about the difference between significant and meaningful results?
This is a case where you would need to approach these results cautiously. There is a
detectable difference with a sufficiently low p-value, but how meaningful is that
difference? Statistically, it’s a very small difference; any smaller and it may not even be detectable.
Correlation
Assumptions:
Linearity: the relationship between the two variables being correlated is linear
Homoscedasticity: the variance of the residuals is constant across all levels of the
predictor variable(s)
set.seed(123456)
riverspeed <- rnorm(100,7,1) #simulate river speeds (the same data is reused in the regression section below)
turtle_length <- riverspeed + rnorm(100,30,1) #turtle lengths that increase with river speed
cor.test(riverspeed,turtle_length)
##
## Pearson's product-moment correlation
##
## data: riverspeed and turtle_length
## t = 8.0591, df = 98, p-value = 1.897e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4964851 0.7364323
## sample estimates:
## cor
## 0.6313361
Our output provides us with the t-value, the degrees of freedom, and a p-value that we can
interpret in the same way as our t-test output. We then get a 95% confidence interval for
our correlation coefficient, which itself is shown directly below under cor. The correlation coefficient,
abbreviated as r, is what tells us how strong or weak our correlation is. i.e. it is a
measure of effect size! r can be positive or negative and is on a scale from -1 to 1. Its
distance away from 0 indicates the strength of measurable effect. e.g. a value of .2 is
considered a weak positive correlation while a value of .9 is a strong correlation. Of
course, this works for negative correlations as well with -.2 being a weak correlation and
-.9 being strong.
See this example below to better illustrate the effect size of correlations under 4 different
values. In a perfect correlation (1 or -1), all the points fall on the line of best fit. However,
with weaker and weaker correlations (closer to 0), the points fall farther and farther
away from the line of best fit. Ultimately, the line with slope of 0 indicates No correlation
between the two variables.
Linear Regression
Assumptions:
Homoscedasticity: the variance of the residuals is constant across all levels of the
predictor variable(s)
Null hypothesis: There is no relationship between X and Y. The slope of the line is 0.
Alternative hypothesis: There is a relationship between X and Y. The slope of the line is
not 0.
Linear models can be used to make predictions about the dependent variable based on
the value of the independent variable, and can also be used to test hypotheses about
the relationship between the variables.
In simple linear regressions, we are trying to calculate the equation of the line:
y = b0 + b1*X
set.seed(123456)
#Null Hypothesis: Slope is equal to 0
#Alternative Hypothesis: Slope is not equal to 0
riverspeed <- rnorm(100,7,1)
turtle_length <- riverspeed + rnorm(100,30,1)
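model <- lm(turtle_length ~ riverspeed) #fit the simple linear regression (this call is implied by the output below)
model #printing the model shows the call and coefficients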
##
## Call:
## lm(formula = turtle_length ~ riverspeed)
##
## Coefficients:
## (Intercept) riverspeed
## 31.4482 0.7874
summary(model)
##
## Call:
## lm(formula = turtle_length ~ riverspeed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.59387 -0.50706 -0.00978 0.63399 2.08698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.44816 0.69236 45.422 < 2e-16 ***
## riverspeed 0.78743 0.09771 8.059 1.9e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9659 on 98 degrees of freedom
## Multiple R-squared: 0.3986, Adjusted R-squared: 0.3924
## F-statistic: 64.95 on 1 and 98 DF, p-value: 1.897e-12
The first lines of the model summary show us the formula used. Then we have our
residuals. Residuals describe the difference between our observed and expected
points. If we remember back to our correlation section, if we have a perfect correlation
where all points fall on the line of best fit, our residuals would be 0, as there is no
difference between the expected and observed. However, as the relationship between the variables weakens, the points fall farther from the line and the residuals grow larger.
The coefficients section is the main section for understanding our model.
Mathematically, coefficients are values that estimate unknown parameters of a
population. In a simple linear regression with 1 independent and 1 dependent variable,
we will have 2 coefficients. Under the coefficients section we have each coefficient’s estimate as
well as its standard error, t-value, and p-value.
The first coefficient will be the intercept (a constant value) that simply tells us where the
line of best fit intersects on the y axis. Here we see our intercept is at 31.448 and it is
significant! We’ll expand on this in just a second.
The second coefficient is the slope of the data, which tells us how many units our
dependent variable changes for every 1 unit change in our independent variable. In this
example, turtle length increases by about 0.79 units for every 1 unit increase in river speed.
plot(riverspeed,turtle_length)
abline(lm(turtle_length ~ riverspeed))
Let’s first talk about our intercept. The intercept is simply where the line crosses the y
axis. In many biological cases, it will be significant EVEN IF THE INDEPENDENT
VARIABLES DO NOT AFFECT THE MODEL. This is critical when understanding our
models. The null hypothesis for this test is that the intercept does not differ from 0. Of
course, it makes biological sense that our intercept would not be 0. After all, our
turtle measurements are between ~25-35. None of them are near the 0 line. In many
cases, you would interpret the intercept in terms of what makes sense biologically.
Now we can clearly see that when X is equal to 0, turtle length would be estimated at
around 31.4.
Let’s also revisit our estimate of B1, the slope of our regression! Here the p-value is
significant, telling us that our independent variable has a significant effect on our
dependent variable. This directly relates back to our hypotheses, where our null
hypothesis (the slope is 0) is rejected in favor of the alternative.
Now let’s talk about the effect size. We would use the correlation coefficient (r) and R^2
to describe the effect size. We already covered r during correlation, but what is R^2?
cor(riverspeed,turtle_length)^2
## [1] 0.3985853
Simply put, it is r squared! This results in a value from 0-1 that quantifies the % of
variation in our dependent variable that can be explained by our independent variable.
Here our R^2 of .39, can be interpreted as 39% of the variation in turtle size can be
explained by water speed. Generally speaking R^2 values under .5 are considered
weak, between .5 and .7 are considered moderate, and .7-1 are considered strong
effect sizes.
We can also see this information in the last section of our model output (including an
adjusted R-squared, which accounts for the number of predictors in the model). This is
also where we can find a p-value for the overall model. In a simple linear regression, this
is typically the same as the slope’s p-value, but for models with more than 1 variable this will be different.
ANOVA
Assumptions:
Normality: the data is normally distributed within each group being compared
Independence: the observations are independent of one another within each group
being compared
set.seed(123456)
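# a minimal sketch of the missing group setup (the response name "mass" and the group means are assumptions for illustration)
g1 <- data.frame(mass = rnorm(50, 30, 10), habitat = "Forest")
g2 <- data.frame(mass = rnorm(50, 38, 10), habitat = "Grassland")
g3 <- data.frame(mass = rnorm(50, 45, 10), habitat = "Wetland")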
df <- rbind(g1,g2,g3)
Let’s first make a boxplot to understand what our data looks like. Notice that our boxplot
function uses the same formula as a linear model.
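A minimal sketch (the column names follow the setup sketched above):

boxplot(mass ~ habitat, data = df) #same formula syntax as a linear model: response ~ group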
From this data it’s a bit unclear if there are any significant differences. Let’s run our
ANOVA to get a better idea of the data. Here we use the function aov() and input our
variables similarly to our linear models in the last section. We’ll also use summary() to
get a full idea of our data.
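anova <- aov(mass ~ habitat, data = df) #fit the ANOVA (variable names follow the sketch above)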
summary(anova)
set.seed(123456)
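# a minimal sketch of the second setup: every habitat now shares the same mean, so there are no real group differences
g1 <- data.frame(mass = rnorm(50, 35, 14), habitat = "Forest")
g2 <- data.frame(mass = rnorm(50, 35, 14), habitat = "Grassland")
g3 <- data.frame(mass = rnorm(50, 35, 14), habitat = "Wetland")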
df <- rbind(g1,g2,g3)
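anova <- aov(mass ~ habitat, data = df) #refit the ANOVA on the no-difference data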
summary(anova)
Notice how different the sum of squares is in this example. For the grouping variable
habitat, which again now has no real differences between the groups, the sum of
squares is only 117, compared to the residuals’ 28,899! An extreme discrepancy! In this
case, the between-group variation is tiny relative to the within-group variation, and the p-value is not significant.
Back in our first analysis, however, we had a significant p-value! Meaning that there are
significant differences between groups in our data! Hooray! But let’s go back to our hypotheses:
Null: There are no differences in the means between groups
Alternative: There are differences in the means between groups
These hypotheses are simply stating that there are or are not differences in the means
between groups, it does not specify which groups have differences nor does it
specify how those differences appear.
That’s why we need a post-hoc test. A post-hoc test literally means a test you run after
the fact. Assuming that there is an overall significant difference in the ANOVA, we can
run Tukey’s post-hoc test to see what significant differences there are between groups.
This is easy enough with the TukeyHSD() function.
#summary(anova)
TukeyHSD(anova)
This test will run statistical tests for each pairing of groups in our ANOVA. Its output
gives us the differences in the means between the 2 groups, the lower bounds, the
upper bounds, and a p-value for each comparison. Note that 0 means the p value is
extremely small (.00000000000001 for example), and thus we could reject the null at
our .05 cutoff.
Chi-squared test
Assumptions:
Both variables are categorical – The numbers utilized are the counts of those
variable pairs (e.g. Juvenile birds at site 1, adult birds at site 2, etc.)
Mutually exclusive cells – This means that every observation is counted in exactly one cell.
When running a chi-squared test over these data, our hypotheses are:
Null: There is no difference between the expected and observed counts
Alternative: There is a difference in the expected and observed counts
In our example, we create a simple contingency table around birds that were
categorized according to two variables: plumage color and sex. Each individual is coded as either Green or Red, and as either male or female.
set.seed(123456)
#Null: The observed counts do not significantly differ from expected
#Alternative: The observed counts do significantly differ from expected
g2 <- c(rep("MALE",500),rep("FEMALE",500))
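# a minimal sketch of the missing setup: plumage color depends slightly on sex here, and these
# probabilities are assumptions chosen to roughly match the table below
g1 <- c(sample(c("Green","Red"), 500, replace = TRUE, prob = c(0.63, 0.37)),  # the 500 males
        sample(c("Green","Red"), 500, replace = TRUE, prob = c(0.75, 0.25)))  # the 500 females
df <- data.frame(g1, g2)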
table(df)
## g2
## g1 FEMALE MALE
## Green 376 314
## Red 124 186
We use the table() function to create our contingency table based on our data. We can
see that our data set has more Green individuals (376+314 = 690) than red individuals
(124+186 = 310). Additionally, we have equivalent numbers of Female (376+124 = 500)
and Male (314+186 = 500) birds. Now the question is, are there any significant
differences between these groups? In essence, do we have any combinations of color
and sex that are observed significantly more or less than we expect?
We’ll first use chisq.test() to get at this question.
chisq.test(table(df)) #R default (applies Yates' continuity correction for 2x2 tables)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(df)
## X-squared = 17.396, df = 1, p-value = 3.035e-05
#Lets use CrossTable from the gmodels package to get a better understanding of our data
#This function does not require a contingency table, we can supply the data directly
#We will want to set expected = TRUE. This does 2 things:
# 1) It shows us the expected values from the dataset. Useful for interpretation!
# 2) It runs the ChiSq test. We can also set chisq = TRUE if we want instead
#We then want to set sresid = TRUE. This will show us the standardized residuals as a Z-score, allowing us to tell what groups are significant. We use the cutoff +-1.96
#Then we add format = "SPSS". This formats in a way that we can actually see the residuals
#install.packages("gmodels") #Needed if you haven't installed the gmodels package yet
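# the call itself, implied by the comments above and the output below
library(gmodels)
CrossTable(df$g1, df$g2, expected = TRUE, sresid = TRUE, format = "SPSS")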
##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## | Std Residual |
## |-------------------------|
##
## Total Observations in Table: 1000
##
## | df$g2
## df$g1 | FEMALE | MALE | Row Total |
## -------------|-----------|-----------|-----------|
## Green | 376 | 314 | 690 |
## | 345.000 | 345.000 | |
## | 2.786 | 2.786 | |
## | 54.493% | 45.507% | 69.000% |
The output is quite large but we can break it down step by step. Each cell in the table
corresponds to one of the variable pairs (e.g. Male-Green, Female-Red). At the top of
our output we have the key for the cell contents. We can see the observed and
expected counts of our dataset (again, this is what Chi-Squared is testing).
Then we have the percent contribution for rows, columns, and the total dataset. e.g. If
there were 10 observations in a row of 20 total observations, its row contribution would
be 10/20 or 50%!
Then we have our standardized residuals. Remember back to our sections early on in
the course about probability distributions and standard deviations. We said that values
falling over 1.96 (rounded to 2) or under -1.96 (rounded to -2) are considered
statistically significant at a confidence level of 95%. These standardized residuals are
based directly on this cutoff and we can use them to determine significant groups!
Based on the results of our Chi-Squared test (X^2 = 17.396; P <.0000001) we detect a
significant association between bird coloration and their sex. We are able to reject our
null hypothesis as we have evidence for the alternative that expected counts
significantly differ from the observed. Using our standardized residuals we find no
significant difference between green coloration and the sex of the bird. In essence,
Green birds are roughly equally likely to be female (SRES = 1.669) or male (SRES = -1.669).
However, Red birds are significantly less likely to be females (SRES = -2.49) and vice
versa more likely to be male (SRES = 2.49). There may be some type of sexual
selection or other behavioral differences between the sexes that creates this pattern,
however the exact reasons are beyond the scope of this analysis.
Bartlett Test - A statistical test used to assess the homogeneity of
variances across different groups or levels of an independent variable. It tests the null
hypothesis that variances are equal between groups.
Bimodal Distribution - A distribution with two distinct peaks or modes.
Binomial Distribution - A probability distribution modeling binary outcomes like yes-no
or success-fail for discrete data.
Categorical Variable - A variable that represents categories or groups.
Central Limit Theorem - A statistical theory stating that the distribution of sample
means approximates a normal distribution.
Chi-Square Test - A statistical test used to analyze contingency tables and assess
associations between categorical variables.
Coefficient of Variation (CV) - A relative measure of variability calculated as the ratio
of the standard deviation to the mean.
Coefficients - Values that estimate unknown parameters of a population in a regression
model.
Cohen's d - Effect size measure indicating the difference between two means.
Confidence Interval (CI) - A confidence interval (CI) is a range of values that is likely to
contain the true population parameter with a specified level of confidence. Common