By Dillon Jones
Preface
Howdy! The text below provides a conceptual understanding of statistics with a specific
focus on analyzing biological data in the R coding environment. While teaching
biostatistics at the university level, I found that my students had 2 main struggles: not
understanding core statistical concepts and making simple errors in R. This text aims to
tackle the former problem. If you’re interested in solving the latter, check out my
Fundamentals of R for Biologists material.
Interestingly, I found that mathematical knowledge was rarely a barrier. This realization
surprised me, as statistics has traditionally been taught under a math-focused
curriculum: memorizing equations, plugging in variables, and keeping track of often
confusing symbols.
In my opinion, this is why statistics, despite being so critical to all our scientific
endeavors, is often the weakest skill among scientists.
This text, Fundamental Biostatistics, gives its readers the conceptual understanding
needed to actually understand statistical theory and then apply that knowledge to
biological datasets. Mathematics is largely ignored in this text in favor of letting R run
the calculations and statistical tests.
For many, understanding basic statistical concepts and being able to interpret results
using those concepts is all that is ever needed. After all, with tools such as R, online
calculators, and collaborators from the statistics department, many scientists don’t
require in-depth statistical training to run and interpret simple t-tests, correlations, and
linear regressions. This book is tailor-made for those individuals.
However, for those who do want to understand the mathematics, it is nearly impossible
to do so without understanding the concepts laid out here. If this is you, this text should
serve as a springboard into more advanced discussions of statistics.
Introduction to statistics
What is Statistics?
Statistics is a branch of mathematics that involves collecting, analyzing, and interpreting
data. It provides the tools that let us make sense of data, interpret results from
analyses, and test hypotheses. In the context of biology, statistics is used to design and
analyze experiments, draw conclusions from biological data, and communicate our
results to others.
This course covers the basics of statistics with a focus on a conceptual understanding,
rather than a mathematical understanding. We will cover common statistical tests used
in biological studies and their interpretation.
Lastly, statistics can be performed in a wide range of software and even by hand.
However, this course covers statistics using the R programming language. While we do
not have time to cover the basics of R, we will keep the explanations and code as
straightforward as possible.
Hypothesis testing
Hypothesis testing is an essential component of statistics, and a proper understanding of it is
critical to interpreting our statistical analyses. Typically, we utilize a null hypothesis (there is
no effect or difference) and an alternative hypothesis (there is an effect or difference).
Let’s show some examples. On a probability distribution, the y-axis indicates the
probability density, or how probable a given event is, and the x-axis corresponds to
some outcome.
Below is a figure displaying the number of bird species detected in 100 different plots. Some
plots had 2 species while others had 3 species or 4 species, and so on. We
can then create a probability distribution that shows the proportion of times
that each number of species was detected. In the example below, 30 out of 100 plots had 5
species, so it would have a proportional frequency of .3.
But we are not limited to single values! We can ask “What is the probability of a plot
having 5 or more bird species?”
We can actually sum the probabilities of having 5, 6, and 7 bird species together and
come up with .6 (60%)!
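A minimal sketch of these calculations in R (the counts below are assumptions chosen to match the proportions described above):

species_counts <- c(rep(2, 10), rep(3, 10), rep(4, 20), rep(5, 30), rep(6, 20), rep(7, 10)) # 100 plots
prob_dist <- table(species_counts) / length(species_counts) # proportional frequency of each outcome
prob_dist                        # 5 species has a proportional frequency of .3
sum(prob_dist[c("5", "6", "7")]) # probability of 5 or more species: .6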
The smaller the p-value, the stronger the evidence against the null hypothesis and the
more confident we can be in rejecting it.
In most biological studies, we use a p-value threshold of .05. This means that at a p-value
of .05, we would expect a false positive from random noise about 5% of the time.
Any value equal to or lower than .05 tells us that our result is significant and we are able
to reject our null hypothesis. This .05 value is linked directly to the common 95%
confidence interval. Keep in mind that this threshold is arbitrary, and many statisticians
recommend using a stricter threshold such as .01.
The p-value, and significance in general, is also the most misunderstood metric in statistics. As we’ll
discover in our next lesson, significant results do NOT mean meaningful results. Rather,
they tell us “we have enough data to detect a difference”.
Say we went and measured the toe length from 2 populations of lizards. After collecting
thousands of measurements, we find a significant difference between the two
populations.
While both results allow us to reject the null hypothesis, because they are both
significant, we would use our understanding of the biology to say that one is biologically
meaningful while the other is not.
Cohen’s d is a measure of effect size that describes the difference between two
means. Generally, a Cohen’s d of 0.3-0.5 is considered a small effect size, 0.5-0.8 a
moderate effect size, and 0.8 or higher a large effect size. Cohen’s d is calculated by
taking the difference between the means of two groups and dividing it by the pooled
standard deviation.
With that calculation, we can say that a Cohen’s d of 1 indicates that the mean
difference between populations is equal to 1 standard deviation. A Cohen’s d of .5 tells
us that the mean difference is only half a standard deviation.
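A minimal sketch of that calculation (group1 and group2 are hypothetical measurement vectors; the simulated means and standard deviations are assumptions for illustration):

set.seed(1)
group1 <- rnorm(50, mean = 10, sd = 2)
group2 <- rnorm(50, mean = 12, sd = 2)
pooled_sd <- sqrt((var(group1) + var(group2)) / 2) # pooled standard deviation (equal group sizes)
(mean(group2) - mean(group1)) / pooled_sd          # Cohen's d: the mean difference in units of SD, ~1 here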
There are many metrics that can be used for determining effect size and they depend
on which statistical test you are performing. Another common one is the correlation
coefficient. We’ll cover it in depth during our correlation section, but it is a value from -1
to 1 that tells us the direction and strength of the relationship between 2 continuous
variables. We can show that relationship according to how tightly clustered points are to
a line of best fit.
Binomial Distribution
The binomial distribution describes the frequency of “successful outcomes” across a set
of trials. True to its name, binomial distributions are binary in that
each trial has only 1 of 2 outcomes: yes-no, success-fail, presence-absence.
Along the x axis, we describe the number of successful (or unsuccessful) trials during
our study. A biological example may be the probability of survival for individuals each year.
We could have the probability of surviving 1 year, 2 years, 3 years, and so on until we
reach a known maximum. This distribution is binomial because at each year, the
outcomes are only 1 of 2 options: survive or don’t survive. Another example may be the
presence (or absence) of a particular gene in multiple populations, where on the x axis
we would have the number of individuals in each population with that genotype.
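As a small illustration of the idea, here is a minimal sketch (the 10 individuals and the survival probability of 0.7 are assumptions for illustration):

survivors <- 0:10
probs <- dbinom(survivors, size = 10, prob = 0.7)  # binomial probability of each possible number of survivors
barplot(probs, names.arg = survivors,
        xlab = "Number of survivors", ylab = "Probability")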
Normal Distribution
The normal distribution is a continuous probability distribution that is often used in
biology as well as many other fields. The normal distribution is characterized by its bell-
shaped curve, with the mean, median, and mode all being equal. Many biological
measurements, such as the height of individuals in a population or the weight of seeds
produced by a plant, can be modeled using the normal distribution. The normal
distribution is important in statistical analysis because many statistical tests rely on the
assumption that the data is normally distributed.
Normal distribution
The normal distribution assumption is a common assumption in statistical tests. As a
reminder, a normal distribution is continuous data where the mean, median, and
mode are all equal, typically displaying as a bell-shaped curve. Normally distributed data
is probably the most important assumption as many statistical tests are designed with
this distribution in mind.
Normality can be checked in R through visual checks like QQ-plots or statistical tests
like the Shapiro-Wilk test. Here we will break down how to interpret each of these tests.
A QQ-plot can be generated for continuous data in R using the following code:
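# a minimal sketch, assuming x is a numeric vector of continuous measurements
qqnorm(x) # plot the sample quantiles against theoretical normal quantiles
qqline(x) # add a reference line; points near the line suggest normality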
Below, we have normally distributed data. We’ll talk about the Shapiro Wilk Test in just a
moment, so for now look at the shape of the data on the left and the QQ-plot on the
right.
set.seed(123456)
y <- numeric(1000) # pre-allocate the vector we fill below
for(i in 1:1000){ # pull a random number and add it to i. This data will not be normal.
  y[i] <- runif(1,0,100)+i
}
Here we have our first clue as to whether our data is normal or not. The plot on the left shows
multiple peaks, and we suspect it is not normal. Looking at the QQ-plot reveals that many
points deviate away from our line, presenting as an s-shaped curve. These 2 lines of
evidence indicate that the data may not be normally distributed.
But looking at plots is rather subjective, and 2 different scientists could come to different
conclusions about the same data. Let’s talk about the Shapiro-Wilk test.
This test is very easy to run and gives us an objective way to determine if some data is
normally distributed or not. But many newcomers interpret the results from the test
incorrectly. Let’s recall our hypothesis testing and p-value interpretations.
The Null hypothesis for the Shapiro-Wilk test is that “The data is normally distributed”
The Alternative hypothesis for the Shapiro-Wilk test is that “The data deviates from a
normal distribution”
Thus, we want our p-value to NOT be significant (p >.05) if we are looking for normally
distributed data. The code below runs the Shapiro-Wilk test for each dataset, x and y,
above.
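# assuming x is the normally distributed data and y is the skewed data simulated above
shapiro.test(x) # expect p > .05 for the normal data
shapiro.test(y) # expect p < .05 for the non-normal data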
Our graphs above already contain the p-values for each dataset in the title. We can see
that the normally distributed data has a p-value of .7595, far from our significance cutoff
of .05, indicating that the test is non-significant and we fail to reject the null hypothesis.
The data is normal.
The second graph has a p-value far below .05 (<.00000000001), indicating that our
results are significant and we can reject the null hypothesis. The data is NOT normal.
Visual inspections of the data, QQ-plots, and the Shapiro-Wilk test enable us to see whether
the data is normally distributed or not. If the data is not normal, data transformation is
often the first step and is explained in a future lesson. We can also use non-parametric
tests, which are tests that do not require a normal distribution. While not covered in this
course, see the bonus material section to find more information on these types of tests.
Homogeneity of variances
Homogeneity of variances assumes the variance of the dependent variable is equal
across different groups in our data. In biostatistics, this assumption is important because
many statistical tests, such as ANOVA, require that the variance of the dependent
variable is equal across all groups being compared.
For example, let’s say we are analyzing average height for 3 different populations of Snake plant.
The ANOVA test, which compares means between groups, may be well suited for this
analysis, but it assumes that the variation in that height is equal between each
population. The mean height for each group might differ, but the variation around those
means should not.
We’ll explore homogeneity of variances by examining our data, looking at Residuals vs
Fitted plots, and running Bartlett’s test.
The code below creates normally distributed data split into 3 groups with equal
variance. The Residuals vs Fitted plot can be obtained by plotting an ANOVA output, as in the sketch below.
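A minimal sketch of that setup (the group means of roughly 30, 19, and 40 match the plots described below; the exact values, column names, and group labels are assumptions for illustration):

set.seed(123456)
groupA <- data.frame(height = rnorm(50, 30, 3), population = "A")
groupB <- data.frame(height = rnorm(50, 19, 3), population = "B")
groupC <- data.frame(height = rnorm(50, 40, 3), population = "C")
plant_equal <- rbind(groupA, groupB, groupC)
boxplot(height ~ population, data = plant_equal)              # equal spread expected in each group
plot(aov(height ~ population, data = plant_equal), which = 1) # Residuals vs Fitted plot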
Looking at the boxplot on the right, we can see that the variances for each group look
approximately equal. Specifically, we look at the whiskers of the boxplot and see that
each pair spans approximately the same distance.
The residuals vs fitted plot on the left shows a similar pattern. Each cluster of points
(seen at x = ~ 19, ~31, and ~40) relates to each of our groups of data. Those points
describe the spread of data for each group, while the red line follows the center of each
group.
Here we look for 2 things: the red line should stay roughly flat and centered near 0, and the vertical spread of points should be similar for each group.
set.seed(123456)
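# a minimal sketch of the non-homogeneous data: the group means match the previous example,
# but each group now has a different spread (the specific standard deviations are assumptions for illustration)
groupA <- data.frame(height = rnorm(50, 30, 1),  population = "A")
groupB <- data.frame(height = rnorm(50, 19, 6),  population = "B")
groupC <- data.frame(height = rnorm(50, 40, 15), population = "C")
plant_unequal <- rbind(groupA, groupB, groupC)
boxplot(height ~ population, data = plant_unequal)
plot(aov(height ~ population, data = plant_unequal), which = 1)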
We can clearly see that there are different variances between the groups in our boxplot.
We should also be aware that each of the groups have the same mean as their
counterparts in the previous graph (e.g. Group A has a mean of 30 for both
homogeneous and non-homogeneous data). The only difference between these datasets
is the variances.
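Bartlett’s test gives us a more formal check. A minimal sketch, using the two data frames from the sketches above:

bartlett.test(height ~ population, data = plant_equal)   # equal variances: expect a non-significant p-value
bartlett.test(height ~ population, data = plant_unequal) # unequal variances: expect a significant p-value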
Running these analyses over our data, we get a p-value of .8826 for our first data set,
meaning our data is not significantly different and we fail to reject the null hypothesis.
Our variances are equal between groups.
For our second data set, the p-value is <.0000001, far below our alpha cutoff of .05.
Thus, our data is significantly different and we reject the null hypothesis. Our variances
are NOT equal between groups.
Independence
Independence assumes that the values being measured or observed are not related to
one another in any systematic way. Independence usually is accounted for during the
data collection process of the study by ensuring the data collected is not dependent on
other variables. However, many metrics are by their very nature non-independent. For
example, temperature and rainfall can be correlated if you are comparing temperate
forests to deserts.
set.seed(123456)
library(tidyverse)
#Create the dataframe. Elevation and forest_cover are intentionally dependent on other independent variables
df <- data.frame(richness = 1:100)%>%
mutate(temp = (richness/2)+runif(100,20,40),
humidity = (richness/10)+runif(100,60,90),
elevation = (temp*10)+runif(100,200,400),
forest_cover = (humidity*.1)+runif(100,20,90))
cor(df$temp,df$humidity) #Significant
## [1] 0.4872549 #Below our .5 cutoff. Independent.
cor(df$temp,df$elevation) #Significant
## [1] 0.9360444 #Above our .5 cutoff. Not independent.
Based on our results, we have 2 significant correlations. One being the interaction
between temperature and humidity and the other being temperature and elevation.
However, using .5 as our cutoff value, only the interaction between temperature and
elevation is considered not independent. Note that .5 is an arbitrary cutoff and can
change depending on your study.
The other interactions (Temp x forest cover and forest cover x humidity) are not
significantly correlated, meaning they are independent of one another. Of course, for a
simple dataset with only a few variables, running the correlation for each pair is fairly
straightforward.
We can also create a correlation matrix by supplying the cor() function with all of our
variables we wish to compare. This will run a correlation for every pair of variables.
Here, the values above our .5 cutoff are bolded. Note that the center diagonal line is all
1 and the correlations are mirrored across that line. Richness is also included, but here we are mainly concerned with the correlations among our predictor variables.
cor(df)
##                richness       temp    humidity  elevation forest_cover
## richness      1.0000000  0.9106348  0.50239339  0.8731024  -0.14054310
## temp          0.9106348  1.0000000  0.48725486  0.9360444  -0.14483789
## humidity      0.5023934  0.4872549  1.00000000  0.5257696  -0.09998881
## elevation     0.8731024  0.9360444  0.52576956  1.0000000  -0.11944915
## forest_cover -0.1405431 -0.1448379 -0.09998881 -0.1194492   1.00000000
Another method is the vif() function from the car package. We supply vif() with a simple model and
from there it identifies non-independent data.
#vif() from the car package makes this process much easier. Values above 5 are considered dependent on one another.
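# a minimal sketch (the model formulas are assumptions for illustration; vif() comes from the car package)
library(car)                          # install.packages("car") if needed
full_model <- lm(richness ~ temp + humidity + elevation + forest_cover, data = df)
vif(full_model)                       # temp and elevation should both exceed 5 here
reduced_model <- lm(richness ~ temp + humidity + forest_cover, data = df)
vif(reduced_model)                    # with elevation dropped, temperature should fall below 5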
Notice how temperature is now below 5. This shows us that the variables are no longer
dependent on one another and we can continue to run our statistical tests! If our study
did want to see how elevation affects richness, we could rerun the analysis without the
temperature data.
Data Transformations
1. Test if your data needs transforming: If your data is already normally distributed, for
example, there is no need to transform it.
2. Choose a transformation: Common choices include log and square root
transformations, depending on the shape of your data.
3. Transform the data: Apply the chosen transformation to the data. This can be easily
done in R.
4. Check the assumptions: After transforming the data, check if the assumptions of
your statistical test are met. If not, try a different transformation or use a different
statistical test.
5. Perform the statistical test: Once the data meets the assumptions of the statistical
test, perform the test!
The code below shows a typical workflow for transforming data that does not meet the
requirements for normality.
set.seed(123456)
x <- rlnorm(1000) #simulate some heavily right-skewed data (rlnorm() is used here purely for illustration)
plot(density(x), xlab = "Shapiro Test P value: <0.0001", main = "Not Normal Data") #plot the data
shapiro.test(x) #Shapiro test. Remember, non-significant is what we want
In the code above, we have some data that is heavily right skewed. Naturally, our
Shapiro-Wilk test for normality returns with a highly significant p value (.0001).
Remember, the null hypothesis for this test is “Data is normally distributed”, so by
rejecting our null hypothesis we are saying there is evidence for the alternative (Data is not
normally distributed).
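A minimal sketch of the transformation step (log10() is used here because the back-transformation later in this lesson uses 10^x):

x_log <- log10(x)                    #base-10 log transform of the skewed data
plot(density(x_log), main = "Log Transformed Data")
shapiro.test(x_log)                  #should now be non-significant (p > .05)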
After log transforming the data with the code above and rerunning the Shapiro-Wilk test,
we get what appears to be a normal distribution! The data follows a bell shaped curve
and our p-value is not significant. If our statistical test requires a normal distribution, this
would be a very valid transformation to apply!
While we could report the mean of each log transformed group, it wouldn’t make much
sense! What exactly does the log transformed frog abundance mean anyway? We can
see there are MORE frogs in lower streams, but no intuitive understanding of how many
more.
This is where we would back transform the data.
To back transform, it really is as easy as reversing the transformation. For example, a
log transformation in R can be undone with the exp() function, as the log() function is a
natural log. In the table below, we do 10^x as our data transformation, as our log
transformation used base 10. In terms of abundance, this converts our mean
abundance to 17 frogs in upper streams and 21.8 frogs in lower streams. That result
makes much more sense!
#back transform our log base 10 transformed data. If done with natural log, you would use the exp() function
transformed_upper_mean <- 10^upper_mean
transformed_upper_CI_minimum <- 10^(upper_mean-upper_CI)
transformed_upper_CI_maximum <- 10^(upper_mean+upper_CI)
The code below shows the inverse of various data transformations. Note that
back-transformations are simply the reverse operations: to back transform square-root
transformed data you square the data, and to back transform squared data you take the
square root of the data.
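A minimal sketch of those inverses (assuming x is a vector of positive values):

x_log10 <- log10(x);  back_log10 <- 10^x_log10  #base-10 log and its inverse
x_ln    <- log(x);    back_ln    <- exp(x_ln)   #natural log and its inverse
x_sqrt  <- sqrt(x);   back_sqrt  <- x_sqrt^2    #square root and its inverse
x_sq    <- x^2;       back_sq    <- sqrt(x_sq)  #square and its inverse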
There are three questions you should ask before running any statistical test:
T-test
T-tests are statistical analyses that simply tell us if the differences between 2 means
are significant or not. With t-tests we could test for differences in mean limb length
between 2 rodent populations, daily temperature between 2 plots of land, or the time
spent foraging between a control and an experimental treatment.
Assumptions:
Homogeneity of variances: the variances of the groups being compared are equal
(for unpaired t-tests)
Unpaired T-test
In R, the t-test is run using the function t.test(). We’ll set paired = FALSE for the
unpaired t.test and paired = TRUE for the paired. The variable we put first
(body_mass_g1) will be considered the X variable, while the variable in the second
position (body_mass_g2) is considered Y.
set.seed(123456)
body_mass_g1 = rnorm(100,60,12)
body_mass_g2 = rnorm(100,45,12)
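t.test(body_mass_g1, body_mass_g2, paired = FALSE) #unpaired t-test; this call produces the output below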
##
## Welch Two Sample t-test
##
## data: body_mass_g1 and body_mass_g2
## t = 9.3705, df = 197.98, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 12.41382 19.03149
## sample estimates:
## mean of x mean of y
## 60.20184 44.47918
The results will show us the p-value, the confidence interval, and the means of each
group. The t-value and degrees of freedom are used to calculate the p-value.
In short, the t-value tells us how large the difference between the groups is relative to
the variation in the data. The larger its magnitude, the greater the difference between the
2 groups; the closer to 0, the smaller the difference. In conjunction with our degrees of freedom, R calculates
the p-value. Remember, we’ll use a cutoff of .05 for significance.
The confidence interval shows you the interval for the mean differences between the
samples. We can find the mean difference simply by subtracting one sample mean from
the other (in this case mean of X is 60 and mean of Y is 44). Remember, our null
hypothesis is that this difference would be equal to 0.
effsize::cohen.d(body_mass_g1,body_mass_g2)
## Cohen's d
##
## d estimate: 1.325185 (large)
## 95 percent confidence interval:
## lower upper
## 1.017207 1.633163
All in all we would interpret our results as such: “Given our low p-value we are able to
reject the null hypothesis and provide evidence for a significant difference between
mean body mass in the 2 groups. We find a large effect size provided by Cohen’s D
indicating a substantial difference between the two groups, with a mean mass of 60g for
group 1 and a mean mass of 44g for group 2. We find a mean difference of 16g with a
95% confidence interval of 12g-19g.”
Paired T-test
Now let’s do a paired t-test. Instead of body mass, we’ll measure mean foraging time of
Mule deer on 10 plots of land before and after the plots undergo a prescribed burn. As a
reminder, this is a paired t-test because we are measuring the same variable (foraging
time) on the same plots of land. Thus our test is dependent on those plots.
Our code is almost identical, however this time we have set paired = TRUE to indicate
we want to run a paired T-test.
set.seed(123456)
foraging_time_preburn = rnorm(100,10,2)
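#foraging_time_postburn is assumed to be a matching vector of post-burn foraging times (its simulation is not shown here)
t.test(foraging_time_preburn, foraging_time_postburn, paired = TRUE) #paired t-test; produces the output below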
##
## Paired t-test
##
## data: foraging_time_preburn and foraging_time_postburn
## t = -2.8538, df = 99, p-value = 0.005261
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -1.4910978 -0.2680168
## sample estimates:
## mean difference
## -0.8795573
Our results output contains largely the same information as our unpaired t-test: a
t-value, degrees of freedom, p-value, and confidence intervals. The only difference is that
we do not get the mean of each group. Rather, we get the mean difference between the
groups. We could of course get the mean for each group by running
mean(foraging_time_preburn) in R, but lets continue on with our effect size calculation.
effsize::cohen.d(foraging_time_postburn,foraging_time_preburn)
##
## Cohen's d
##
## d estimate: 0.4448012 (small)
## 95 percent confidence interval:
## lower upper
## 0.1624883 0.7271141
And now we’ve done a full t-test! We should take note of a few things though before we
make our interpretation. Our result is significant (the p-value is less than .05); however, our
effect size is considered small. Remember back to the first section of our course when
we talked about the difference between significant and meaningful results?
This is a case where you would need to approach these results cautiously. There is a
detectable difference with a sufficiently low p-value, but how meaningful is that
difference? Statistically, it’s a very small difference; any smaller and it may not even be detectable.
Correlation
Assumptions:
Linearity: the relationship between the two variables being correlated is linear
Homoscedasticity: the variance of the residuals is constant across all levels of the
predictor variable(s)
set.seed(123456)
riverspeed <- rnorm(100,7,1) #simulate river speeds (the same data is reused in the regression section below)
turtle_length <- riverspeed + rnorm(100,30,1) #turtle lengths that increase with river speed
cor.test(riverspeed,turtle_length)
##
## Pearson's product-moment correlation
##
## data: riverspeed and turtle_length
## t = 8.0591, df = 98, p-value = 1.897e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4964851 0.7364323
## sample estimates:
## cor
## 0.6313361
Our output provides us with the t-value, the degrees of freedom, and a p-value that we can
interpret in the same way as our t-test output. We then get a 95% confidence interval for
our correlation coefficient, which itself is shown directly below under cor. The correlation coefficient,
abbreviated as r, is what tells us how strong or weak our correlation is. i.e. it is a
measure of effect size! r can be positive or negative and is on a scale from -1 to 1. Its
distance away from 0 indicates the strength of measurable effect. e.g. a value of .2 is
considered a weak positive correlation while a value of .9 is a strong correlation. Of
course, this works for negative correlations as well with -.2 being a weak correlation and
-.9 being strong.
See this example below to better illustrate the effect size of correlations under 4 different
values. In a perfect correlation (1 or -1), all the points fall on the line of best fit. However,
with weaker and weaker correlations (closer to 0), the points fall farther and farther
away from the line of best fit. Ultimately, the line with slope of 0 indicates No correlation
between the two variables.
Linear Regression
Assumptions:
Homoscedasticity: the variance of the residuals is constant across all levels of the
predictor variable(s)
Null hypothesis: There is no relationship between X and Y. The slope of the line is 0.
Alternative hypothesis: There is a relationship between X and Y. The slope of the line is
not 0.
Linear models can be used to make predictions about the dependent variable based on
the value of the independent variable, and can also be used to test hypotheses about
the relationship between the variables.
In simple linear regressions, we are trying to calculate the equation of the line:
y = b0 + b1*X
set.seed(123456)
#Null Hypothesis: Slope is equal to 0
#Alternative Hypothesis: Slope is not equal to 0
riverspeed <- rnorm(100,7,1)
turtle_length <- riverspeed + rnorm(100,30,1)
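model <- lm(turtle_length ~ riverspeed) #fit the simple linear regression (this call is implied by the output below)
model #printing the model shows the call and coefficients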
##
## Call:
## lm(formula = turtle_length ~ riverspeed)
##
## Coefficients:
## (Intercept) riverspeed
## 31.4482 0.7874
summary(model)
##
## Call:
## lm(formula = turtle_length ~ riverspeed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.59387 -0.50706 -0.00978 0.63399 2.08698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.44816 0.69236 45.422 < 2e-16 ***
## riverspeed 0.78743 0.09771 8.059 1.9e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9659 on 98 degrees of freedom
## Multiple R-squared: 0.3986, Adjusted R-squared: 0.3924
## F-statistic: 64.95 on 1 and 98 DF, p-value: 1.897e-12
The first lines of the model summary show us the formula used. Then we have our
residuals. Residuals describe the difference between our observed and expected
points. If we remember back to our correlation section, if we have a perfect correlation
where all points fall on the line of best fit, our residuals would be 0, as there is no
difference between the expected and observed. However, as the relationship between the variables weakens, the points fall farther from the line and the residuals grow larger.
The coefficients section is the main section for understanding our model.
Mathematically, coefficients are values that estimate unknown parameters of a
population. In a simple linear regression with 1 independent and 1 dependent variable,
we will have 2 coefficients. Under the coefficients section we have each coefficient’s estimate as
well as its standard error, t-value, and p-value.
The first coefficient will be the intercept (a constant value) that simply tells us where the
line of best fit intersects on the y axis. Here we see our intercept is at 31.448 and it is
significant! We’ll expand on this in just a second.
The second coefficient is the slope of the data, which tells us how many units our
dependent variable changes for every 1 unit change in our independent variable. In this
example, turtle length increases by about 0.79 units for every 1 unit increase in river speed.
plot(riverspeed,turtle_length)
abline(lm(turtle_length ~ riverspeed))
Let’s first talk about our intercept. The intercept is simply where the line crosses the y
axis. In many biological cases, it will be significant EVEN IF THE INDEPENDENT
VARIABLES DO NOT AFFECT THE MODEL. This is critical when understanding our
models. The null hypothesis for this test is that the intercept does not differ from 0. Of
course, it makes biological sense that our intercept would not be 0. After all, our
turtle measurements are between ~25-35. None of them are near the 0 line. In many
cases, you would interpret the intercept in terms of what makes sense biologically.
Now we can clearly see that when X is equal to 0, turtle length would be estimated at
around 31.4.
Let’s also revisit our estimate of B1, the slope of our regression! Here the p-value is
significant, telling us that our independent variable has a significant effect on our
dependent variable. This directly relates back to our hypotheses, where our null
hypothesis (the slope is 0) is rejected in favor of the alternative.
Now let’s talk about the effect size. We would use the correlation coefficient (r) and R^2
to describe the effect size. We already covered r during correlation, but what is R^2?
cor(riverspeed,turtle_length)^2
## [1] 0.3985853
Simply put, it is r squared! This results in a value from 0-1 that quantifies the % of
variation in our dependent variable that can be explained by our independent variable.
Here our R^2 of .39, can be interpreted as 39% of the variation in turtle size can be
explained by water speed. Generally speaking R^2 values under .5 are considered
weak, between .5 and .7 are considered moderate, and .7-1 are considered strong
effect sizes.
We can also see this information in the last section of our model output (including an
adjusted R-squared, which accounts for the number of predictors in the model). This is
also where we can find a p-value for the overall model. In a simple linear regression, this
is typically the same as the slope’s p-value, but for models with more than 1 variable this will be different.
ANOVA
Assumptions:
Normality: the data is normally distributed within each group being compared
Independence: the observations are independent of one another within each group
being compared
set.seed(123456)
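# a minimal sketch of the missing group setup (the response name "mass" and the group means are assumptions for illustration)
g1 <- data.frame(mass = rnorm(50, 30, 10), habitat = "Forest")
g2 <- data.frame(mass = rnorm(50, 38, 10), habitat = "Grassland")
g3 <- data.frame(mass = rnorm(50, 45, 10), habitat = "Wetland")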
df <- rbind(g1,g2,g3)
Let’s first make a boxplot to understand what our data looks like. Notice that our boxplot
function uses the same formula as a linear model.
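A minimal sketch (the column names follow the setup sketched above):

boxplot(mass ~ habitat, data = df) #same formula syntax as a linear model: response ~ group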
From this data it’s a bit unclear if there are any significant differences. Let’s run our
ANOVA to get a better idea of the data. Here we use the function aov() and input our
variables similarly to our linear models in the last section. We’ll also use summary() to
get a full idea of our data.
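anova <- aov(mass ~ habitat, data = df) #fit the ANOVA (variable names follow the sketch above)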
summary(anova)
set.seed(123456)
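# a minimal sketch of the second setup: every habitat now shares the same mean, so there are no real group differences
g1 <- data.frame(mass = rnorm(50, 35, 14), habitat = "Forest")
g2 <- data.frame(mass = rnorm(50, 35, 14), habitat = "Grassland")
g3 <- data.frame(mass = rnorm(50, 35, 14), habitat = "Wetland")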
df <- rbind(g1,g2,g3)
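anova <- aov(mass ~ habitat, data = df) #refit the ANOVA on the no-difference data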
summary(anova)
Notice how different the sum of squares is in this example. For the grouping variable
habitat, which again now has no real differences between the groups, the sum of
squares is only 117, compared to the residuals’ 28,899! An extreme discrepancy! In this
case, the between-group variation is tiny relative to the within-group variation, and the p-value is not significant.
Back in our first analysis, however, we had a significant p-value! Meaning that there are
significant differences between groups in our data! Hooray! But let’s go back to our hypotheses:
Null: There are no differences in the means between groups
Alternative: There are differences in the means between groups
These hypotheses are simply stating that there are or are not differences in the means
between groups, it does not specify which groups have differences nor does it
specify how those differences appear.
That’s why we need a post-hoc test. A post-hoc test literally means a test you run after
the fact. Assuming that there is an overall significant difference in the ANOVA, we can
run Tukey’s post-hoc test to see what significant differences there are between groups.
This is easy enough with the TukeyHSD() function.
#summary(anova)
TukeyHSD(anova)
This test will run statistical tests for each pairing of groups in our ANOVA. Its output
gives us the differences in the means between the 2 groups, the lower bounds, the
upper bounds, and a p-value for each comparison. Note that 0 means the p value is
extremely small (.00000000000001 for example), and thus we could reject the null at
our .05 cutoff.
Chi-squared test
Assumptions:
Both variables are categorical – The numbers utilized are the counts of those
variable pairs (e.g. Juvenile birds at site 1, adult birds at site 2, etc.)
Mutually exclusive cells – This means that every observation is counted in exactly one cell.
When running a chi-squared test over these data, our hypotheses are:
Null: There is no difference between the expected and observed counts
Alternative: There is a difference in the expected and observed counts
In our example, we create a simple contingency table around birds that were
categorized according to two variables: plumage color and sex. Each individual is coded as either Green or Red, and as either male or female.
set.seed(123456)
#Null: The observed counts do not significantly differ from expected
#Alternative: The observed counts do significantly differ from expected
g2 <- c(rep("MALE",500),rep("FEMALE",500))
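# a minimal sketch of the missing setup: plumage color depends slightly on sex here, and these
# probabilities are assumptions chosen to roughly match the table below
g1 <- c(sample(c("Green","Red"), 500, replace = TRUE, prob = c(0.63, 0.37)),  # the 500 males
        sample(c("Green","Red"), 500, replace = TRUE, prob = c(0.75, 0.25)))  # the 500 females
df <- data.frame(g1, g2)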
table(df)
## g2
## g1 FEMALE MALE
## Green 376 314
## Red 124 186
We use the table() function to create our contingency table based on our data. We can
see that our data set has more Green individuals (376+314 = 690) than red individuals
(124+186 = 310). Additionally, we have equivalent numbers of Female (376+124 = 500)
and Male (314+186 = 500) birds. Now the question is, are there any significant
differences between these groups? In essence, do we have any combinations of color
and sex that are observed significantly more or less than we expect?
We’ll first use chisq.test() to get at this question.
chisq.test(table(df)) #R default (applies Yates' continuity correction for 2x2 tables)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(df)
## X-squared = 17.396, df = 1, p-value = 3.035e-05
#Lets use CrossTable from the gmodels package to get a better understanding of our data
#This function does not require a contingency table, we can supply the data directly
#We will want to set expected = TRUE. This does 2 things:
# 1) It shows us the expected values from the dataset. Useful for interpretation!
# 2) It runs the ChiSq test. We can also set chisq = TRUE if we want instead
#We then want to set sresid = TRUE. This will show us the standardized residuals as a Z-score, allowing us to tell what groups are significant. We use the cutoff +-1.96
#Then we add format = "SPSS". This formats in a way that we can actually see the residuals
#install.packages("gmodels") #Needed if you haven't installed the gmodels package yet
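# the call itself, implied by the comments above and the output below
library(gmodels)
CrossTable(df$g1, df$g2, expected = TRUE, sresid = TRUE, format = "SPSS")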
##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## | Std Residual |
## |-------------------------|
##
## Total Observations in Table: 1000
##
## | df$g2
## df$g1 | FEMALE | MALE | Row Total |
## -------------|-----------|-----------|-----------|
## Green | 376 | 314 | 690 |
## | 345.000 | 345.000 | |
## | 2.786 | 2.786 | |
## | 54.493% | 45.507% | 69.000% |
The output is quite large but we can break it down step by step. Each cell in the table
corresponds to one of the variable pairs (e.g. Male-Green, Female-Red). At the top of
our output we have the key for the cell contents. We can see the observed and
expected counts of our dataset (again, this is what Chi-Squared is testing).
Then we have the percent contribution for rows, columns, and the total dataset. e.g. If
there were 10 observations in a row of 20 total observations, its row contribution would
be 10/20 or 50%!
Then we have our standardized residuals. Remember back to our sections early on in
the course about probability distributions and standard deviations. We said that values
falling over 1.96 (rounded to 2) or under -1.96 (rounded to -2) are considered
statistically significant at a confidence level of 95%. These standardized residuals are
based directly on this cutoff and we can use them to determine significant groups!
Based on the results of our Chi-Squared test (X^2 = 17.396; P <.0000001) we detect a
significant association between bird coloration and their sex. We are able to reject our
null hypothesis as we have evidence for the alternative that expected counts
significantly differ from the observed. Using our standardized residuals we find no
significant difference between green coloration and the sex of the bird. In essence,
Green birds are roughly equally likely to be female (SRES = 1.669) or male (SRES = -1.669).
However, Red birds are significantly less likely to be females (SRES = -2.49) and vice
versa more likely to be male (SRES = 2.49). There may be some type of sexual
selection or other behavioral differences between the sexes that creates this pattern,
however the exact reasons are beyond the scope of this analysis.
Bartlett Test - A statistical test used to assess the homogeneity of
variances across different groups or levels of an independent variable. It tests the null
hypothesis that variances are equal between groups.
Bimodal Distribution - A distribution with two distinct peaks or modes.
Binomial Distribution - A probability distribution modeling binary outcomes like yes-no
or success-fail for discrete data.
Categorical Variable - A variable that represents categories or groups.
Central Limit Theorem - A statistical theory stating that the distribution of sample
means approximates a normal distribution.
Chi-Square Test - A statistical test used to analyze contingency tables and assess
associations between categorical variables.
Coefficient of Variation (CV) - A relative measure of variability calculated as the ratio
of the standard deviation to the mean.
Coefficients - Values that estimate unknown parameters of a population in a regression
model.
Cohen's d - Effect size measure indicating the difference between two means.
Confidence Interval (CI) - A confidence interval (CI) is a range of values that is likely to
contain the true population parameter with a specified level of confidence. Common