LABORATORY 2: 18-02-2021
Contents
1 Introduction to One-way Analysis of Variance
1.1 Definition
1.2 Assumptions
1.3 Example
1 Introduction to One-way Analysis of Variance
1.1 Definition
The one-way analysis of variance (ANOVA) is used to compare several groups of a quantitative variable. Specifically, it is a way to test whether there are any statistically significant differences between the means of three or more independent groups. It tests the following null hypothesis:
𝐻0 ∶ 𝜇1 = 𝜇2 = 𝜇3 = ... = 𝜇𝑎
1.2 Assumptions
In order to use the ANOVA test statistic, three main assumptions must be satisfied:
• The observations are independent.
• The errors are normally distributed.
• The variances are equal across groups (homoscedasticity).
These assumptions should be checked after fitting the ANOVA model, preferably using the errors (residuals) of the model.
• There are two sources of variation between the N observations obtained in the experiment: (1) the variation between treatments and (2) the experimental error (the variation within treatments).
• The relative size of (1) and (2) is used to decide whether the observed differences are real: the treatment effect is said to be real if the treatment variation is clearly greater than the experimental error.
• A statistical model is a symbolic expression, in the form of an equation, used in all experimental designs (e.g. ANOVA, regression) to indicate the different factors that modify the response variable. The model associated with the data can be written as:
𝑌𝑖𝑗 = 𝜇𝑖 + 𝜖𝑖𝑗
or
𝑌𝑖𝑗 = 𝜇 + 𝛼𝑖 + 𝜖𝑖𝑗
where 𝜇𝑖 is the mean of group 𝑖, 𝜇 is the overall mean, 𝛼𝑖 = 𝜇𝑖 − 𝜇 is the effect of treatment 𝑖, and 𝜖𝑖𝑗 is the random error of observation 𝑗 in group 𝑖.
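To make the effects model concrete, it can be simulated; the following is a minimal sketch in which the overall mean, the treatment effects and the error standard deviation are all made-up illustrative values:

```r
set.seed(1)
mu    <- 5                  # overall mean (hypothetical value)
alpha <- c(0.5, -0.8, 0.3)  # treatment effects, one per group (hypothetical)
n     <- 10                 # observations per group

# Y_ij = mu + alpha_i + eps_ij, with eps_ij ~ N(0, sd = 1)
Y     <- unlist(lapply(alpha, function(a) mu + a + rnorm(n, sd = 1)))
group <- factor(rep(paste0("group", 1:3), each = n))

head(data.frame(group, Y))
```

Each simulated observation is its group mean 𝜇 + 𝛼𝑖 plus random noise, which is exactly what the ANOVA decomposition tries to separate.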
1.3 Example
fertilizer1<-c(6.27,5.36,6.39,4.85,5.99,7.14,5.08,4.07,4.35,4.95)
fertilizer2<-c(3.07,3.29,4.04,4.19,3.41,3.75,4.87,3.94,6.28,3.15)
fertilizer3<-c(4.04,3.79,4.56,4.55,4.55,4.53,3.53,3.71,7.00,4.61)
data<-data.frame(fertilizer1,fertilizer2,fertilizer3)
data
The data represent the pumpkin production of plots that received treatments with different fertilizers. The question we ask here is: does the fertilizer have an effect on pumpkin production?
To answer this question, we need to test the equal-means hypothesis, and for that we have to calculate the F-statistic.
Propose manually the model that this experiment represents:
Model proposed:
𝑌𝑖𝑗 = 𝜇 + 𝛼𝑖 + 𝜖𝑖𝑗
where 𝑌𝑖𝑗 is the pumpkin production of plot 𝑗 under fertilizer 𝑖 and 𝛼𝑖 is the effect of fertilizer 𝑖.
First of all, we need to calculate 𝑆𝑆𝑡𝑜𝑡𝑎𝑙 , 𝑆𝑆𝑏𝑒𝑡𝑤𝑒𝑒𝑛 and 𝑆𝑆𝑤𝑖𝑡ℎ𝑖𝑛 . Use the following formulas to get these numbers.

$$SS_{total} = \sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^{a}\sum_{j=1}^{n} y_{ij}^2 - an\,\bar{y}_{..}^2$$

$$SS_{between} = \sum_{i=1}^{a}\sum_{j=1}^{n}(\bar{y}_{i.} - \bar{y}_{..})^2 = n\sum_{i=1}^{a}\bar{y}_{i.}^2 - an\,\bar{y}_{..}^2$$

$$SS_{within} = SS_{total} - SS_{between}$$
n1<-length(fertilizer1)
n2<-length(fertilizer2)
n3<-length(fertilizer3)
# group means and overall mean (needed in the formulas below)
mean_f1<-mean(fertilizer1)
mean_f2<-mean(fertilizer2)
mean_f3<-mean(fertilizer3)
mean_all<-mean(c(fertilizer1,fertilizer2,fertilizer3))
Now, using the previous formulas, calculate the SST, SSB and SSW.
# SS
SST<-sum(sum(fertilizer1^2),sum(fertilizer2^2),sum(fertilizer3^2)) - 3*n1*(mean_all^2)
SST
## [1] 36.4449
SSB<-(n1*sum((mean_f1^2),(mean_f2^2),(mean_f3^2))) - 3*n1*(mean_all^2)
SSB
## [1] 10.82275
SSW<-SST-SSB
SSW
## [1] 25.62215
# degrees of freedom
df_SST<-(3*10)-1
df_SST
## [1] 29
df_SSB<-3-1
df_SSB
## [1] 2
df_SSW<-(3*10)-3
df_SSW
## [1] 27
$$MS_{between} = \frac{SS_{between}}{df_{between}}$$

$$MS_{within} = \frac{SS_{within}}{df_{within}}$$
Calculate 𝑀 𝑆𝑏𝑒𝑡𝑤𝑒𝑒𝑛 and 𝑀 𝑆𝑤𝑖𝑡ℎ𝑖𝑛 and fill in Table 1 (see ending of file).
# mean squares
MSB<-SSB/df_SSB
MSB
## [1] 5.411373
MSW<-SSW/df_SSW
MSW
## [1] 0.9489685
You can think of the F-statistic as a measure of how far we are from the null hypothesis: the bigger F is, the more relevant the effect of the fertilizer. The F-statistic is given by the following formula:

$$F = \frac{MS_{between}}{MS_{within}}$$
# F-statistic (note: assigning to F masks the built-in alias for FALSE)
F<-MSB/MSW
F # the F-statistic
## [1] 5.702374
Once the F-statistic is calculated, we need to know whether it is a significant result or not. Under the equal-means hypothesis, we can compute the probability of obtaining a certain value of F or higher.
If you choose a significance level of e.g. 𝛼 = 0.05 and obtain a p-value lower than 0.05, you will reject the null hypothesis of mean equality and conclude that not all compared means are equal. We will not know which of the group means is different; we just know that not all compared means are equal. In order to know which of the group means differ, we will use a post-hoc test. This will be explained in the next section.
Which p-value did you obtain? What can you conclude?
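The p-value shown below is the upper-tail probability of the F distribution with (df_SSB, df_SSW) degrees of freedom; a sketch, plugging in the values computed above:

```r
F_value <- 5.702374  # F-statistic computed above
df_SSB  <- 2         # between-groups degrees of freedom
df_SSW  <- 27        # within-groups degrees of freedom

# P(F >= F_value) under the null hypothesis
p_val <- pf(F_value, df_SSB, df_SSW, lower.tail = FALSE)
p_val
```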
## [1] 0.008594377
## p_val < 0.05, so we reject the null hypothesis: not all group means are equal.
3 The ANOVA R function
So far we’ve tried to understand how the one-factor analysis of variance works and how we can calculate it manually. From now on, you can use an R function to be faster when you analyse your own data!
First, we need a special format of the data to work with R, called long-format data. We will use the library “tidyr” to re-format our data:
library("tidyr")
fertilizer_data<-gather(data,fertilizer,production,fertilizer1:fertilizer3)
fertilizer_data
## fertilizer production
## 1 fertilizer1 6.27
## 2 fertilizer1 5.36
## 3 fertilizer1 6.39
## 4 fertilizer1 4.85
## 5 fertilizer1 5.99
## 6 fertilizer1 7.14
## 7 fertilizer1 5.08
## 8 fertilizer1 4.07
## 9 fertilizer1 4.35
## 10 fertilizer1 4.95
## 11 fertilizer2 3.07
## 12 fertilizer2 3.29
## 13 fertilizer2 4.04
## 14 fertilizer2 4.19
## 15 fertilizer2 3.41
## 16 fertilizer2 3.75
## 17 fertilizer2 4.87
## 18 fertilizer2 3.94
## 19 fertilizer2 6.28
## 20 fertilizer2 3.15
## 21 fertilizer3 4.04
## 22 fertilizer3 3.79
## 23 fertilizer3 4.56
## 24 fertilizer3 4.55
## 25 fertilizer3 4.55
## 26 fertilizer3 4.53
## 27 fertilizer3 3.53
## 28 fertilizer3 3.71
## 29 fertilizer3 7.00
## 30 fertilizer3 4.61
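As an aside, gather() is superseded in current versions of tidyr; the same long format can be produced with pivot_longer() (assuming tidyr >= 1.0 is installed — note the row order differs from gather(), but the content is identical):

```r
library("tidyr")

fertilizer1<-c(6.27,5.36,6.39,4.85,5.99,7.14,5.08,4.07,4.35,4.95)
fertilizer2<-c(3.07,3.29,4.04,4.19,3.41,3.75,4.87,3.94,6.28,3.15)
fertilizer3<-c(4.04,3.79,4.56,4.55,4.55,4.53,3.53,3.71,7.00,4.61)
data<-data.frame(fertilizer1,fertilizer2,fertilizer3)

# one row per observation, with explicit names for the two new columns
fertilizer_data <- pivot_longer(data,
                                cols      = fertilizer1:fertilizer3,
                                names_to  = "fertilizer",
                                values_to = "production")
fertilizer_data
```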
Once we have the data in the right format, we will use the R function aov(). Let’s see if we completed the ANOVA summary table correctly. Be inspired by the following example:
#EXAMPLE:
# anova.res<-aov(Y~factor,data = data.frame.name)
#
anova_out<-aov(production~fertilizer,fertilizer_data)
# production represents the column with numeric variables
# fertilizer represents the factor
# fertilizer_data is the name of the data frame where you will find the columns
# named production and fertilizer
anova_out
## Call:
## aov(formula = production ~ fertilizer, data = fertilizer_data)
##
## Terms:
## fertilizer Residuals
## Sum of Squares 10.82275 25.62215
## Deg. of Freedom 2 27
##
## Residual standard error: 0.9741502
## Estimated effects may be unbalanced
The output of the aov() function is a table where you will find, in the first row, the 𝑆𝑆𝑏𝑒𝑡𝑤𝑒𝑒𝑛 (first column) and 𝑆𝑆𝑤𝑖𝑡ℎ𝑖𝑛 (also called Residuals, second column). In the second row you’ll find the degrees of freedom of 𝑆𝑆𝑏𝑒𝑡𝑤𝑒𝑒𝑛 (a − 1) and 𝑆𝑆𝑤𝑖𝑡ℎ𝑖𝑛 (an − a). You probably noticed that neither the F-statistic nor the p-value is given in the output. We need to use the function summary() to get both.
summary(anova_out)
library("ggpubr")
ggline(fertilizer_data, x = "fertilizer", y = "production",
add = c("mean_se", "jitter"),
order = c("fertilizer1", "fertilizer2", "fertilizer3"),
ylab = "production", xlab = "fertilizer")
[Figure: line plot of mean production (± SE) per fertilizer group, with jittered individual observations; x axis: fertilizer, y axis: production]
Once a significant result is obtained with the one-way ANOVA, a verification of the assumptions of the model must be made as a diagnosis, since otherwise the conclusions we reach may be incorrect. There are two ways to do it, graphically and through hypothesis testing, and we will always work with the model errors (residuals).
A better graphical way to look at data normality is to perform a QQ plot. A histogram shows the frequencies
of different values in the variable (counts). Depending on how the histogram looks it can be misleading. It’s
better to use the QQ plot. A Q-Q plot shows the mapping between the distribution of the data and the ideal
distribution (the normal distribution in this case). Usually a line is plotted through the quartiles. When the
dots follow the line closely, the data has a normal distribution.
In this case we can use the residuals of the anova_out object obtained previously, which contain the model errors:
#errors: Yi - model_i
residuals(anova_out)
## 1 2 3 4 5 6 7 8 9 10 11
## 0.825 -0.085 0.945 -0.595 0.545 1.695 -0.365 -1.375 -1.095 -0.495 -0.929
## 12 13 14 15 16 17 18 19 20 21 22
## -0.709 0.041 0.191 -0.589 -0.249 0.871 -0.059 2.281 -0.849 -0.447 -0.697
## 23 24 25 26 27 28 29 30
## 0.073 0.063 0.063 0.043 -0.957 -0.777 2.513 0.123
# make a QQ plot
qqnorm(residuals(anova_out), main = "QQ plot")
# add a QQ line
qqline(residuals(anova_out), col=2)
[Figure: QQ plot of the residuals (sample quantiles vs. theoretical quantiles) with a reference line through the quartiles]
The graphical methods for checking data normality in R still leave much to your own interpretation. There’s
much discussion in the statistical world about the meaning of these plots and what can be seen as normal.
If you show any of these plots to ten different statisticians, you can get ten different answers. That’s quite
an achievement when you expect a simple yes or no, but statisticians don’t do simple answers.
On the contrary, everything in statistics revolves around measuring uncertainty. This uncertainty is sum-
marized in a probability — often called a p-value — and to calculate this probability, you need a formal
test.
Probably the most widely used test for normality is the Kolmogorov-Smirnov test; another is the Shapiro-Wilk test. The function to perform the former is called ks.test().
Using the Kolmogorov-Smirnov test, check if the errors are normally distributed.
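A sketch of the call, assuming we compare the residuals against a normal distribution with their own estimated mean and standard deviation (strictly speaking, estimating these parameters from the same data makes the KS p-value only approximate; shapiro.test() avoids this issue):

```r
fertilizer1<-c(6.27,5.36,6.39,4.85,5.99,7.14,5.08,4.07,4.35,4.95)
fertilizer2<-c(3.07,3.29,4.04,4.19,3.41,3.75,4.87,3.94,6.28,3.15)
fertilizer3<-c(4.04,3.79,4.56,4.55,4.55,4.53,3.53,3.71,7.00,4.61)
production <- c(fertilizer1, fertilizer2, fertilizer3)
fertilizer <- factor(rep(c("fertilizer1","fertilizer2","fertilizer3"), each = 10))

anova_out <- aov(production ~ fertilizer)
res <- residuals(anova_out)

# one-sample KS test of the residuals against N(mean(res), sd(res))
# (ks.test will warn about ties in the residuals; the test still runs)
ks_out <- ks.test(res, "pnorm", mean = mean(res), sd = sd(res))
ks_out
```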
##
## One-sample Kolmogorov-Smirnov test
##
## data: residuals(anova_out)
## D = 0.18616, p-value = 0.2496
## alternative hypothesis: two-sided
If the variances are not similar between the groups and the variable is not normally distributed, we cannot use ANOVA directly, and this can affect the conclusions obtained. Are the variances equal between factor levels (homoscedasticity)?
To check this we can apply the Bartlett test.
# Example: bartlett.test(response ~ factor, data = data.frame.name)
bartlett.test(production~fertilizer,fertilizer_data)
##
## Bartlett test of homogeneity of variances
##
## data: production by fertilizer
## Bartlett's K-squared = 0.00017048, df = 2, p-value = 0.9999
Remember what to do when the model errors are not normal and/or not homoscedastic? We try to transform the data so that it fits a normal distribution. We will apply the log10 transformation to our data: use the log10() function and calculate the ANOVA of the newly transformed data (OPTIONAL!). In the case of heteroscedasticity (or persistent non-normality) we can apply an alternative to ANOVA, like the Kruskal–Wallis one-way analysis of variance.
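A sketch of both options (the log10 re-analysis and the Kruskal–Wallis alternative), rebuilt from the fertilizer data above:

```r
fertilizer1<-c(6.27,5.36,6.39,4.85,5.99,7.14,5.08,4.07,4.35,4.95)
fertilizer2<-c(3.07,3.29,4.04,4.19,3.41,3.75,4.87,3.94,6.28,3.15)
fertilizer3<-c(4.04,3.79,4.56,4.55,4.55,4.53,3.53,3.71,7.00,4.61)
production <- c(fertilizer1, fertilizer2, fertilizer3)
fertilizer <- factor(rep(c("fertilizer1","fertilizer2","fertilizer3"), each = 10))

# option 1: log10-transform the response and re-run the ANOVA
anova_log <- aov(log10(production) ~ fertilizer)
summary(anova_log)

# option 2: rank-based alternative that does not assume normality
kw_out <- kruskal.test(production ~ fertilizer)
kw_out
```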
# post-hoc: pairwise t-tests with Bonferroni correction
pairwise.t.test(fertilizer_data$production,fertilizer_data$fertilizer,p.adjust.method = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: fertilizer_data$production and fertilizer_data$fertilizer
##
## fertilizer1 fertilizer2
## fertilizer2 0.0078 -
## fertilizer3 0.1099 0.8175
##
## P value adjustment method: bonferroni
Assess normality and homogeneity of variance using a box plot of species diversity against zinc group. Test the 𝐻0 that the population group means are all equal, and perform the analysis of variance of species diversity versus zinc-level groups. What is your conclusion?
Perform a post-hoc test to investigate the pairwise mean differences between all groups.
Is there only a difference in population means between the HIGH and LOW zinc groups? Explain.
Check the normality and homoscedasticity of the previous model and, in case of abnormalities, propose a solution.
We will build confidence intervals for the three pairwise differences:
𝜇𝑓𝑒𝑟𝑡𝑖𝑙𝑖𝑧𝑒𝑟1 − 𝜇𝑓𝑒𝑟𝑡𝑖𝑙𝑖𝑧𝑒𝑟2
𝜇𝑓𝑒𝑟𝑡𝑖𝑙𝑖𝑧𝑒𝑟1 − 𝜇𝑓𝑒𝑟𝑡𝑖𝑙𝑖𝑧𝑒𝑟3
𝜇𝑓𝑒𝑟𝑡𝑖𝑙𝑖𝑧𝑒𝑟2 − 𝜇𝑓𝑒𝑟𝑡𝑖𝑙𝑖𝑧𝑒𝑟3
The confidence interval for each pairwise difference is

$$(\bar{y}_i - \bar{y}_j) \pm t_{\alpha/2} \times SE(\bar{y}_i - \bar{y}_j)$$

where $\bar{y}_i$ is the mean of group 𝑖, 𝑡𝛼/2 is the t-value for a given confidence level 1 − 𝛼 (usually 0.95), and SE is the standard error of the difference of means. Remember to adjust the type I error for multiple testing (0.05/3/2 = 0.008) and to use the appropriate SE of the difference of means:

$$SE(\bar{y}_i - \bar{y}_j) = \sqrt{MS_{within} \times \left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

where 𝑛𝑖 and 𝑛𝑗 are the sample sizes of the respective groups 𝑖 and 𝑗.
Important! When calculating 𝑡𝛼/2 , the degrees of freedom are those of 𝑀𝑆𝑤𝑖𝑡ℎ𝑖𝑛 , which are (𝑎𝑛 − 𝑎).
Can you calculate the CI for each of the pairwise comparisons? To get the t-value you can use the R function
qt(p=, df=), see the following example:
# qt(0.975, df = an - a) would be the unadjusted t-value
t<-qt(1-0.05/2/3,df=10*3-3)
# where t is the value from the t distribution for N - a degrees of freedom
# (df = n*a - a) and a 1 - alpha/(2*a) confidence level
# a = number of groups (here also the number of pairwise comparisons)
high_CI_f1f2<-(mean_f1-mean_f2)+(t*sqrt(MSW*(1/10+1/10)))
low_CI_f1f2<-(mean_f1-mean_f2)-(t*sqrt(MSW*(1/10+1/10)))
c(low_CI_f1f2,high_CI_f1f2)
high_CI_f1f3<-(mean_f1-mean_f3)+(t*sqrt(MSW*(1/10+1/10)))
low_CI_f1f3<-(mean_f1-mean_f3)-(t*sqrt(MSW*(1/10+1/10)))
c(low_CI_f1f3,high_CI_f1f3)
high_CI_f2f3<-(mean_f2-mean_f3)+(t*sqrt(MSW*(1/10+1/10)))
low_CI_f2f3<-(mean_f2-mean_f3)-(t*sqrt(MSW*(1/10+1/10)))
c(low_CI_f2f3,high_CI_f2f3)
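As a cross-check on the manual intervals, TukeyHSD() produces simultaneous confidence intervals for the same three comparisons (it uses the studentized range rather than a Bonferroni t-value, so the intervals will be similar but not identical):

```r
fertilizer1<-c(6.27,5.36,6.39,4.85,5.99,7.14,5.08,4.07,4.35,4.95)
fertilizer2<-c(3.07,3.29,4.04,4.19,3.41,3.75,4.87,3.94,6.28,3.15)
fertilizer3<-c(4.04,3.79,4.56,4.55,4.55,4.53,3.53,3.71,7.00,4.61)
production <- c(fertilizer1, fertilizer2, fertilizer3)
fertilizer <- factor(rep(c("fertilizer1","fertilizer2","fertilizer3"), each = 10))

anova_out <- aov(production ~ fertilizer)
tk <- TukeyHSD(anova_out, conf.level = 0.95)
tk  # columns: difference, lower and upper CI bound, adjusted p-value
```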