LABORATORY 2: 18-02-2021
Contents
1 Introduction to One-way Analysis of Variance
1.1 Definition
1.2 Assumptions
1.3 Example
1 Introduction to One-way Analysis of Variance
1.1 Definition
The one-way analysis of variance (ANOVA) is used to compare several groups of a quantitative variable. Specifically, it is a way to test whether there are any statistically significant differences between the means of three or more independent groups. It tests the following null hypothesis:
𝐻0 ∶ 𝜇1 = 𝜇2 = 𝜇3 = ... = 𝜇𝑎
1.2 Assumptions
In order to use the ANOVA test statistic, three main assumptions must be satisfied:
• The observations are independent.
• The errors are normally distributed.
• The variances are equal across groups (homoscedasticity).
These assumptions should be checked after fitting the ANOVA model, preferably using the errors (residuals) of the model.
• There are two sources of variation between the N observations obtained in the experiment: (1) the variation between treatments and (2) the experimental error (the variation within treatments).
• The relative size of (1) and (2) is used to decide whether the observed differences are real: the treatment effect is said to be real if the treatment variation is clearly greater than the experimental error.
• A statistical model is a symbolic expression, in the form of an equation, used in all experimental designs (e.g. ANOVA, regression) to indicate the different factors that modify the response variable. The model associated with the data can be written as:
𝑌𝑖𝑗 = 𝜇𝑖 + 𝜖𝑖𝑗
or
𝑌𝑖𝑗 = 𝜇 + 𝛼𝑖 + 𝜖𝑖𝑗
where 𝜇𝑖 is the mean of group 𝑖, 𝜇 is the overall mean, 𝛼𝑖 = 𝜇𝑖 − 𝜇 is the effect of treatment 𝑖, and 𝜖𝑖𝑗 is the random error of observation 𝑗 in group 𝑖.
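To make the effects model concrete, it can be simulated; the following is a minimal sketch in which the overall mean, the treatment effects and the error standard deviation are all made-up illustrative values:

```r
set.seed(1)
mu    <- 5                  # overall mean (hypothetical value)
alpha <- c(0.5, -0.8, 0.3)  # treatment effects, one per group (hypothetical)
n     <- 10                 # observations per group

# Y_ij = mu + alpha_i + eps_ij, with eps_ij ~ N(0, sd = 1)
Y     <- unlist(lapply(alpha, function(a) mu + a + rnorm(n, sd = 1)))
group <- factor(rep(paste0("group", 1:3), each = n))

head(data.frame(group, Y))
```

Each simulated observation is its group mean 𝜇 + 𝛼𝑖 plus random noise, which is exactly what the ANOVA decomposition tries to separate.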
1.3 Example
fertilizer1<-c(6.27,5.36,6.39,4.85,5.99,7.14,5.08,4.07,4.35,4.95)
fertilizer2<-c(3.07,3.29,4.04,4.19,3.41,3.75,4.87,3.94,6.28,3.15)
fertilizer3<-c(4.04,3.79,4.56,4.55,4.55,4.53,3.53,3.71,7.00,4.61)
data<-data.frame(fertilizer1,fertilizer2,fertilizer3)
data
The data represent the pumpkin production of plots that received treatments with different fertilizers. The question we ask here is: does the fertilizer have an effect on pumpkin production?
To answer this question, we need to test the equal-means hypothesis, and for that we have to calculate the F-statistic.
Propose manually the model that this experiment represents:
Model proposed:
𝑌𝑖𝑗 = 𝜇 + 𝛼𝑖 + 𝜖𝑖𝑗
where 𝑌𝑖𝑗 is the pumpkin production of plot 𝑗 under fertilizer 𝑖 and 𝛼𝑖 is the effect of fertilizer 𝑖.
First of all, we need to calculate 𝑆𝑆𝑡𝑜𝑡𝑎𝑙 , 𝑆𝑆𝑏𝑒𝑡𝑤𝑒𝑒𝑛 and 𝑆𝑆𝑤𝑖𝑡ℎ𝑖𝑛 . Use the following formulas to get these numbers.

$$SS_{total} = \sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^{a}\sum_{j=1}^{n} y_{ij}^2 - an\,\bar{y}_{..}^2$$

$$SS_{between} = \sum_{i=1}^{a}\sum_{j=1}^{n}(\bar{y}_{i.} - \bar{y}_{..})^2 = n\sum_{i=1}^{a}\bar{y}_{i.}^2 - an\,\bar{y}_{..}^2$$

$$SS_{within} = SS_{total} - SS_{between}$$
n1<-length(fertilizer1)
n2<-length(fertilizer2)
n3<-length(fertilizer3)
# group means and overall mean (needed in the formulas below)
mean_f1<-mean(fertilizer1)
mean_f2<-mean(fertilizer2)
mean_f3<-mean(fertilizer3)
mean_all<-mean(c(fertilizer1,fertilizer2,fertilizer3))
Now, using the previous formulas, calculate the SST, SSB and SSW.
# SS
SST<-sum(sum(fertilizer1^2),sum(fertilizer2^2),sum(fertilizer3^2)) - 3*n1*(mean_all^2)
SST
## [1] 36.4449
SSB<-(n1*sum((mean_f1^2),(mean_f2^2),(mean_f3^2))) - 3*n1*(mean_all^2)
SSB
## [1] 10.82275
SSW<-SST-SSB
SSW
## [1] 25.62215
# degrees of freedom
df_SST<-(3*10)-1
df_SST
## [1] 29
df_SSB<-3-1
df_SSB
## [1] 2
df_SSW<-(3*10)-3
df_SSW
## [1] 27
$$MS_{between} = \frac{SS_{between}}{df_{between}}$$

$$MS_{within} = \frac{SS_{within}}{df_{within}}$$
Calculate 𝑀 𝑆𝑏𝑒𝑡𝑤𝑒𝑒𝑛 and 𝑀 𝑆𝑤𝑖𝑡ℎ𝑖𝑛 and fill in Table 1 (see ending of file).
# mean squares
MSB<-SSB/df_SSB
MSB
## [1] 5.411373
MSW<-SSW/df_SSW
MSW
## [1] 0.9489685
You can think of the F-statistic as a measure of how far we are from the null hypothesis: the bigger F is, the more relevant the effect of the fertilizer. The F-statistic is given by the following formula:

$$F = \frac{MS_{between}}{MS_{within}}$$
# F-statistic (note: assigning to F masks the built-in alias for FALSE)
F<-MSB/MSW
F # the F-statistic
## [1] 5.702374
Once the F-statistic is calculated, we need to know whether it is a significant result or not. Under the equal-means hypothesis, we can compute the probability of obtaining a certain value of F or higher.
If you choose a significance level of e.g. 𝛼 = 0.05 and obtain a p-value lower than 0.05, you will reject the null hypothesis of mean equality and conclude that not all compared means are equal. We will not know which of the group means is different; we just know that not all compared means are equal. In order to know which of the group means differ, we will use a post-hoc test. This will be explained in the next section.
Which p-value did you obtain? What can you conclude?
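The p-value shown below is the upper-tail probability of the F distribution with (df_SSB, df_SSW) degrees of freedom; a sketch, plugging in the values computed above:

```r
F_value <- 5.702374  # F-statistic computed above
df_SSB  <- 2         # between-groups degrees of freedom
df_SSW  <- 27        # within-groups degrees of freedom

# P(F >= F_value) under the null hypothesis
p_val <- pf(F_value, df_SSB, df_SSW, lower.tail = FALSE)
p_val
```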
## [1] 0.008594377
## p_val < 0.05, so we reject the null hypothesis: not all group means are equal.
3 The ANOVA R function
So far we’ve tried to understand how the one-factor analysis of variance works and how we can calculate it manually. From now on, you can use an R function to be faster when you analyse your own data!
First, we need a special format of the data to work with R, called long-format data. We will use the library “tidyr” to re-format our data:
library("tidyr")
fertilizer_data<-gather(data,fertilizer,production,fertilizer1:fertilizer3)
fertilizer_data
## fertilizer production
## 1 fertilizer1 6.27
## 2 fertilizer1 5.36
## 3 fertilizer1 6.39
## 4 fertilizer1 4.85
## 5 fertilizer1 5.99
## 6 fertilizer1 7.14
## 7 fertilizer1 5.08
## 8 fertilizer1 4.07
## 9 fertilizer1 4.35
## 10 fertilizer1 4.95
## 11 fertilizer2 3.07
## 12 fertilizer2 3.29
## 13 fertilizer2 4.04
## 14 fertilizer2 4.19
## 15 fertilizer2 3.41
## 16 fertilizer2 3.75
## 17 fertilizer2 4.87
## 18 fertilizer2 3.94
## 19 fertilizer2 6.28
## 20 fertilizer2 3.15
## 21 fertilizer3 4.04
## 22 fertilizer3 3.79
## 23 fertilizer3 4.56
## 24 fertilizer3 4.55
## 25 fertilizer3 4.55
## 26 fertilizer3 4.53
## 27 fertilizer3 3.53
## 28 fertilizer3 3.71
## 29 fertilizer3 7.00
## 30 fertilizer3 4.61
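As an aside, gather() is superseded in current versions of tidyr; the same long format can be produced with pivot_longer() (assuming tidyr >= 1.0 is installed — note the row order differs from gather(), but the content is identical):

```r
library("tidyr")

fertilizer1<-c(6.27,5.36,6.39,4.85,5.99,7.14,5.08,4.07,4.35,4.95)
fertilizer2<-c(3.07,3.29,4.04,4.19,3.41,3.75,4.87,3.94,6.28,3.15)
fertilizer3<-c(4.04,3.79,4.56,4.55,4.55,4.53,3.53,3.71,7.00,4.61)
data<-data.frame(fertilizer1,fertilizer2,fertilizer3)

# one row per observation, with explicit names for the two new columns
fertilizer_data <- pivot_longer(data,
                                cols      = fertilizer1:fertilizer3,
                                names_to  = "fertilizer",
                                values_to = "production")
fertilizer_data
```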
Once we have the data in the right format, we will use the R function aov(). Let’s see if we completed the ANOVA summary table correctly. Be inspired by the following example:
#EXAMPLE:
# anova.res<-aov(Y~factor,data = data.frame.name)
#
anova_out<-aov(production~fertilizer,fertilizer_data)
# production represents the column with numeric variables
# fertilizer represents the factor
# fertilizer_data is the name of the data frame where you will find the columns
# named production and fertilizer
anova_out
## Call:
## aov(formula = production ~ fertilizer, data = fertilizer_data)
##
## Terms:
## fertilizer Residuals
## Sum of Squares 10.82275 25.62215
## Deg. of Freedom 2 27
##
## Residual standard error: 0.9741502
## Estimated effects may be unbalanced
The output of the aov() function is a table where you will find, in the first row, the 𝑆𝑆𝑏𝑒𝑡𝑤𝑒𝑒𝑛 (first column) and 𝑆𝑆𝑤𝑖𝑡ℎ𝑖𝑛 (also called Residuals, second column). In the second row you’ll find the degrees of freedom of 𝑆𝑆𝑏𝑒𝑡𝑤𝑒𝑒𝑛 (a − 1) and 𝑆𝑆𝑤𝑖𝑡ℎ𝑖𝑛 (an − a). You probably noticed that neither the F-statistic nor the p-value is given in the output. We need to use the function summary() to get both.
summary(anova_out)
library("ggpubr")
ggline(fertilizer_data, x = "fertilizer", y = "production",
add = c("mean_se", "jitter"),
order = c("fertilizer1", "fertilizer2", "fertilizer3"),
ylab = "production", xlab = "fertilizer")
[Figure: line plot of mean production (± SE) per fertilizer group, with jittered individual observations; x axis: fertilizer, y axis: production]
Once a significant result is obtained with the one-way ANOVA, a verification of the assumptions of the model must be made as a diagnosis, since otherwise the conclusions we reach may be incorrect. There are two ways to do it, graphically and through hypothesis testing, and we will always work with the model errors (residuals).
A better graphical way to look at data normality is to perform a QQ plot. A histogram shows the frequencies
of different values in the variable (counts). Depending on how the histogram looks it can be misleading. It’s
better to use the QQ plot. A Q-Q plot shows the mapping between the distribution of the data and the ideal
distribution (the normal distribution in this case). Usually a line is plotted through the quartiles. When the
dots follow the line closely, the data has a normal distribution.
In this case we can use the residuals of the anova_out object obtained previously, which contain the model errors:
#errors: Yi - model_i
residuals(anova_out)
## 1 2 3 4 5 6 7 8 9 10 11
## 0.825 -0.085 0.945 -0.595 0.545 1.695 -0.365 -1.375 -1.095 -0.495 -0.929
## 12 13 14 15 16 17 18 19 20 21 22
## -0.709 0.041 0.191 -0.589 -0.249 0.871 -0.059 2.281 -0.849 -0.447 -0.697
## 23 24 25 26 27 28 29 30
## 0.073 0.063 0.063 0.043 -0.957 -0.777 2.513 0.123
# make a QQ plot
qqnorm(residuals(anova_out), main = "QQ plot")
# add a QQ line
qqline(residuals(anova_out), col=2)
[Figure: QQ plot of the residuals (sample quantiles vs. theoretical quantiles) with a reference line through the quartiles]
The graphical methods for checking data normality in R still leave much to your own interpretation. There’s
much discussion in the statistical world about the meaning of these plots and what can be seen as normal.
If you show any of these plots to ten different statisticians, you can get ten different answers. That’s quite
an achievement when you expect a simple yes or no, but statisticians don’t do simple answers.
On the contrary, everything in statistics revolves around measuring uncertainty. This uncertainty is sum-
marized in a probability — often called a p-value — and to calculate this probability, you need a formal
test.
Probably the most widely used test for normality is the Kolmogorov-Smirnov test; another is the Shapiro-Wilk test. The function to perform the former is called ks.test().
Using the Kolmogorov-Smirnov test, check if the errors are normally distributed.
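A sketch of the call, assuming we compare the residuals against a normal distribution with their own estimated mean and standard deviation (strictly speaking, estimating these parameters from the same data makes the KS p-value only approximate; shapiro.test() avoids this issue):

```r
fertilizer1<-c(6.27,5.36,6.39,4.85,5.99,7.14,5.08,4.07,4.35,4.95)
fertilizer2<-c(3.07,3.29,4.04,4.19,3.41,3.75,4.87,3.94,6.28,3.15)
fertilizer3<-c(4.04,3.79,4.56,4.55,4.55,4.53,3.53,3.71,7.00,4.61)
production <- c(fertilizer1, fertilizer2, fertilizer3)
fertilizer <- factor(rep(c("fertilizer1","fertilizer2","fertilizer3"), each = 10))

anova_out <- aov(production ~ fertilizer)
res <- residuals(anova_out)

# one-sample KS test of the residuals against N(mean(res), sd(res))
# (ks.test will warn about ties in the residuals; the test still runs)
ks_out <- ks.test(res, "pnorm", mean = mean(res), sd = sd(res))
ks_out
```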
##
## One-sample Kolmogorov-Smirnov test
##
## data: residuals(anova_out)
## D = 0.18616, p-value = 0.2496
## alternative hypothesis: two-sided
If the variances are not similar between the groups and the variable is not normally distributed, we cannot use ANOVA directly, and this can affect the conclusions obtained. Are the variances equal between factor levels (homoscedasticity)?
To check this we can apply the Bartlett test.
# Example: bartlett.test(response ~ factor, data = data.frame.name)
bartlett.test(production~fertilizer,fertilizer_data)
##
## Bartlett test of homogeneity of variances
##
## data: production by fertilizer
## Bartlett's K-squared = 0.00017048, df = 2, p-value = 0.9999
Remember what to do when the model errors are not normal and/or not homoscedastic? We try to transform the data so that it fits a normal distribution. We will apply the log10 transformation to our data: use the log10() function and calculate the ANOVA of the newly transformed data (OPTIONAL!). In the case of heteroscedasticity (or persistent non-normality) we can apply an alternative to ANOVA, like the Kruskal–Wallis one-way analysis of variance.
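A sketch of both options (the log10 re-analysis and the Kruskal–Wallis alternative), rebuilt from the fertilizer data above:

```r
fertilizer1<-c(6.27,5.36,6.39,4.85,5.99,7.14,5.08,4.07,4.35,4.95)
fertilizer2<-c(3.07,3.29,4.04,4.19,3.41,3.75,4.87,3.94,6.28,3.15)
fertilizer3<-c(4.04,3.79,4.56,4.55,4.55,4.53,3.53,3.71,7.00,4.61)
production <- c(fertilizer1, fertilizer2, fertilizer3)
fertilizer <- factor(rep(c("fertilizer1","fertilizer2","fertilizer3"), each = 10))

# option 1: log10-transform the response and re-run the ANOVA
anova_log <- aov(log10(production) ~ fertilizer)
summary(anova_log)

# option 2: rank-based alternative that does not assume normality
kw_out <- kruskal.test(production ~ fertilizer)
kw_out
```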
# post-hoc: pairwise t-tests with Bonferroni correction
pairwise.t.test(fertilizer_data$production,fertilizer_data$fertilizer,p.adjust.method = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: fertilizer_data$production and fertilizer_data$fertilizer
##
## fertilizer1 fertilizer2
## fertilizer2 0.0078 -
## fertilizer3 0.1099 0.8175
##
## P value adjustment method: bonferroni
Assess normality and homogeneity of variance using a box plot of species diversity against zinc group. Test the 𝐻0 that the population group means are all equal, and perform the analysis of variance of species diversity versus zinc-level groups. What is your conclusion?
Perform a post-hoc test to investigate the pairwise mean differences between all groups.
Is there only a difference in population means between the HIGH and LOW zinc groups? Explain.
Check the normality and homoscedasticity of the previous model and, in case of abnormalities, propose a solution.
We will build confidence intervals for the three pairwise differences:
𝜇𝑓𝑒𝑟𝑡𝑖𝑙𝑖𝑧𝑒𝑟1 − 𝜇𝑓𝑒𝑟𝑡𝑖𝑙𝑖𝑧𝑒𝑟2
𝜇𝑓𝑒𝑟𝑡𝑖𝑙𝑖𝑧𝑒𝑟1 − 𝜇𝑓𝑒𝑟𝑡𝑖𝑙𝑖𝑧𝑒𝑟3
𝜇𝑓𝑒𝑟𝑡𝑖𝑙𝑖𝑧𝑒𝑟2 − 𝜇𝑓𝑒𝑟𝑡𝑖𝑙𝑖𝑧𝑒𝑟3
The confidence interval for each pairwise difference is

$$(\bar{y}_i - \bar{y}_j) \pm t_{\alpha/2} \times SE(\bar{y}_i - \bar{y}_j)$$

where $\bar{y}_i$ is the mean of group 𝑖, 𝑡𝛼/2 is the t-value for a given confidence level 1 − 𝛼 (usually 0.95), and SE is the standard error of the difference of means. Remember to adjust the type I error for multiple testing (0.05/3/2 = 0.008) and to use the appropriate SE of the difference of means:

$$SE(\bar{y}_i - \bar{y}_j) = \sqrt{MS_{within} \times \left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

where 𝑛𝑖 and 𝑛𝑗 are the sample sizes of the respective groups 𝑖 and 𝑗.
Important! When calculating 𝑡𝛼/2 , the degrees of freedom are those of 𝑀𝑆𝑤𝑖𝑡ℎ𝑖𝑛 , which are (𝑎𝑛 − 𝑎).
Can you calculate the CI for each of the pairwise comparisons? To get the t-value you can use the R function
qt(p=, df=), see the following example:
# qt(0.975, df = an - a) would be the unadjusted t-value
t<-qt(1-0.05/2/3,df=10*3-3)
# where t is the value from the t distribution for N - a degrees of freedom
# (df = n*a - a) and a 1 - alpha/(2*a) confidence level
# a = number of groups (here also the number of pairwise comparisons)
high_CI_f1f2<-(mean_f1-mean_f2)+(t*sqrt(MSW*(1/10+1/10)))
low_CI_f1f2<-(mean_f1-mean_f2)-(t*sqrt(MSW*(1/10+1/10)))
c(low_CI_f1f2,high_CI_f1f2)
high_CI_f1f3<-(mean_f1-mean_f3)+(t*sqrt(MSW*(1/10+1/10)))
low_CI_f1f3<-(mean_f1-mean_f3)-(t*sqrt(MSW*(1/10+1/10)))
c(low_CI_f1f3,high_CI_f1f3)
high_CI_f2f3<-(mean_f2-mean_f3)+(t*sqrt(MSW*(1/10+1/10)))
low_CI_f2f3<-(mean_f2-mean_f3)-(t*sqrt(MSW*(1/10+1/10)))
c(low_CI_f2f3,high_CI_f2f3)
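As a cross-check on the manual intervals, TukeyHSD() produces simultaneous confidence intervals for the same three comparisons (it uses the studentized range rather than a Bonferroni t-value, so the intervals will be similar but not identical):

```r
fertilizer1<-c(6.27,5.36,6.39,4.85,5.99,7.14,5.08,4.07,4.35,4.95)
fertilizer2<-c(3.07,3.29,4.04,4.19,3.41,3.75,4.87,3.94,6.28,3.15)
fertilizer3<-c(4.04,3.79,4.56,4.55,4.55,4.53,3.53,3.71,7.00,4.61)
production <- c(fertilizer1, fertilizer2, fertilizer3)
fertilizer <- factor(rep(c("fertilizer1","fertilizer2","fertilizer3"), each = 10))

anova_out <- aov(production ~ fertilizer)
tk <- TukeyHSD(anova_out, conf.level = 0.95)
tk  # columns: difference, lower and upper CI bound, adjusted p-value
```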